Message-ID: <CAJD7tkbksw1GquBout-JnLg=kCXmK6OUTvpu+8v7eyKQFUqsyQ@mail.gmail.com>
Date: Wed, 4 Sep 2024 14:29:08 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Davidlohr Bueso <dave@...olabs.net>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org, mhocko@...nel.org,
rientjes@...gle.com, hannes@...xchg.org, almasrymina@...gle.com,
roman.gushchin@...ux.dev, gthelen@...gle.com, dseo3@....edu,
a.manzanares@...sung.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH -next] mm: introduce per-node proactive reclaim interface
On Wed, Sep 4, 2024 at 9:28 AM Davidlohr Bueso <dave@...olabs.net> wrote:
>
> This adds support for proactive reclaim in general on a NUMA
> system. A per-node interface extends support beyond the
> memcg-specific interface, while respecting the current semantics
> of memory.reclaim: honoring LRU aging and not artificially
> triggering eviction on nodes belonging to non-bottom tiers.
>
> This patch allows userspace to do:
>
> echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim
>
> One of the premises for this is to align as closely as possible
> with the semantics of memory.reclaim. For a brief time memcg did
> support a nodemask, until 55ab834a86a9 (Revert "mm: add nodes=
> arg to memory.reclaim"), as the semantics around reclaim
> (eviction) vs demotion were not clear, breaking charging
> expectations.
>
> With this approach:
>
> 1. Users who do not use memcg can benefit from proactive reclaim.
>
> 2. Proactive reclaim on top tiers will trigger demotion, after which
> the memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger eviction to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore, per-node
> proactive reclaim is not as susceptible to the memcg charging problem
> mentioned above.
>
> 3. Unlike memcg, there should be no surprise of callers expecting
> reclaim but instead getting a demotion. This essentially relies on
> the behavior of shrink_folio_list() after 6b426d071419 (mm: disable
> top-tier fallback to reclaim on proactive reclaim), without the
> expectations of try_to_free_mem_cgroup_pages().
>
> 4. Unlike the nodes= arg, this interface avoids confusing semantics,
> such as what exactly the user wants when mixing top-tier and low-tier
> nodes in the nodemask. Further, a per-node interface is less exposed
> to "free up memory in my container" usecases, where eviction is
> intended.
>
> 5. Users that *really* want to free up memory can use proactive
> reclaim on nodes known to be on the bottom tiers to force eviction
> in a natural way - higher access latencies are still better than
> swap. If compelled, though with no guarantees and perhaps not worth
> the effort, users could also potentially follow a ladder-like
> approach to eventually free up the memory. Alternatively, perhaps
> an 'evict' option could be added to the parameters for both
> memory.reclaim and the per-node interface to force this action
> unconditionally.
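
For the ladder-like approach in (5), I imagine userspace doing
something like the below (a rough sketch; the node IDs and tier
ordering are made up, nothing here is part of the patch):

#include <stdio.h>

/* Write a reclaim request to a node's sysfs reclaim file. */
static int node_reclaim(int nid, const char *request)
{
	char path[64];
	FILE *f;
	int ret = 0;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/reclaim", nid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(request, f) < 0)
		ret = -1;
	if (fclose(f))	/* write errors (e.g. EAGAIN) surface here */
		ret = -1;
	return ret;
}

int main(void)
{
	/* Assuming node 0 is top tier (DRAM), node 1 is bottom tier. */
	node_reclaim(0, "512M");	/* demotes from node 0 to node 1 */
	node_reclaim(1, "512M");	/* evicts from node 1 to swap */
	return 0;
}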
>
> Signed-off-by: Davidlohr Bueso <dave@...olabs.net>
> ---
>
> This topic has been brought up in the past without much resolution.
> But today, I believe a number of semantics and expectations have become
> clearer (per the changelog), which could merit revisiting this.
>
> Documentation/ABI/stable/sysfs-devices-node | 11 ++
> drivers/base/node.c | 2 +
> include/linux/swap.h | 16 ++
> mm/vmscan.c | 154 ++++++++++++++++----
> 4 files changed, 156 insertions(+), 27 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 402af4b2b905..5d69ee956cf9 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -221,3 +221,14 @@ Contact: Jiaqi Yan <jiaqiyan@...gle.com>
> Description:
> Of the raw poisoned pages on a NUMA node, how many pages are
> recovered by memory error recovery attempt.
> +
> +What: /sys/devices/system/node/nodeX/reclaim
> +Date: September 2024
> +Contact: Linux Memory Management list <linux-mm@...ck.org>
> +Description:
> + This is a write-only, nested-keyed file which accepts the
> + number of bytes to reclaim as well as the swappiness to use
> + for this particular operation. Writing induces memory reclaim
> + on this node. On success, the specified amount of memory (or
> + more) will have been reclaimed; -EAGAIN is returned if fewer
> + bytes are reclaimed than requested.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index eb72580288e6..d8ed19f8565b 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -626,6 +626,7 @@ static int register_node(struct node *node, int num)
> } else {
> hugetlb_register_node(node);
> compaction_register_node(node);
> + reclaim_register_node(node);
> }
>
> return error;
> @@ -642,6 +643,7 @@ void unregister_node(struct node *node)
> {
> hugetlb_unregister_node(node);
> compaction_unregister_node(node);
> + reclaim_unregister_node(node);
> node_remove_accesses(node);
> node_remove_caches(node);
> device_unregister(&node->dev);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 248db1dd7812..456e3aedb964 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -423,6 +423,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
> extern int vm_swappiness;
> long remove_mapping(struct address_space *mapping, struct folio *folio);
>
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +extern int reclaim_register_node(struct node *node);
> +extern void reclaim_unregister_node(struct node *node);
> +
> +#else
> +
> +static inline int reclaim_register_node(struct node *node)
> +{
> + return 0;
> +}
> +
> +static inline void reclaim_unregister_node(struct node *node)
> +{
> +}
> +#endif /* CONFIG_SYSFS && CONFIG_NUMA */
> +
> #ifdef CONFIG_NUMA
> extern int node_reclaim_mode;
> extern int sysctl_min_unmapped_ratio;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5dc96a843466..56ddf54366e4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -56,6 +56,7 @@
> #include <linux/khugepaged.h>
> #include <linux/rculist_nulls.h>
> #include <linux/random.h>
> +#include <linux/parser.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -92,10 +93,8 @@ struct scan_control {
> unsigned long anon_cost;
> unsigned long file_cost;
>
> -#ifdef CONFIG_MEMCG
> /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> int *proactive_swappiness;
> -#endif
>
> /* Can active folios be deactivated as part of reclaim? */
> #define DEACTIVATE_ANON 1
> @@ -266,6 +265,9 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>
> static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> {
> + if (sc->proactive && sc->proactive_swappiness)
> + return *sc->proactive_swappiness;
> +
This code is already upstream, right?
> return READ_ONCE(vm_swappiness);
> }
> #endif
> @@ -7470,36 +7472,28 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
> /*
> * Try to free up some pages from this node through reclaim.
> */
> -static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
> +static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
> + unsigned long nr_pages, struct scan_control *sc)
> {
> - /* Minimum pages needed in order to stay on node */
> - const unsigned long nr_pages = 1 << order;
> struct task_struct *p = current;
> unsigned int noreclaim_flag;
> - struct scan_control sc = {
> - .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> - .gfp_mask = current_gfp_context(gfp_mask),
> - .order = order,
> - .priority = NODE_RECLAIM_PRIORITY,
> - .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> - .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> - .may_swap = 1,
> - .reclaim_idx = gfp_zone(gfp_mask),
> - };
> unsigned long pflags;
>
> - trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
> - sc.gfp_mask);
> + trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, sc->order,
> + sc->gfp_mask);
>
> cond_resched();
> - psi_memstall_enter(&pflags);
> +
> + if (!sc->proactive)
> + psi_memstall_enter(&pflags);
> +
> delayacct_freepages_start();
> - fs_reclaim_acquire(sc.gfp_mask);
> + fs_reclaim_acquire(sc->gfp_mask);
> /*
> * We need to be able to allocate from the reserves for RECLAIM_UNMAP
> */
> noreclaim_flag = memalloc_noreclaim_save();
> - set_task_reclaim_state(p, &sc.reclaim_state);
> + set_task_reclaim_state(p, &sc->reclaim_state);
>
> if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
> node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) > pgdat->min_slab_pages) {
> @@ -7508,24 +7502,38 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
> * priorities until we have enough memory freed.
> */
> do {
> - shrink_node(pgdat, &sc);
> - } while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
> + shrink_node(pgdat, sc);
> + } while (sc->nr_reclaimed < nr_pages && --sc->priority >= 0);
> }
>
> set_task_reclaim_state(p, NULL);
> memalloc_noreclaim_restore(noreclaim_flag);
> - fs_reclaim_release(sc.gfp_mask);
> - psi_memstall_leave(&pflags);
> + fs_reclaim_release(sc->gfp_mask);
> delayacct_freepages_end();
>
> - trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed);
> + if (!sc->proactive)
> + psi_memstall_leave(&pflags);
> +
> + trace_mm_vmscan_node_reclaim_end(sc->nr_reclaimed);
>
> - return sc.nr_reclaimed >= nr_pages;
> + return sc->nr_reclaimed;
> }
>
> int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
> {
> int ret;
> + /* Minimum pages needed in order to stay on node */
> + const unsigned long nr_pages = 1 << order;
> + struct scan_control sc = {
> + .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> + .gfp_mask = current_gfp_context(gfp_mask),
> + .order = order,
> + .priority = NODE_RECLAIM_PRIORITY,
> + .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> + .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> + .may_swap = 1,
> + .reclaim_idx = gfp_zone(gfp_mask),
> + };
>
> /*
> * Node reclaim reclaims unmapped file backed pages and
> @@ -7560,7 +7568,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
> if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
> return NODE_RECLAIM_NOSCAN;
>
> - ret = __node_reclaim(pgdat, gfp_mask, order);
> + ret = __node_reclaim(pgdat, gfp_mask, nr_pages, &sc) >= nr_pages;
> clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
>
> if (ret)
> @@ -7617,3 +7625,95 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
> }
> }
> EXPORT_SYMBOL_GPL(check_move_unevictable_folios);
> +
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +
> +enum {
> + MEMORY_RECLAIM_SWAPPINESS = 0,
> + MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> + { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> + { MEMORY_RECLAIM_NULL, NULL },
> +};
> +
> +static ssize_t reclaim_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int nid = dev->id;
> + gfp_t gfp_mask = GFP_KERNEL;
> + struct pglist_data *pgdat = NODE_DATA(nid);
> + unsigned long nr_to_reclaim, nr_reclaimed = 0;
> + unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> + int swappiness = -1;
> + char *old_buf, *start;
> + substring_t args[MAX_OPT_ARGS];
> + struct scan_control sc = {
> + .gfp_mask = current_gfp_context(gfp_mask),
> + .reclaim_idx = gfp_zone(gfp_mask),
> + .priority = DEF_PRIORITY,
> + .may_writepage = !laptop_mode,
> + .may_unmap = 1,
> + .may_swap = 1,
> + .proactive = 1,
> + };
> +
> + buf = strstrip((char *)buf);
> +
> + old_buf = (char *)buf;
> + nr_to_reclaim = memparse(buf, (char **)&buf) / PAGE_SIZE;
> + if (buf == old_buf)
> + return -EINVAL;
> +
> + buf = strstrip((char *)buf);
> +
> + while ((start = strsep((char **)&buf, " ")) != NULL) {
> + if (!strlen(start))
> + continue;
> + switch (match_token(start, tokens, args)) {
> + case MEMORY_RECLAIM_SWAPPINESS:
> + if (match_int(&args[0], &swappiness))
> + return -EINVAL;
> + if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> + return -EINVAL;
> + break;
> + default:
> + return -EINVAL;
> + }
> + }
> +
> + sc.nr_to_reclaim = max(nr_to_reclaim, SWAP_CLUSTER_MAX);
> + while (nr_reclaimed < nr_to_reclaim) {
> + unsigned long reclaimed;
> +
> + if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
> + return -EAGAIN;
Can the PGDAT_RECLAIM_LOCKED check be moved into __node_reclaim()?
It is duplicated in node_reclaim().
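Something like this untested helper is what I have in mind (the
name node_reclaim_locked() is made up):

static long node_reclaim_locked(struct pglist_data *pgdat, gfp_t gfp_mask,
				unsigned long nr_pages,
				struct scan_control *sc)
{
	long reclaimed;

	/* Only one reclaimer at a time per node. */
	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
		return -EAGAIN;

	reclaimed = __node_reclaim(pgdat, gfp_mask, nr_pages, sc);
	clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);

	return reclaimed;
}

node_reclaim() could then translate -EAGAIN into NODE_RECLAIM_NOSCAN.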
> +
> + /* does cond_resched() */
> + reclaimed = __node_reclaim(pgdat, gfp_mask,
> + nr_to_reclaim - nr_reclaimed, &sc);
> +
> + clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
> +
> + if (!reclaimed && !nr_retries--)
> + break;
> +
> + nr_reclaimed += reclaimed;
> + }
In the memcg code (i.e. memory_reclaim()) we also check for pending
signals and drain the LRUs before the last iteration. Do we need this
here as well?
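For reference, the loop there does roughly the following; I would
expect the same shape here (untested in this context, and with the
PGDAT_RECLAIM_LOCKED handling omitted for brevity):

	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long reclaimed;

		/* Bail out if userspace lost interest. */
		if (signal_pending(current))
			return -EINTR;

		/*
		 * Final attempt: drain the percpu lru caches in the
		 * hope of introducing more evictable pages.
		 */
		if (!nr_retries)
			lru_add_drain_all();

		reclaimed = __node_reclaim(pgdat, gfp_mask,
					   nr_to_reclaim - nr_reclaimed, &sc);
		if (!reclaimed && !nr_retries--)
			break;

		nr_reclaimed += reclaimed;
	}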
This leads to my next question: there is a lot of common code with
memory_reclaim(). Should we refactor some of it? At least the
argument-parsing part looks almost identical.
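For example, the size/swappiness parsing could live in a shared
helper along these lines (just a sketch lifted from this patch's
parsing code; the helper name is made up):

static int reclaim_parse_args(char *buf, unsigned long *nr_to_reclaim,
			      int *swappiness)
{
	substring_t args[MAX_OPT_ARGS];
	char *old_buf, *start;

	buf = strstrip(buf);
	old_buf = buf;
	*nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
	if (buf == old_buf)
		return -EINVAL;

	*swappiness = -1;
	buf = strstrip(buf);

	while ((start = strsep(&buf, " ")) != NULL) {
		if (!strlen(start))
			continue;
		switch (match_token(start, tokens, args)) {
		case MEMORY_RECLAIM_SWAPPINESS:
			if (match_int(&args[0], swappiness))
				return -EINVAL;
			if (*swappiness < MIN_SWAPPINESS ||
			    *swappiness > MAX_SWAPPINESS)
				return -EINVAL;
			break;
		default:
			return -EINVAL;
		}
	}

	return 0;
}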
> +
> + return nr_reclaimed < nr_to_reclaim ? -EAGAIN : count;
> +}
> +
> +static DEVICE_ATTR_WO(reclaim);
> +int reclaim_register_node(struct node *node)
> +{
> + return device_create_file(&node->dev, &dev_attr_reclaim);
> +}
> +
> +void reclaim_unregister_node(struct node *node)
> +{
> + return device_remove_file(&node->dev, &dev_attr_reclaim);
> +}
> +#endif
> --
> 2.39.2
>