Message-ID: <f004c237-735d-4ff8-a6c9-2dc25c32637c@linux.ibm.com>
Date: Fri, 27 Dec 2024 16:31:48 +0530
From: Donet Tom <donettom@...ux.ibm.com>
To: Gregory Price <gourry@...rry.net>, linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org, nehagholkar@...a.com, abhishekd@...a.com,
kernel-team@...a.com, david@...hat.com, nphamcs@...il.com,
akpm@...ux-foundation.org, hannes@...xchg.org, kbusch@...a.com,
ying.huang@...ux.alibaba.com
Subject: Re: [RFC v2 PATCH 5/5] migrate,sysfs: add pagecache promotion
On 12/11/24 03:07, Gregory Price wrote:
> adds /sys/kernel/mm/numa/pagecache_promotion_enabled
>
> When page cache lands on lower tiers, there is no way for promotion
> to occur unless it becomes memory-mapped and exposed to NUMA hint
> faults. Just adding a mechanism to promote pages unconditionally,
> however, opens up a significant possibility of performance regressions.
>
> Similar to the `demotion_enabled` sysfs entry, provide a sysfs toggle
> to enable and disable page cache promotion. This option will enable
> opportunistic promotion of unmapped page cache during syscall access.
>
> This option is intended for operational conditions where demoted page
> cache will eventually contain memory which becomes hot, and where
> said memory is likely to cause performance issues due to being
> trapped on the lower tier of memory.
>
> A page cache folio is considered a promotion candidate when:
> 0) tiering and pagecache-promotion are enabled
> 1) the folio resides on a node not in the top tier
> 2) the folio is already marked referenced and active
> 3) multiple accesses in the (referenced & active) state occur quickly
>
> Since promotion is not safe to execute unconditionally from within
> folio_mark_accessed, we defer promotion to a new task_work captured
> in the task_struct. This ensures that the task doing the access has
> some hand in promoting pages, even among deduplicated read-only files.
>
> We use numa_hint_fault_latency to help identify when a folio is accessed
> multiple times in a short period. Along with folio flag checks, this
> helps us minimize promoting pages on the first few accesses.
>
> The promotion node is always the local node of the promoting cpu.
>
> Suggested-by: Johannes Weiner <hannes@...xchg.org>
> Signed-off-by: Gregory Price <gourry@...rry.net>
> ---
> .../ABI/testing/sysfs-kernel-mm-numa | 20 +++++++
> include/linux/memory-tiers.h | 2 +
> include/linux/migrate.h | 2 +
> include/linux/sched.h | 3 +
> include/linux/sched/numa_balancing.h | 5 ++
> init/init_task.c | 1 +
> kernel/sched/fair.c | 26 +++++++-
> mm/memory-tiers.c | 27 +++++++++
> mm/migrate.c | 59 +++++++++++++++++++
> mm/swap.c | 3 +
> 10 files changed, 147 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> index 77e559d4ed80..b846e7d80cba 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> @@ -22,3 +22,23 @@ Description: Enable/disable demoting pages during reclaim
> the guarantees of cpusets. This should not be enabled
> on systems which need strict cpuset location
> guarantees.
> +
> +What: /sys/kernel/mm/numa/pagecache_promotion_enabled
> +Date: November 2024
> +Contact: Linux memory management mailing list <linux-mm@...ck.org>
> +Description: Enable/disable promoting pages during file access
> +
> + Page migration during file access is intended for systems
> + with tiered memory configurations that have significant
> + unmapped file cache usage. By default, file cache memory
> + on slower tiers will not be opportunistically promoted by
> + normal NUMA hint faults, because the system has no way to
> + track them. This option enables opportunistic promotion
> + of pages that are accessed via syscall (e.g. read/write)
> + if multiple accesses occur in quick succession.
> +
> + It may move data to a NUMA node that does not fall into
> + the cpuset of the allocating process, which might be
> + construed to violate the guarantees of cpusets. This
> + should not be enabled on systems which need strict cpuset
> + location guarantees.
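
A side note for anyone who wants to try this out: I assume the intended
setup is to first enable tiering mode, e.g.
"echo 2 > /proc/sys/kernel/numa_balancing" for
NUMA_BALANCING_MEMORY_TIERING, and then
"echo true > /sys/kernel/mm/numa/pagecache_promotion_enabled". The
store helper below uses kstrtobool(), so "1", "y" or "true" should all
be accepted.
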
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 0dc0cf2863e2..fa96a67b8996 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -37,6 +37,7 @@ struct access_coordinate;
>
> #ifdef CONFIG_NUMA
> extern bool numa_demotion_enabled;
> +extern bool numa_pagecache_promotion_enabled;
> extern struct memory_dev_type *default_dram_type;
> extern nodemask_t default_dram_nodes;
> struct memory_dev_type *alloc_memory_type(int adistance);
> @@ -76,6 +77,7 @@ static inline bool node_is_toptier(int node)
> #else
>
> #define numa_demotion_enabled false
> +#define numa_pagecache_promotion_enabled false
> #define default_dram_type NULL
> #define default_dram_nodes NODE_MASK_NONE
> /*
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 29919faea2f1..cf58a97d4216 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -145,6 +145,7 @@ const struct movable_operations *page_movable_ops(struct page *page)
> int migrate_misplaced_folio_prepare(struct folio *folio,
> struct vm_area_struct *vma, int node);
> int migrate_misplaced_folio(struct folio *folio, int node);
> +void promotion_candidate(struct folio *folio);
> #else
> static inline int migrate_misplaced_folio_prepare(struct folio *folio,
> struct vm_area_struct *vma, int node)
> @@ -155,6 +156,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
> {
> return -EAGAIN; /* can't migrate now */
> }
> +static inline void promotion_candidate(struct folio *folio) { }
> #endif /* CONFIG_NUMA_BALANCING */
>
> #ifdef CONFIG_MIGRATION
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d380bffee2ef..faa84fb7a756 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1356,6 +1356,9 @@ struct task_struct {
> unsigned long numa_faults_locality[3];
>
> unsigned long numa_pages_migrated;
> +
> + struct callback_head numa_promo_work;
> + struct list_head promo_list;
> #endif /* CONFIG_NUMA_BALANCING */
>
> #ifdef CONFIG_RSEQ
> diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
> index 52b22c5c396d..cc7750d754ff 100644
> --- a/include/linux/sched/numa_balancing.h
> +++ b/include/linux/sched/numa_balancing.h
> @@ -32,6 +32,7 @@ extern void set_numabalancing_state(bool enabled);
> extern void task_numa_free(struct task_struct *p, bool final);
> bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> int src_nid, int dst_cpu);
> +int numa_hint_fault_latency(struct folio *folio);
> #else
> static inline void task_numa_fault(int last_node, int node, int pages,
> int flags)
> @@ -52,6 +53,10 @@ static inline bool should_numa_migrate_memory(struct task_struct *p,
> {
> return true;
> }
> +static inline int numa_hint_fault_latency(struct folio *folio)
> +{
> + return 0;
> +}
> #endif
>
> #endif /* _LINUX_SCHED_NUMA_BALANCING_H */
> diff --git a/init/init_task.c b/init/init_task.c
> index e557f622bd90..f831980748c4 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -187,6 +187,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
> .numa_preferred_nid = NUMA_NO_NODE,
> .numa_group = NULL,
> .numa_faults = NULL,
> + .promo_list = LIST_HEAD_INIT(init_task.promo_list),
> #endif
> #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
> .kasan_depth = 1,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a59ae2e23daf..047f02091773 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -42,6 +42,7 @@
> #include <linux/interrupt.h>
> #include <linux/memory-tiers.h>
> #include <linux/mempolicy.h>
> +#include <linux/migrate.h>
> #include <linux/mutex_api.h>
> #include <linux/profile.h>
> #include <linux/psi.h>
> @@ -1842,7 +1843,7 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat)
> * The smaller the hint page fault latency, the higher the possibility
> * for the page to be hot.
> */
> -static int numa_hint_fault_latency(struct folio *folio)
> +int numa_hint_fault_latency(struct folio *folio)
> {
> int last_time, time;
>
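(For context, the rest of this helper, which is not changed by this
patch, is roughly the following, if I am reading fair.c correctly:

	time = jiffies_to_msecs(jiffies);
	last_time = folio_xchg_access_time(folio, time);

	return (time - last_time) & PAGE_ACCESS_TIME_MASK;

so the return value is roughly the number of milliseconds since the
folio's recorded access time.)
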
> @@ -3534,6 +3535,27 @@ static void task_numa_work(struct callback_head *work)
> }
> }
>
> +static void task_numa_promotion_work(struct callback_head *work)
> +{
> + struct task_struct *p = current;
> + struct list_head *promo_list = &p->promo_list;
> + struct folio *folio, *tmp;
> + int nid = numa_node_id();
> +
> + SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_promo_work));
> +
> + work->next = work;
> +
> + if (list_empty(promo_list))
> + return;
> +
> + list_for_each_entry_safe(folio, tmp, promo_list, lru) {
> + list_del_init(&folio->lru);
> + migrate_misplaced_folio(folio, nid);
> + }
> +}
> +
> +
> void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
> {
> int mm_users = 0;
> @@ -3558,8 +3580,10 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
> RCU_INIT_POINTER(p->numa_group, NULL);
> p->last_task_numa_placement = 0;
> p->last_sum_exec_runtime = 0;
> + INIT_LIST_HEAD(&p->promo_list);
>
> init_task_work(&p->numa_work, task_numa_work);
> + init_task_work(&p->numa_promo_work, task_numa_promotion_work);
>
> /* New address space, reset the preferred nid */
> if (!(clone_flags & CLONE_VM)) {
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index fc14fe53e9b7..4c44598e485e 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -935,6 +935,7 @@ static int __init memory_tier_init(void)
> subsys_initcall(memory_tier_init);
>
> bool numa_demotion_enabled = false;
> +bool numa_pagecache_promotion_enabled;
>
> #ifdef CONFIG_MIGRATION
> #ifdef CONFIG_SYSFS
> @@ -957,11 +958,37 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
> return count;
> }
>
> +static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + char *buf)
> +{
> + return sysfs_emit(buf, "%s\n",
> + numa_pagecache_promotion_enabled ? "true" : "false");
> +}
> +
> +static ssize_t pagecache_promotion_enabled_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + ssize_t ret;
> +
> + ret = kstrtobool(buf, &numa_pagecache_promotion_enabled);
> + if (ret)
> + return ret;
> +
> + return count;
> +}
> +
> +
> static struct kobj_attribute numa_demotion_enabled_attr =
> __ATTR_RW(demotion_enabled);
>
> +static struct kobj_attribute numa_pagecache_promotion_enabled_attr =
> + __ATTR_RW(pagecache_promotion_enabled);
> +
> static struct attribute *numa_attrs[] = {
> &numa_demotion_enabled_attr.attr,
> + &numa_pagecache_promotion_enabled_attr.attr,
> NULL,
> };
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index af07b399060b..320258a1aaba 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -44,6 +44,8 @@
> #include <linux/sched/sysctl.h>
> #include <linux/memory-tiers.h>
> #include <linux/pagewalk.h>
> +#include <linux/sched/numa_balancing.h>
> +#include <linux/task_work.h>
>
> #include <asm/tlbflush.h>
>
> @@ -2710,5 +2712,62 @@ int migrate_misplaced_folio(struct folio *folio, int node)
> BUG_ON(!list_empty(&migratepages));
> return nr_remaining ? -EAGAIN : 0;
> }
> +
> +/**
> + * promotion_candidate() - report a promotion candidate folio
> + *
> + * @folio: The folio reported as a candidate
> + *
> + * Records folio access time and places the folio on the task promotion list
> + * if access time is less than the threshold. The folio will be isolated from
> + * LRU if selected, and task_work will putback the folio on promotion failure.
> + *
> + * If selected, takes a folio reference to be released in task work.
> + */
> +void promotion_candidate(struct folio *folio)
> +{
> + struct task_struct *task = current;
> + struct list_head *promo_list = &task->promo_list;
> + struct callback_head *work = &task->numa_promo_work;
> + struct address_space *mapping = folio_mapping(folio);
> + bool write = mapping ? mapping->gfp_mask & __GFP_WRITE : false;
> + int nid = folio_nid(folio);
> + int flags, last_cpupid;
> +
> + /*
> + * Only do this work if:
> + * 1) tiering and pagecache promotion are enabled
> + * 2) the page can actually be promoted
> + * 3) The hint-fault latency is relatively hot
> + * 4) the folio is not already isolated
> + * 5) This is not a kernel thread context
> + */
> + if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
> + !numa_pagecache_promotion_enabled ||
> + node_is_toptier(nid) ||
> + numa_hint_fault_latency(folio) >= PAGE_ACCESS_TIME_MASK ||
> + folio_test_isolated(folio) ||
> + (current->flags & PF_KTHREAD)) {
> + return;
> + }
> +
> + nid = numa_migrate_check(folio, NULL, 0, &flags, write, &last_cpupid);
> + if (nid == NUMA_NO_NODE)
> + return;
> +
> + if (migrate_misplaced_folio_prepare(folio, NULL, nid))
> + return;
> +
> + /*
> + * Ensure task can schedule work, otherwise we'll leak folios.
> + * If the list is not empty, task work has already been scheduled.
> + */
> + if (list_empty(promo_list) && task_work_add(task, work, TWA_RESUME)) {
> + folio_putback_lru(folio);
> + return;
> + }
> + list_add(&folio->lru, promo_list);
> +}
> +EXPORT_SYMBOL(promotion_candidate);
> #endif /* CONFIG_NUMA_BALANCING */
> #endif /* CONFIG_NUMA */
> diff --git a/mm/swap.c b/mm/swap.c
> index 320b959b74c6..57909c349388 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -37,6 +37,7 @@
> #include <linux/page_idle.h>
> #include <linux/local_lock.h>
> #include <linux/buffer_head.h>
> +#include <linux/migrate.h>
>
> #include "internal.h"
>
> @@ -469,6 +470,8 @@ void folio_mark_accessed(struct folio *folio)
> __lru_cache_activate_folio(folio);
> folio_clear_referenced(folio);
> workingset_activation(folio);
> + } else {
> + promotion_candidate(folio);
> }
> if (folio_test_idle(folio))
> folio_clear_idle(folio);

In the current implementation, promotion will not work if we enable
MGLRU, right? folio_mark_accessed() returns early on the lru_gen path
before it ever reaches this new else branch, so promotion_candidate()
is never called.

Is there any specific reason we are not enabling promotion with MGLRU?
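
For reference, this is the early return I mean, at the top of
folio_mark_accessed() in mm/swap.c (quoting roughly from memory, so
please double-check the exact code):

	if (lru_gen_enabled()) {
		folio_inc_refs(folio);
		return;
	}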