Message-ID: <20251003133818.000017af@huawei.com>
Date: Fri, 3 Oct 2025 13:38:18 +0100
From: Jonathan Cameron <jonathan.cameron@...wei.com>
To: Bharata B Rao <bharata@....com>
CC: <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
<dave.hansen@...el.com>, <gourry@...rry.net>, <hannes@...xchg.org>,
<mgorman@...hsingularity.net>, <mingo@...hat.com>, <peterz@...radead.org>,
<raghavendra.kt@....com>, <riel@...riel.com>, <rientjes@...gle.com>,
<sj@...nel.org>, <weixugc@...gle.com>, <willy@...radead.org>,
<ying.huang@...ux.alibaba.com>, <ziy@...dia.com>, <dave@...olabs.net>,
<nifan.cxl@...il.com>, <xuezhengchu@...wei.com>, <yiannis@...corp.com>,
<akpm@...ux-foundation.org>, <david@...hat.com>, <byungchul@...com>,
<kinseyho@...gle.com>, <joshua.hahnjy@...il.com>, <yuanchu@...gle.com>,
<balbirs@...dia.com>, <alok.rathore@...sung.com>
Subject: Re: [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from
NUMAB=2 to kpromoted
On Wed, 10 Sep 2025 20:16:53 +0530
Bharata B Rao <bharata@....com> wrote:
> Currently hot page promotion (NUMA_BALANCING_MEMORY_TIERING
> mode of NUMA Balancing) does hot page detection (via hint faults),
> hot page classification and eventual promotion, all by itself and
> sits within the scheduler.
>
> With the new hot page tracking and promotion mechanism being
> available, NUMA Balancing can limit itself to detection of
> hot pages (via hint faults) and off-load rest of the
> functionality to the common hot page tracking system.
>
> pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the
> hot page info. In addition, the migration rate limiting and
> dynamic threshold logic are moved to kpromoted so that the same
> can be used for hot pages reported by other sources too.
>
> Signed-off-by: Bharata B Rao <bharata@....com>
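Just to check I'm reading the intended split right: the hint fault path
ends up as something like the below, with detection staying in the fault
handler and everything else deferred to kpromoted? (Rough sketch only;
the exact pghot_record_access() arguments and the call site are my guess
from the description above, not the real signature.)

	/* NUMA hint fault path (do_numa_page() or similar), sketch: */
	if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
	    !node_is_toptier(nid))
		/*
		 * Only record the access here; classification, rate
		 * limiting and the promotion itself now happen in
		 * kpromoted.
		 */
		pghot_record_access(folio_pfn(folio), nid,
				    PGHOT_HINT_FAULT, jiffies);
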
Making a direct replacement without any fallback to the previous method
is going to need a lot of data to show there are no important regressions.
So a bold move, if that's the intent!
J
> ---
> include/linux/pghot.h | 2 +
> kernel/sched/fair.c | 149 ++----------------------------------------
> mm/memory.c | 32 ++-------
> mm/pghot.c | 132 +++++++++++++++++++++++++++++++++++--
> 4 files changed, 142 insertions(+), 173 deletions(-)
>
> diff --git a/mm/pghot.c b/mm/pghot.c
> index 9f7581818b8f..9f5746892bce 100644
> --- a/mm/pghot.c
> +++ b/mm/pghot.c
> @@ -9,6 +9,9 @@
> *
> * kpromoted is a kernel thread that runs on each toptier node and
> * promotes pages from max_heap.
> + *
> + * Migration rate-limiting and dynamic threshold logic implementations
> + * were moved from NUMA Balancing mode 2.
> */
> #include <linux/pghot.h>
> #include <linux/kthread.h>
> @@ -34,6 +37,9 @@ static bool kpromoted_started __ro_after_init;
>
> static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
>
> +/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
> +static unsigned int sysctl_pghot_promote_rate_limit = 65536;
If the comment correlates with the value, this is 64 GiB/s (65536 MB/s)?
That seems unlikely, though I guess it's possible.
> +
> #ifdef CONFIG_SYSCTL
> static const struct ctl_table pghot_sysctls[] = {
> {
> @@ -44,8 +50,17 @@ static const struct ctl_table pghot_sysctls[] = {
> .proc_handler = proc_dointvec_minmax,
> .extra1 = SYSCTL_ZERO,
> },
> + {
> + .procname = "pghot_promote_rate_limit_MBps",
> + .data = &sysctl_pghot_promote_rate_limit,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + },
> };
> #endif
> +
Put that in an earlier patch to reduce noise here.
> static bool phi_heap_less(const void *lhs, const void *rhs, void *args)
> {
> return (*(struct pghot_info **)lhs)->frequency >
> @@ -94,11 +109,99 @@ static bool phi_heap_insert(struct max_heap *phi_heap, struct pghot_info *phi)
> return true;
> }
>
> +/*
> + * For memory tiering mode, if there are enough free pages (more than
> + * enough watermark defined here) in fast memory node, to take full
I'd use enough_wmark in the comment, just because "more than enough" is a
common English phrase and I at least tripped over that sentence as a result!
> + * advantage of fast memory capacity, all recently accessed slow
> + * memory pages will be migrated to fast memory node without
> + * considering hot threshold.
> + */
> +static bool pgdat_free_space_enough(struct pglist_data *pgdat)
> +{
> + int z;
> + unsigned long enough_wmark;
> +
> + enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
> + pgdat->node_present_pages >> 4);
> + for (z = pgdat->nr_zones - 1; z >= 0; z--) {
> + struct zone *zone = pgdat->node_zones + z;
> +
> + if (!populated_zone(zone))
> + continue;
> +
> + if (zone_watermark_ok(zone, 0,
> + promo_wmark_pages(zone) + enough_wmark,
> + ZONE_MOVABLE, 0))
> + return true;
> + }
> + return false;
> +}
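As an aside, if I've read this right, on e.g. a 64 GiB toptier node that
works out as max(1 GiB, 64 GiB >> 4) = 4 GiB of free space needed above
the promo watermark before the hot threshold check gets skipped. Might be
worth stating the intended magnitude in the comment.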
> +
> +static void kpromoted_promotion_adjust_threshold(struct pglist_data *pgdat,
Needs documentation of the algorithm and the reasons for various choices.
I see it is a code move though so maybe that's a job for another day.
> + unsigned long rate_limit,
> + unsigned int ref_th,
> + unsigned long now)
> +{
> + unsigned int start, th_period, unit_th, th;
> + unsigned long nr_cand, ref_cand, diff_cand;
> +
> + now = jiffies_to_msecs(now);
> + th_period = KPROMOTED_PROMOTION_THRESHOLD_WINDOW;
> + start = pgdat->nbp_th_start;
> + if (now - start > th_period &&
> + cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
> + ref_cand = rate_limit *
> + KPROMOTED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC;
> + nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
> + diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
> + unit_th = ref_th * 2 / KPROMOTED_MIGRATION_ADJUST_STEPS;
> + th = pgdat->nbp_threshold ? : ref_th;
> + if (diff_cand > ref_cand * 11 / 10)
> + th = max(th - unit_th, unit_th);
> + else if (diff_cand < ref_cand * 9 / 10)
> + th = min(th + unit_th, ref_th * 2);
> + pgdat->nbp_th_nr_cand = nr_cand;
> + pgdat->nbp_threshold = th;
> + }
> +}
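FWIW my reading of it, which could more or less go in as that comment
(please correct me if I have the direction wrong): every
KPROMOTED_PROMOTION_THRESHOLD_WINDOW the number of new promotion
candidates is compared against what the rate limit would allow in that
window, and the threshold (still the hint fault latency cut-off in ms, I
assume, so smaller means stricter) is nudged by one step accordingly.
Something like:

	/*
	 * Once per THRESHOLD_WINDOW:
	 *   ref_cand  = pages the rate limit permits per window
	 *   diff_cand = new PGPROMOTE_CANDIDATE pages seen this window
	 *   if diff_cand > 110% of ref_cand: th -= ref_th * 2 / STEPS (stricter)
	 *   if diff_cand <  90% of ref_cand: th += ref_th * 2 / STEPS (laxer)
	 *   clamp th to [ref_th * 2 / STEPS, ref_th * 2]
	 */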
> +
> static bool phi_is_pfn_hot(struct pghot_info *phi)
> {
> struct page *page = pfn_to_online_page(phi->pfn);
> - unsigned long now = jiffies;
> struct folio *folio;
> + struct pglist_data *pgdat;
> + unsigned long rate_limit;
> + unsigned int latency, th, def_th;
> + unsigned long now = jiffies;
>
Avoid the reorder. Just put it here in the first place if you prefer this.