Message-ID: <20250625021352.2291544-1-lizhijian@fujitsu.com>
Date: Wed, 25 Jun 2025 10:13:52 +0800
From: Li Zhijian <lizhijian@...itsu.com>
To: linux-mm@...ck.org
Cc: akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org,
y-goto@...itsu.com,
Li Zhijian <lizhijian@...itsu.com>,
Huang Ying <ying.huang@...ux.alibaba.com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
lkp@...el.com
Subject: [PATCH RFC v2] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
Goto-san reported confusing pgpromote statistics where
the pgpromote_success count significantly exceeded pgpromote_candidate.
On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
# Enable demotion only
echo 1 > /sys/kernel/mm/numa/demotion_enabled
numactl -m 0-1 memhog -r200 3500M >/dev/null &
pid=$!
sleep 2
numactl memhog -r100 2500M >/dev/null &
sleep 10
kill -9 $pid # terminate the 1st memhog
# Enable promotion
echo 2 > /proc/sys/kernel/numa_balancing
After a few seconds, we observed `pgpromote_candidate < pgpromote_success`:
$ grep -e pgpromote /proc/vmstat
pgpromote_success 2579
pgpromote_candidate 0
In this scenario, after terminating the first memhog, the conditions for
pgdat_free_space_enough() are quickly met, triggering promotion.
However, these migrated pages are only accounted for in PGPROMOTE_SUCCESS,
not in PGPROMOTE_CANDIDATE.
This patch also increments PGPROMOTE_CANDIDATE in the free-space branch
whenever a promotion decision is made there. This changes the rate-limit
behavior: since free-space promotions now count as candidates, the rate
limit can be reached sooner than before.
For example:
Rate Limit = 100 pages/sec
Scenario:
T0: 90 free-space migrations
T0+100ms: 20-page migration request
Before:
Rate limit is *not* reached: 0 + 20 = 20 < 100
PGPROMOTE_CANDIDATE: 20
After:
Rate limit is reached: 90 + 20 = 110 > 100
PGPROMOTE_CANDIDATE: 110
Because the rate-limit window is recalculated every second, in theory
only the transition from pgdat_free_space_enough() to
!pgdat_free_space_enough() within a single one-second window can be
affected.
Moreover, within that one-second span, promotions triggered by
pgdat_free_space_enough() were previously not restricted by the rate
limit at all, which in theory makes application latency easier to cause.
With this change, the rate limit is applied more consistently across
such a transition from pgdat_free_space_enough() to
!pgdat_free_space_enough() within one second.
Cc: Huang Ying <ying.huang@...ux.alibaba.com>
Cc: Ingo Molnar <mingo@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Juri Lelli <juri.lelli@...hat.com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Steven Rostedt <rostedt@...dmis.org>
Cc: Ben Segall <bsegall@...gle.com>
Cc: Mel Gorman <mgorman@...e.de>
Cc: Valentin Schneider <vschneid@...hat.com>
Reported-by: Yasunori Gotou (Fujitsu) <y-goto@...itsu.com>
Signed-off-by: Li Zhijian <lizhijian@...itsu.com>
---
V2:
Fix compiling error # Reported by LKP
As Ying suggested, we need to assess whether this change causes a
regression. However, given the narrow conditions this patch affects,
a proper evaluation may be challenging, since the outcome depends on
one's perspective: much like a zero-sum game, if one workload benefits,
another may lose.
If I obtain further results, I will update this thread.
Cc: lkp@...el.com
Here, I hope to leverage the existing LKP benchmark to evaluate the
potential impacts. The ideal evaluation conditions are:
1. Installed with DRAM + NVDIMM (which can be simulated).
2. NVDIMM is used as system RAM (configurable via daxctl).
3. Promotion is enabled (`echo 2 > /proc/sys/kernel/numa_balancing`).
Alternative:
We could instead eliminate the impact on the rate limit within the
pgdat_free_space_enough() branch, so that the rate-limit behavior
remains exactly as before.
For instance, consider the following change:
if (pgdat_free_space_enough(pgdat)) {
/* workload changed, reset hot threshold */
pgdat->nbp_threshold = 0;
+ pgdat->nbp_rl_nr_cand += nr;
mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
return true;
}
RFC:
I am uncertain whether this discrepancy was originally intended or
simply overlooked.
Either way, the current situation, where
pgpromote_candidate < pgpromote_success, is confusing when read literally.
---
kernel/sched/fair.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..505b40f8897a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
struct pglist_data *pgdat;
unsigned long rate_limit;
unsigned int latency, th, def_th;
+ long nr = folio_nr_pages(folio);
pgdat = NODE_DATA(dst_nid);
if (pgdat_free_space_enough(pgdat)) {
/* workload changed, reset hot threshold */
pgdat->nbp_threshold = 0;
+ mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
return true;
}
@@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
if (latency >= th)
return false;
- return !numa_promotion_rate_limit(pgdat, rate_limit,
- folio_nr_pages(folio));
+ return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
}
this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
--
2.41.0