linux-kernel - Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE

Message-ID: <47f42c60-9752-4bc6-9079-627b6e0b9cfc@fujitsu.com>
Date: Mon, 23 Jun 2025 08:54:28 +0000
From: "Zhijian Li (Fujitsu)" <lizhijian@...itsu.com>
To: "Huang, Ying" <ying.huang@...ux.alibaba.com>
CC: "linux-mm@...ck.org" <linux-mm@...ck.org>, "akpm@...ux-foundation.org"
	<akpm@...ux-foundation.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "Yasunori Gotou (Fujitsu)"
	<y-goto@...itsu.com>, Ingo Molnar <mingo@...hat.com>, Peter Zijlstra
	<peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel
 Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, kernel
 test robot <lkp@...el.com>
Subject: Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE
 accounting



On 20/06/2025 14:28, Huang, Ying wrote:
> Li Zhijian <lizhijian@...itsu.com> writes:
> 
>> Goto-san reported confusing pgpromote statistics where
>> the pgpromote_success count significantly exceeded pgpromote_candidate.
>> The issue manifests under specific memory pressure conditions:
>> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
>> in lower-tier memory (CXL). After terminating memhog, the stats show:
> 
> The above description is confusing.  The page promotion occurs when the
> size of the top-tier free space is large enough (after killing the
> memhog above).  The accessed lower-tier memory will be promoted upon
> accessing to take full advantage of the more expensive top-tier memory.

Yeah, that's what the promotion does.

Let me spell out the reproducer steps (thanks to Goto-san for the reproducer).
On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):

# Enable demotion only
echo 1 > /sys/kernel/mm/numa/demotion_enabled
numactl -m 0-1 memhog -r200 3500M >/dev/null &
pid=$!
sleep 2
numactl memhog -r100 2500M >/dev/null &
sleep 10
kill -9 $pid
# Enable promotion
echo 2 > /proc/sys/kernel/numa_balancing

# After a few seconds, we observe `pgpromote_candidate < pgpromote_success`

In this scenario, after terminating the first memhog, the conditions for pgdat_free_space_enough() are quickly met, triggering promotion.
However, these migrated pages are only accounted for in PGPROMOTE_SUCCESS, not in PGPROMOTE_CANDIDATE.
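
To make the observation above checkable from a captured /proc/vmstat dump, here is a minimal user-space sketch. It is not part of the patch; the helper names vmstat_lookup() and promote_stats_inconsistent() are hypothetical and chosen only for illustration. It assumes the usual "name value" one-counter-per-line layout of /proc/vmstat.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the value of `key` from vmstat-style text
 * ("name value\n" per line), or -1 if the counter is not present.
 */
long vmstat_lookup(const char *text, const char *key)
{
	size_t klen;
	const char *line = text;

	assert(text != NULL && key != NULL);
	klen = strlen(key);

	while (line && *line) {
		/* Match the whole counter name, followed by a single space. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ')
			return strtol(line + klen + 1, NULL, 10);
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline to the next line */
	}
	return -1;
}

/*
 * Hypothetical check: 1 if pgpromote_success exceeds pgpromote_candidate,
 * 0 if it does not, -1 if either counter is missing from the text.
 */
int promote_stats_inconsistent(const char *text)
{
	long success = vmstat_lookup(text, "pgpromote_success");
	long candidate = vmstat_lookup(text, "pgpromote_candidate");

	if (success < 0 || candidate < 0)
		return -1;
	return success > candidate;
}
```

Feeding it the two lines from the original report (pgpromote_success 2579, pgpromote_candidate 1) flags the dump as inconsistent.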


> 
>> $ grep -e pgpromote /proc/vmstat
>> pgpromote_success 2579
>> pgpromote_candidate 1
>>
>> This update increments PGPROMOTE_CANDIDATE within the free space branch
>> when a promotion decision is made, which may alter the mechanism of the
>> rate limit. Consequently, it becomes easier to reach the rate limit than
>> it was previously.
>>
>> For example:
>> Rate Limit = 100 pages/sec
>> Scenario:
>>    T0: 90 free-space migrations
>>    T0+100ms: 20-page migration request
>>
>> Before:
>>    Rate limit is *not* reached: 0 + 20 = 20 < 100
>>    PGPROMOTE_CANDIDATE: 20
>> After:
>>    Rate limit is reached: 90 + 20 = 110 > 100
>>    PGPROMOTE_CANDIDATE: 110
> 
> Yes.  The rate limit will be influenced by the change.  So, more tests
> may be needed to verify that it does not incur regressions.


Testing this might be challenging because the impact is workload-dependent. Do you have any recommended workloads for evaluation?
Alternatively, could we rely on the LKP project for impact assessment? (The current patch has not actually been tested
by LKP because of a compilation error; I will post a V2 soon.)

However, regarding the rate limit change itself, I consider this patch logically correct. As stated in the numa_promotion_rate_limit() comment:
> "For memory tiering mode, too high promotion/demotion throughput may hurt application latency."
It seems there is no justification for excluding pgdat_free_space_enough() triggered promotions from the rate limiting mechanism.
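
To make the Before/After arithmetic quoted above concrete, here is a simplified, single-threaded user-space model. It is only a sketch and does not reproduce the kernel code: struct node_model and the model_*() functions are hypothetical names, and it assumes candidates are counted against the limit within a one-second window, consistent with the pages/sec figure in the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the per-node state involved. */
struct node_model {
	unsigned long candidate;   /* models PGPROMOTE_CANDIDATE */
	unsigned long rl_nr_cand;  /* candidate count at window start */
	unsigned int rl_start_ms;  /* start of the current window, in ms */
};

/*
 * Free-space branch: the promotion always goes ahead. The pages are
 * counted as candidates only when modelling the patched behaviour.
 */
void model_free_space_promote(struct node_model *n, unsigned long nr,
			      bool patched)
{
	assert(nr > 0);
	if (patched)
		n->candidate += nr;
}

/*
 * Rate-limited branch: count the pages as candidates, start a new
 * window if more than one second has passed (assumed window length),
 * and report whether this window has reached the limit.
 */
bool model_rate_limited(struct node_model *n, unsigned int now_ms,
			unsigned long rate_limit, unsigned long nr)
{
	assert(nr > 0);
	n->candidate += nr;
	if (now_ms - n->rl_start_ms > 1000) {
		n->rl_start_ms = now_ms;
		n->rl_nr_cand = n->candidate;
	}
	return n->candidate - n->rl_nr_cand >= rate_limit;
}

/*
 * Replay the quoted scenario: 90 free-space pages at T0, then a 20-page
 * request at T0+100ms with a limit of 100 pages/sec. Returns whether the
 * 20-page request is rate limited; stores the final candidate count.
 */
bool model_replay(bool patched, unsigned long *candidate_out)
{
	struct node_model n = { 0, 0, 0 };
	bool limited;

	model_free_space_promote(&n, 90, patched);
	limited = model_rate_limited(&n, 100, 100, 20);
	*candidate_out = n.candidate;
	return limited;
}
```

In this model, model_replay(false, ...) leaves the request unthrottled with 20 candidates, while model_replay(true, ...) throttles it with 110 candidates, matching the Before/After figures in the commit message.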



> 
>>
>> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@...itsu.com>
>> Signed-off-by: Li Zhijian <lizhijian@...itsu.com>
>> ---
>>
>> This is marked as RFC because I am uncertain whether we originally
>> intended for this or if it was overlooked.
>>
>> However, the current situation where pgpromote_candidate < pgpromote_success
>> is indeed confusing when interpreted literally.
>>
>> Cc: Huang Ying <ying.huang@...ux.alibaba.com>
>> Cc: Ingo Molnar <mingo@...hat.com>
>> Cc: Peter Zijlstra <peterz@...radead.org>
>> Cc: Juri Lelli <juri.lelli@...hat.com>
>> Cc: Vincent Guittot <vincent.guittot@...aro.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@....com>
>> Cc: Steven Rostedt <rostedt@...dmis.org>
>> Cc: Ben Segall <bsegall@...gle.com>
>> Cc: Mel Gorman <mgorman@...e.de>
>> Cc: Valentin Schneider <vschneid@...hat.com>
>> ---
>>   kernel/sched/fair.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7a14da5396fb..4715cd4fa248 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>   		struct pglist_data *pgdat;
>>   		unsigned long rate_limit;
>>   		unsigned int latency, th, def_th;
>> +		long nr = folio_nr_pages(folio)


Cc LKP

There is a compilation error here (the declaration above is missing its terminating semicolon), which I
overlooked at the time because of several ongoing refactors in my local tree. Thanks to LKP for detecting this issue.


Thanks
Zhijian


>>   
>>   		pgdat = NODE_DATA(dst_nid);
>>   		if (pgdat_free_space_enough(pgdat)) {
>>   			/* workload changed, reset hot threshold */
>>   			pgdat->nbp_threshold = 0;
>> +			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>>   			return true;
>>   		}
>>   
>> @@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>   		if (latency >= th)
>>   			return false;
>>   
>> -		return !numa_promotion_rate_limit(pgdat, rate_limit,
>> -						  folio_nr_pages(folio));
>> +		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
>>   	}
>>   
>>   	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
> 
> ---
> Best Regards,
> Huang, Ying