Message-ID: <20251021180435.11789-1-jinji.z.zhong@gmail.com>
Date: Tue, 21 Oct 2025 18:04:35 +0000
From: jinji zhong <jinji.z.zhong@...il.com>
To: joshua.hahnjy@...il.com
Cc: Liam.Howlett@...cle.com,
	akpm@...ux-foundation.org,
	axelrasmussen@...gle.com,
	david@...hat.com,
	feng.han@...or.com,
	hannes@...xchg.org,
	jackmanb@...gle.com,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	liulu.liu@...or.com,
	lorenzo.stoakes@...cle.com,
	mhocko@...e.com,
	rppt@...nel.org,
	shakeel.butt@...ux.dev,
	surenb@...gle.com,
	vbabka@...e.cz,
	weixugc@...gle.com,
	yuanchu@...gle.com,
	zhengqi.arch@...edance.com,
	zhongjinji@...or.com,
	ziy@...dia.com
Subject: Re: [PATCH] mm/page_alloc: Consider PCP pages as part of pfmemalloc_reserve

> Zhongjinji, thank you for your patch!
>
>> When free_pages becomes critically low, the kernel prevents other tasks
>> from entering the slow path to ensure that reclaiming tasks can
>> successfully allocate memory.
>> 
>> This blocking is important to avoid memory contention with reclaiming
>> tasks. However, in some cases it is unnecessary because the PCP list may
>> already contain sufficient pages, as freed pages are first placed there
>> and are not immediately visible to the buddy system.
>
>Based on my limited understanding of pcp free pages, I had a concern here
>about whether this would really provide the desired effect. That is, since
>the pages in the pcp are not available to the buddy allocator unless we
>drain the pcp lists (and this operation is not free), I was unsure whether
>there is a clear benefit to allowing the system to go unblocked.

The purpose of this patch is to account for the memory in the pcp lists
within pfmemalloc_reserve, which would allow more tasks to enter the slow
path. For critical tasks, entering the slow path is a better alternative
than being throttled until kswapd wakes them up.

>If we are already at the point where we need the pcp pages to have enough
>free pages to go over the watermark, perhaps it makes sense to just block
>tasks for now and enter direct reclaim? Allowing more allocations might
>leave the system in a worse state than before, and it will have to
>go through direct reclaim anyway.
>
>Please let me know if this makes sense!
> 
>> By accounting PCP pages as part of pfmemalloc_reserve, we can reduce
>> unnecessary blocking and improve system responsiveness under low-memory
>> conditions.
>> 
>> Signed-off-by: zhongjinji <zhongjinji@...or.com>
>
>[...snip...]
>
>> +int zone_pcp_pages_count(struct zone *zone)
>> +{
>> +	struct per_cpu_pages *pcp;
>> +	int total_pcp_pages = 0;
>> +	int cpu;
>> +
>> +	for_each_online_cpu(cpu) {
>> +		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>> +		total_pcp_pages += pcp->count;
>
>Could this be racy? What is stopping the pcp count from decreasing while we
>are iterating over each online cpu, over each managed zone? Under the
>memory pressure conditions that this patch is aiming to fix, I think there
>is a good chance the number we get here will be very outdated by the time
>we try to take action based on it, and the system may end up stalled
>further, since we don't take action to reclaim memory.

Thank you, Joshua. Indeed, pcp->count might be outdated by the time it is
used. And kswapd may still fail to allocate memory if the pages in the pcp
lists are on other CPUs. While drain_all_pages() may be triggered by direct
reclaim on other CPUs, some hard-to-predict scenarios might still exist.

Perhaps performing drain_all_pages() before actually throttling in
throttle_direct_reclaim() would be better, something like the following:

@@ -6535,6 +6535,7 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
 static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 					nodemask_t *nodemask)
 {
+	bool drained = false;
 	struct zoneref *z;
 	struct zone *zone;
 	pg_data_t *pgdat = NULL;
@@ -6570,6 +6571,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	 * for remote pfmemalloc reserves and processes on different nodes
 	 * should make reasonable progress.
 	 */
+retry:
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(gfp_mask), nodemask) {
 		if (zone_idx(zone) > ZONE_NORMAL)
@@ -6586,6 +6588,12 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	if (!pgdat)
 		goto out;
 
+	if (!drained) {
+		drained = true;
+		drain_all_pages(NULL);
+		goto retry;
+	}
+
 	/* Account for the throttling */
 	count_vm_event(PGSCAN_DIRECT_THROTTLE);

>[...snip...]
>
>Please feel free to let me know if I am missing something obvious. Again,
>I am not very familiar with the pcp code, so there is a good chance that
>you are seeing something that I am not : -)
>
>Thank you for the patch, I hope you have a great day!
>Joshua
>
