Message-ID: <1ed60558-cd6d-43c4-b8e2-e9492e5aac2e@suse.cz>
Date: Wed, 22 Oct 2025 09:41:43 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: jinji zhong <jinji.z.zhong@...il.com>, joshua.hahnjy@...il.com
Cc: Liam.Howlett@...cle.com, akpm@...ux-foundation.org,
axelrasmussen@...gle.com, david@...hat.com, feng.han@...or.com,
hannes@...xchg.org, jackmanb@...gle.com, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, liulu.liu@...or.com, lorenzo.stoakes@...cle.com,
mhocko@...e.com, rppt@...nel.org, shakeel.butt@...ux.dev, surenb@...gle.com,
weixugc@...gle.com, yuanchu@...gle.com, zhengqi.arch@...edance.com,
zhongjinji@...or.com, ziy@...dia.com
Subject: Re: [PATCH] mm/page_alloc: Consider PCP pages as part of pfmemalloc_reserve
On 10/21/25 20:04, jinji zhong wrote:
>> Zhongjinji, thank you for your patch!
Hi,
>>> When free_pages becomes critically low, the kernel prevents other tasks
>>> from entering the slow path to ensure that reclaiming tasks can
>>> successfully allocate memory.
>>>
>>> This blocking is important to avoid memory contention with reclaiming
>>> tasks. However, in some cases it is unnecessary because the PCP list may
>>> already contain sufficient pages, as freed pages are first placed there
>>> and are not immediately visible to the buddy system.
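
(For readers following the thread: the gate being discussed is
allow_direct_reclaim(). A simplified sketch of it follows; this is condensed
from mm/vmscan.c and the exact code varies by kernel version. The key point
is that NR_FREE_PAGES only counts pages on the buddy freelists, so pages
sitting on pcp lists are invisible to this check.)

static bool allow_direct_reclaim(pg_data_t *pgdat)
{
	unsigned long pfmemalloc_reserve = 0, free_pages = 0;
	struct zone *zone;
	int i;

	for (i = 0; i <= ZONE_NORMAL; i++) {
		zone = &pgdat->node_zones[i];
		if (!managed_zone(zone) || !zone_reclaimable_pages(zone))
			continue;
		pfmemalloc_reserve += min_wmark_pages(zone);
		/* buddy pages only; pages on pcp lists are excluded */
		free_pages += zone_page_state_snapshot(zone, NR_FREE_PAGES);
	}
	if (!pfmemalloc_reserve)
		return true;

	/* throttle once free pages fall below half the min-watermark sum */
	return free_pages > pfmemalloc_reserve / 2;
}
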
>>
>>Based on my limited understanding of pcp free pages, I had a concern here
>>about whether this would really provide the desired effect. That is, since
>>the pages in the pcp are not available to the buddy allocator unless we
>>drain the pcp lists (and this operation is not free), I was unsure if there
>>was a clear benefit to allowing the system to go unblocked.
>
> The purpose of this patch is to account for the memory in the pcp lists
> within pfmemalloc_reserve, which would allow more tasks to enter the slow
> path. For critical tasks, entering the slow path is a better alternative
> than being throttled until kswapd wakes up.
Were you able to measure the benefits of the patch? What are the "critical
tasks"? Usually such tasks use mlockall() to avoid allocating memory during
operation. Maybe we should instead exempt some task priority classes from
being throttled?
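
(A minimal userspace sketch of that pattern, for illustration only; the
buffer size and pre-faulting strategy here are made up:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	/* Lock current and future mappings so the hot path never
	 * page-faults, and thus never enters the page allocator. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
		perror("mlockall");
		return EXIT_FAILURE;
	}

	/* Pre-fault the working set while allocation latency is still
	 * tolerable; MCL_FUTURE keeps new mappings locked afterwards. */
	size_t sz = 16 << 20;	/* hypothetical 16 MiB working set */
	char *buf = malloc(sz);
	if (!buf)
		return EXIT_FAILURE;
	memset(buf, 0, sz);

	/* ... latency-critical work using buf ... */

	free(buf);
	return EXIT_SUCCESS;
}
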
>>If we are already at the point where we need the pcp pages to have enough
>>free pages to go over the watermark, perhaps it makes sense to just block
>>tasks for now and enter direct reclaim? Allowing more allocations might
>>leave the system in a worse state than before, and it will have to go
>>through direct reclaim anyway.
>>
>>Please let me know if this makes sense!
>>
>>> By accounting PCP pages as part of pfmemalloc_reserve, we can reduce
>>> unnecessary blocking and improve system responsiveness under low-memory
>>> conditions.
>>>
>>> Signed-off-by: zhongjinji <zhongjinji@...or.com>
>>
>>[...snip...]
>>
>>> +int zone_pcp_pages_count(struct zone *zone)
>>> +{
>>> +	struct per_cpu_pages *pcp;
>>> +	int total_pcp_pages = 0;
>>> +	int cpu;
>>> +
>>> +	for_each_online_cpu(cpu) {
>>> +		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>>> +		total_pcp_pages += pcp->count;
>>
>>Could this be racy? What is stopping the pcp count from decreasing while we
>>are iterating over each online cpu, over each managed zone? Under the
>>memory pressure conditions that this patch is aiming to fix, I think there
>>is a good chance the number we get here will be very outdated by the time
>>we try to take action based on it, and we may leave the system further
>>stalled since we don't take action to reclaim memory.
>
> Thank you, Joshua. Indeed, the pcp->count might be outdated by the time we
> act on it. And kswapd may fail to allocate memory if the pages in the pcp
> lists are on other CPUs. While drain_all_pages() may be triggered by direct
> reclaim on other CPUs, some hard-to-predict scenarios might still exist.
>
> Perhaps performing drain_all_pages() before actually calling
> throttle_direct_reclaim() would be better.
Draining makes more sense indeed. But it should be backed by data showing
improvement. As you say, draining is probably already happening via direct
reclaim.
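
(For context, the existing drain looks roughly like this in
__alloc_pages_direct_reclaim() in mm/page_alloc.c; simplified, details vary
by kernel version:)

	/* After direct reclaim, a still-failing allocation drains the
	 * per-cpu lists back to the buddy allocator once and retries
	 * the freelists before giving up. */
retry:
	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

	if (!page && !drained) {
		drain_all_pages(NULL);	/* flush pcp lists on all CPUs */
		drained = true;
		goto retry;
	}

Note that drain_all_pages(NULL) touches every CPU's pcp lists and is not
free, as Joshua notes above.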
> Like the following code.
>
> @@ -6535,6 +6535,7 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
>  static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  					nodemask_t *nodemask)
>  {
> +	bool drained = false;
>  	struct zoneref *z;
>  	struct zone *zone;
>  	pg_data_t *pgdat = NULL;
> @@ -6570,6 +6571,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	 * for remote pfmemalloc reserves and processes on different nodes
>  	 * should make reasonable progress.
>  	 */
> +retry:
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>  					gfp_zone(gfp_mask), nodemask) {
>  		if (zone_idx(zone) > ZONE_NORMAL)
> @@ -6586,6 +6588,12 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	if (!pgdat)
>  		goto out;
>  
> +	if (!drained) {
> +		drained = true;
> +		drain_all_pages(NULL);
> +		goto retry;
> +	}
> +
>  	/* Account for the throttling */
>  	count_vm_event(PGSCAN_DIRECT_THROTTLE);
>
>>[...snip...]
>>
>>Please feel free to let me know if I am missing something obvious. Again,
>>I am not very familiar with the pcp code, so there is a good chance that
>>you are seeing something that I am not :-)
>>
>>Thank you for the patch, I hope you have a great day!
>>Joshua
>>