Message-ID: <20250820154855.2002698-1-joshua.hahnjy@gmail.com>
Date: Wed, 20 Aug 2025 08:48:54 -0700
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Johannes Weiner <hannes@...xchg.org>,
Chris Mason <clm@...com>,
Vlastimil Babka <vbabka@...e.cz>,
Suren Baghdasaryan <surenb@...gle.com>,
Michal Hocko <mhocko@...e.com>,
Brendan Jackman <jackmanb@...gle.com>,
Zi Yan <ziy@...dia.com>,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
kernel-team@...a.com
Subject: Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
On Tue, 19 Aug 2025 22:41:11 -0700 Andrew Morton <akpm@...ux-foundation.org> wrote:
Hello Andrew, thank you again for your input : -)
> On Mon, 18 Aug 2025 11:58:03 -0700 Joshua Hahn <joshua.hahnjy@...il.com> wrote:
>
> > While testing workloads with high sustained memory pressure on large machines
> > (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> > Further investigation showed that the lock in free_pcppages_bulk was being held
> > for a long time, even being held while 2k+ pages were being freed.
>
> It would be interesting to share some of those softlockup traces.
Unfortunately, it has been a long time since these softlockups were detected
on our fleet, so the records of them have since disappeared : -(
What I do have is an example trace of an RCU stall warning:
[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu:              hardirqs   softirqs   csw/system
[ 4512.638793] rcu:      number:        0        145          0
[ 4512.651177] rcu:     cputime:       30      10410        174   ==> 10558(ms)
[ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)
And here is the trace that accompanies it:
[ 4512.666815] RIP: 0010:free_unref_folios+0x47d/0xd80
[ 4512.666818] Code: 00 00 31 ff 40 80 ce 01 41 88 76 18 e9 a8 fe ff ff 40 84 ff 0f 84 d6 00 00 00 39 f0 0f 4c f0 4c 89 ff 4c 89 f2 e8 13 f2 fe ff <49> f7 87 88 05 00 00 04 00 00 00 0f 84 00 ff ff ff 49 8b 47 20 49
[ 4512.666820] RSP: 0018:ffffc900a62f3878 EFLAGS: 00000206
[ 4512.666822] RAX: 000000000005ae80 RBX: 000000000000087a RCX: 0000000000000001
[ 4512.666824] RDX: 000000000000007d RSI: 0000000000000282 RDI: ffff89404c8ba310
[ 4512.666825] RBP: 0000000000000001 R08: ffff89404c8b9d80 R09: 0000000000000001
[ 4512.666826] R10: 0000000000000010 R11: 00000000000130de R12: ffff89404c8b9d80
[ 4512.666827] R13: ffffea01cf3c0000 R14: ffff893d3ac5aec0 R15: ffff89404c8b9d80
[ 4512.666833] ? free_unref_folios+0x47d/0xd80
[ 4512.666836] free_pages_and_swap_cache+0xcd/0x1a0
[ 4512.666847] tlb_finish_mmu+0x11c/0x350
[ 4512.666850] vms_clear_ptes+0xf9/0x120
[ 4512.666855] __mmap_region+0x29a/0xc00
[ 4512.666867] do_mmap+0x34e/0x910
[ 4512.666873] vm_mmap_pgoff+0xbb/0x200
[ 4512.666877] ? hrtimer_interrupt+0x337/0x5c0
[ 4512.666879] ? sched_clock+0x5/0x10
[ 4512.666882] ? sched_clock_cpu+0xc/0x170
[ 4512.666885] ? irqtime_account_irq+0x2b/0xa0
[ 4512.666888] do_syscall_64+0x68/0x130
[ 4512.666892] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 4512.666896] RIP: 0033:0x7f1afe9257e2
> We have this CONFIG_PCP_BATCH_SCALE_MAX which appears to exist to
> address precisely this issue. But only about half of the
> free_pcppages_bulk() callers actually honor it.
I see. I think this makes sense, and I also agree that there should probably be
some guardrails in the callers of this function, especially since
free_pcppages_bulk() is unaware of how the pcp lock is acquired / released.
Functions like drain_zone_pages(), which explicitly enforces this by capping
"to_drain" at min(pcp->batch, pcp->count), seem like a smart model to follow.
> So perhaps the fix is to fix the callers which forgot to implement this?
>
> - decay_pcp_high() tried to implement CONFIG_PCP_BATCH_SCALE_MAX, but
> that code hurts my brain.
To be honest, I don't fully understand decay_pcp_high() either : -)
From what I can tell, it seems like CONFIG_PCP_BATCH_SCALE_MAX doesn't directly
pass in a value that limits how many pages are freed at once by the bulk
freer, but rather tunes the parameter pcp->high. (Except for drain_pages_zone,
which you have pointed out below).
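To illustrate my reading, here is a simplified sketch of decay_pcp_high();
the return value and some bookkeeping around pcp->high_min are elided, so
please treat this as my interpretation rather than the exact code:

	void decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
	{
		int high_min = READ_ONCE(pcp->high_min);
		int batch = READ_ONCE(pcp->batch);
		int to_drain;

		/*
		 * CONFIG_PCP_BATCH_SCALE_MAX bounds how far pcp->high may
		 * decay in one step; it does not cap the bulk free directly.
		 */
		if (pcp->high > high_min)
			pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
					 pcp->high - (pcp->high >> 3), high_min);

		/* Whatever now sits above the high mark is freed in one call. */
		to_drain = pcp->count - pcp->high;
		if (to_drain > 0) {
			spin_lock(&pcp->lock);
			free_pcppages_bulk(zone, to_drain, pcp, 0);
			spin_unlock(&pcp->lock);
		}
	}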
> - drain_pages_zone() implements it but, regrettably, doesn't use it
> to periodically release pcp->lock. Room for improvement there.
From what I can see, it seems like drain_pages_zone() does release pcp->lock
every pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX pages (simplified pseudocode
below):
	do {
		spin_lock(&pcp->lock);
		to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
		free_pcppages_bulk(zone, to_drain, ...);
		count -= to_drain;
		spin_unlock(&pcp->lock);
	} while (count);
Although the concern you raised earlier, of whether another thread can
realistically grab pcp->lock in the short window between the unlock and the
relock, still applies here. (And likewise, the concern is relieved on x86,
arm64, or any other arch whose spinlocks are queued, since queued spinlocks
hand the lock to waiters in FIFO order, so even a brief unlock is enough for
a waiter to get in.)
With all of this said, I think adding the periodic unlocking / locking of the
zone lock within free_pcppages_bulk still makes sense: if the caller enforces
count to be <= pcp->batch, then the check is essentially a no-op; otherwise,
we gain some locking safety, which would also protect against any future
callers that forget to add the check.
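Something along these lines is what I have in mind (a rough sketch of the idea
only: the real pcp list-walking logic is elided, and using pcp->batch as the
relinquish interval is my assumption, not necessarily what the patch does):

	static void free_pcppages_bulk(struct zone *zone, int count,
				       struct per_cpu_pages *pcp, int pindex)
	{
		unsigned long flags;
		int freed = 0;

		spin_lock_irqsave(&zone->lock, flags);
		while (count > 0) {
			/* ... detach one page from the pcp list and free it ... */
			count--;
			/*
			 * Every pcp->batch pages, briefly drop zone->lock so
			 * queued waiters can make progress. This is a no-op
			 * for callers that already cap count at pcp->batch.
			 */
			if (++freed % pcp->batch == 0 && count > 0) {
				spin_unlock_irqrestore(&zone->lock, flags);
				spin_lock_irqsave(&zone->lock, flags);
			}
		}
		spin_unlock_irqrestore(&zone->lock, flags);
	}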
Thank you for your thoughtful review, Andrew. I hope you have a great day!
Joshua
Sent using hkml (https://github.com/sjp38/hackermail)