Message-ID: <20250919195223.1560636-1-joshua.hahnjy@gmail.com>
Date: Fri, 19 Sep 2025 12:52:18 -0700
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>
Cc: Chris Mason <clm@...com>,
Kiryl Shutsemau <kirill@...temov.name>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Brendan Jackman <jackmanb@...gle.com>,
David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Michal Hocko <mhocko@...e.com>,
Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Zi Yan <ziy@...dia.com>,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
kernel-team@...a.com
Subject: [PATCH 0/4] mm/page_alloc: Batch callers of free_pcppages_bulk

While testing workloads with high sustained memory pressure on large machines
(1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
Further investigation showed that the lock in free_pcppages_bulk was being
held for a long time, in some cases while more than 2k pages were freed in a
single call [1]. This starves other processes of both the pcp and zone locks,
which can lead to softlockups that stall the system [2].

[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu:          hardirqs   softirqs   csw/system
[ 4512.638793] rcu:  number:        0        145            0
[ 4512.651177] rcu: cputime:       30      10410          174   ==> 10558(ms)
[ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)
And here is the trace that accompanies it:
[ 4512.666815] RIP: 0010:free_unref_folios+0x47d/0xd80
[ 4512.666818] Code: 00 00 31 ff 40 80 ce 01 41 88 76 18 e9 a8 fe ff ff 40 84 ff 0f 84 d6 00 00 00 39 f0 0f 4c f0 4c 89 ff 4c 89 f2 e8 13 f2 fe ff <49> f7 87 88 05 00 00 04 00 00 00 0f 84 00 ff ff ff 49 8b 47 20 49
[ 4512.666820] RSP: 0018:ffffc900a62f3878 EFLAGS: 00000206
[ 4512.666822] RAX: 000000000005ae80 RBX: 000000000000087a RCX: 0000000000000001
[ 4512.666824] RDX: 000000000000007d RSI: 0000000000000282 RDI: ffff89404c8ba310
[ 4512.666825] RBP: 0000000000000001 R08: ffff89404c8b9d80 R09: 0000000000000001
[ 4512.666826] R10: 0000000000000010 R11: 00000000000130de R12: ffff89404c8b9d80
[ 4512.666827] R13: ffffea01cf3c0000 R14: ffff893d3ac5aec0 R15: ffff89404c8b9d80
[ 4512.666833] ? free_unref_folios+0x47d/0xd80
[ 4512.666836] free_pages_and_swap_cache+0xcd/0x1a0
[ 4512.666847] tlb_finish_mmu+0x11c/0x350
[ 4512.666850] vms_clear_ptes+0xf9/0x120
[ 4512.666855] __mmap_region+0x29a/0xc00
[ 4512.666867] do_mmap+0x34e/0x910
[ 4512.666873] vm_mmap_pgoff+0xbb/0x200
[ 4512.666877] ? hrtimer_interrupt+0x337/0x5c0
[ 4512.666879] ? sched_clock+0x5/0x10
[ 4512.666882] ? sched_clock_cpu+0xc/0x170
[ 4512.666885] ? irqtime_account_irq+0x2b/0xa0
[ 4512.666888] do_syscall_64+0x68/0x130
[ 4512.666892] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 4512.666896] RIP: 0033:0x7f1afe9257e2
To prevent starvation of both the pcp and zone locks, batch the freeing of
pages in chunks of pcp->batch.

Because free_pcppages_bulk is called with both the pcp and zone locks held,
relinquishing and reacquiring the locks is only effective when both of them
are dropped together. Thus, instead of modifying free_pcppages_bulk to break
both locks, batch the freeing from its callers instead, as sketched below.
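
The following is a minimal sketch of that caller-side pattern (illustrative
only, not the literal patch; the helper name drain_zone_pages_batched is
made up for this example):

static void drain_zone_pages_batched(struct zone *zone,
				     struct per_cpu_pages *pcp)
{
	int batch;

	do {
		spin_lock(&pcp->lock);
		/* Free at most one batch per lock hold. */
		batch = min(pcp->count, pcp->batch);
		if (batch)
			free_pcppages_bulk(zone, batch, pcp, 0);
		spin_unlock(&pcp->lock);
		/* Both locks are released here, so waiters can run. */
	} while (batch);
}

This bounds the time either lock is held to the work of freeing one batch,
at the cost of re-taking the locks once per batch.
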
In our fleet, we have seen that this batched freeing leads to significantly
lower rates of softlockups, while incurring only small performance regressions
(small relative to both the workload and to run-to-run variation).

The following are a few synthetic benchmarks, run on a machine with
250G RAM, 179G swap, and 176 CPUs.
stress-ng --vm 50 --vm-bytes 5G -M -t 100
+----------------------+---------------+----------+
| Metric | Variation (%) | Delta(%) |
+----------------------+---------------+----------+
| bogo ops | 0.0120 | -0.0011 |
| bogo ops/s (real) | 0.0109 | -0.0091 |
| bogo ops/s (usr+sys) | 0.5560 | +0.1049 |
+----------------------+---------------+----------+
stress-ng --vm 10 --vm-bytes 30G -M -t 100
+----------------------+---------------+----------+
| Metric | Variation (%) | Delta(%) |
+----------------------+---------------+----------+
| bogo ops | 1.8530 | +0.4728 |
| bogo ops/s (real) | 1.8604 | +0.2029 |
| bogo ops/s (usr+sys) | 1.6054 | -0.6381 |
+----------------------+---------------+----------+
Patch 1 simplifies the return semantics of decay_pcp_high and
refresh_cpu_vm_stats, which makes the change in patch 3 more semantically
accurate.

Patches 2, 3, and 4 each address one caller of free_pcppages_bulk, ensuring
that large values passed to it are freed in batches.
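
For a concrete flavor of those changes, here is a hypothetical, simplified
sketch of the batching in a decay_pcp_high-style path (the watermark decay
itself and the return value reworked by patch 1 are omitted):

static void decay_pcp_high_sketch(struct zone *zone,
				  struct per_cpu_pages *pcp)
{
	/* Overshoot above the high watermark, drained in chunks. */
	int to_drain = pcp->count - pcp->high;

	while (to_drain > 0) {
		int count = min(to_drain, pcp->batch);

		spin_lock(&pcp->lock);
		free_pcppages_bulk(zone, count, pcp, 0);
		spin_unlock(&pcp->lock);
		to_drain -= count;
	}
}
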
This series is a follow-up to [2], where I attempted to solve the same problem
by relinquishing only the zone lock within free_pcppages_bulk. Because the
approach taken here is different in nature, I decided not to send this as a
v2, but as a separate series altogether.
[1] For instance, during *just* the boot of said large machine, there were
2092 instances of free_pcppages_bulk being called with count > 1000.
[2] https://lore.kernel.org/all/20250818185804.21044-1-joshua.hahnjy@gmail.com/
Joshua Hahn (4):
  mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  mm/page_alloc: Perform appropriate batching in drain_pages_zone
  mm/page_alloc: Batch page freeing in decay_pcp_high
  mm/page_alloc: Batch page freeing in free_frozen_page_commit

 include/linux/gfp.h |  2 +-
 mm/page_alloc.c     | 65 ++++++++++++++++++++++++++++++++-------------
 mm/vmstat.c         | 26 +++++++++---------
 3 files changed, 61 insertions(+), 32 deletions(-)
base-commit: 097a6c336d0080725c626fda118ecfec448acd0f
--
2.47.3