linux-kernel - [PATCH v2 0/4] mm/page_alloc: Batch callers of free_pcppages

Message-ID: <20250924204409.1706524-1-joshua.hahnjy@gmail.com>
Date: Wed, 24 Sep 2025 13:44:04 -0700
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>
Cc: Chris Mason <clm@...com>,
	Kiryl Shutsemau <kirill@...temov.name>,
	"Liam R. Howlett" <Liam.Howlett@...cle.com>,
	Brendan Jackman <jackmanb@...gle.com>,
	David Hildenbrand <david@...hat.com>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	Michal Hocko <mhocko@...e.com>,
	Mike Rapoport <rppt@...nel.org>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Zi Yan <ziy@...dia.com>,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org
Subject: [PATCH v2 0/4] mm/page_alloc: Batch callers of free_pcppages_bulk

Motivation & Approach
=====================

While testing workloads with high sustained memory pressure on large machines
in the Meta fleet (1TB memory, 316 CPUs), we saw an unexpectedly high number
of softlockups. Further investigation showed that the lock in
free_pcppages_bulk was being held for a long time, and that the function was
called to free 2k+ pages over 100 times during boot alone.

This starves other processes of both the pcp and zone locks, which can lead
to the system stalling because multiple threads cannot make progress without
the locks. We can see these issues manifesting as warnings:

[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 	20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu: 	         hardirqs   softirqs   csw/system
[ 4512.638793] rcu: 	 number:        0        145            0
[ 4512.651177] rcu: 	cputime:       30      10410          174   ==> 10558(ms)
[ 4512.666657] rcu: 	(t=21077 jiffies g=783665 q=1242213 ncpus=316)

While these warnings are benign, they do point to the underlying issue of
lock contention. To prevent starvation in both locks, batch the freeing of
pages using pcp->batch.

Because free_pcppages_bulk is called with both the pcp and zone locks held,
relinquishing and reacquiring the locks is only effective when both of them
are broken together (unless the system was built with queued spinlocks).
Thus, instead of modifying free_pcppages_bulk to break both locks, batch the
freeing from its callers.
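
To illustrate the idea of caller-side batching, here is a minimal userspace
sketch. It is not the code from this series: every model_* name is
hypothetical, a single flag stands in for the pcp and zone locks being
dropped together, and the batch value used in the example is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a zone; "locked" models both locks at once. */
struct model_zone {
	int locked;                /* 1 while the modelled locks are held */
	size_t free_pages;         /* pages handed back to the zone */
	size_t lock_acquisitions;  /* how many times the locks were taken */
	size_t max_pages_per_hold; /* largest free done under one hold */
};

/* Hypothetical stand-in for a per-cpu page list. */
struct model_pcp {
	size_t count; /* pages currently on the list */
	size_t batch; /* models pcp->batch */
};

/* Models a bulk free: releases "count" pages under one lock hold. */
static void model_free_bulk(struct model_zone *zone, struct model_pcp *pcp,
			    size_t count)
{
	if (count > pcp->count)
		count = pcp->count;

	assert(!zone->locked); /* nobody else may hold the locks */
	zone->locked = 1;
	zone->lock_acquisitions++;
	if (count > zone->max_pages_per_hold)
		zone->max_pages_per_hold = count;

	pcp->count -= count;
	zone->free_pages += count;

	zone->locked = 0; /* other lock waiters can get in here */
}

/* Caller-side batching: never free more than one batch per lock hold. */
static void model_drain_batched(struct model_zone *zone,
				struct model_pcp *pcp, size_t to_free)
{
	size_t step = pcp->batch ? pcp->batch : 1;

	if (to_free > pcp->count)
		to_free = pcp->count;

	while (to_free > 0) {
		size_t n = to_free < step ? to_free : step;

		model_free_bulk(zone, pcp, n);
		to_free -= n;
	}
}
```

In this model, draining 2048 pages with a batch of 63 takes 33 short lock
holds of at most 63 pages each, rather than one hold covering all 2048
pages. The total work is unchanged; only the length of each hold shrinks.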

A similar fix has been implemented in the Meta fleet, and we have seen
significantly fewer softlockups.

Testing
=======
The following are a few synthetic benchmarks, made on a machine with
250G RAM, 179G swap, and 176 CPUs.

stress-ng --vm 50 --vm-bytes 5G -M -t 100
+----------------------+---------------+----------+
|        Metric        | Variation (%) | Delta(%) |
+----------------------+---------------+----------+
| bogo ops             |        0.0216 |  -0.0172 |
| bogo ops/s (real)    |        0.0223 |  -0.0163 |
| bogo ops/s (usr+sys) |        1.3433 |  +1.0769 |
+----------------------+---------------+----------+

stress-ng --vm 10 --vm-bytes 30G -M -t 100
+----------------------+---------------+----------+
|        Metric        | Variation (%) | Delta(%) |
+----------------------+---------------+----------+
| bogo ops             |        2.1736 |  +4.8535 |
| bogo ops/s (real)    |        2.2689 |  +5.1719 |
| bogo ops/s (usr+sys) |        2.1283 |  +0.6587 |
+----------------------+---------------+----------+

Depending on the workload, this series seems to either improve performance
or stay neutral. I believe this has to do with how much lock contention there
is, and how many free_pcppages_bulk calls were previously being made with
high counts.

The difference between bogo ops/s (real) and (usr+sys) seems to indicate that
there is a meaningful difference in the amount of time threads spend blocked
on getting either the pcp or zone lock.

Changelog
=========
v1 --> v2:
- Reworded cover letter to be more explicit about what kinds of issues
  running processes might face as a result of the existing lock starvation
- Reworded cover letter to be in sections to make it easier to read
- Fixed patch 4/4 to properly store & restore UP flags.
- Re-ran tests, updated the testing results and interpretation

Joshua Hahn (4):
  mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  mm/page_alloc: Perform appropriate batching in drain_pages_zone
  mm/page_alloc: Batch page freeing in decay_pcp_high
  mm/page_alloc: Batch page freeing in free_frozen_page_commit

 include/linux/gfp.h |  2 +-
 mm/page_alloc.c     | 67 ++++++++++++++++++++++++++++++++-------------
 mm/vmstat.c         | 26 +++++++++---------
 3 files changed, 62 insertions(+), 33 deletions(-)

base-commit: 097a6c336d0080725c626fda118ecfec448acd0f
-- 
2.47.3