Message-ID: <20251014134208.2826738-1-joshua.hahnjy@gmail.com>
Date: Tue, 14 Oct 2025 06:42:08 -0700
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: Hillf Danton <hdanton@...a.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Vlastimil Babka <vbabka@...e.cz>,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	kernel-team@...a.com
Subject: Re: [PATCH v4 0/3] mm/page_alloc: Batch callers of free_pcppages_bulk

On Tue, 14 Oct 2025 19:29:45 +0800 Hillf Danton <hdanton@...a.com> wrote:

> On Mon, 13 Oct 2025 12:08:08 -0700 Joshua Hahn wrote:
> > Motivation & Approach
> > =====================
> > 
> > While testing workloads with high sustained memory pressure on large machines
> > in the Meta fleet (1TB memory, 316 CPUs), we saw an unexpectedly high number
> > of softlockups. Further investigation showed that the zone lock in
> > free_pcppages_bulk was being held for a long time, and that it was called to
> > free 2k+ pages over 100 times just during boot.
> > 
> > This starves other processes waiting on the zone lock, which can lead to the
> > system stalling as multiple threads cannot make progress without the locks.
> > We can see these issues manifest as warnings:
> > 
> > [ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
> > [ 4512.604370] rcu:     20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
> > [ 4512.626401] rcu:              hardirqs   softirqs   csw/system
> > [ 4512.638793] rcu:      number:        0        145            0
> > [ 4512.651177] rcu:     cputime:       30      10410          174   ==> 10558(ms)
> > [ 4512.666657] rcu:     (t=21077 jiffies g=783665 q=1242213 ncpus=316)

Hello Hillf, thank you for your review.

> > While these warnings are benign, they do point to the underlying issue of
> 
> No fix is needed if it is benign.

Maybe this is poor wording on my part. What I mean to say is that these
warning messages can help us see that the system is trending negatively,
even though the warnings themselves may not indicate that something has
crashed or broken completely.

> > lock contention. To prevent starvation on both locks, batch the freeing of
> > pages using pcp->batch.
> > 
> > Because free_pcppages_bulk is called with the pcp lock held and acquires the
> > zone lock, relinquishing and reacquiring the locks is only effective when both
> > of them are broken together (unless the system was built with queued
> > spinlocks). Thus, rather than modifying free_pcppages_bulk to break both
> > locks, batch the freeing from its callers instead.
> > 
> > A similar fix has been implemented in the Meta fleet, and we have seen
> > significantly fewer softlockups.
> > 
> Fine, softlockup is not cured.
> 
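To illustrate the approach concretely, here is a minimal sketch of the
caller-side batching pattern, using drain_pages_zone() as the example caller.
This is a sketch of the idea only, not the literal patch; the exact batch
size and loop structure in the series may differ:

	/*
	 * Sketch: drain the pcp list in pcp->batch sized chunks,
	 * dropping the pcp lock between chunks. free_pcppages_bulk()
	 * takes and releases the zone lock internally, so both locks
	 * are released between batches and waiters on either lock
	 * can make progress.
	 */
	static void drain_pages_zone(unsigned int cpu, struct zone *zone)
	{
		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
		int count;

		do {
			spin_lock(&pcp->lock);
			count = pcp->count;
			if (count) {
				int to_drain = min(count, pcp->batch);

				free_pcppages_bulk(zone, to_drain, pcp, 0);
				count -= to_drain;
			}
			spin_unlock(&pcp->lock);
		} while (count);
	}
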
> > Testing
> > =======
> > The following are a few synthetic benchmarks, run on three machines. The
> > first is a large machine with 754GiB memory and 316 processors. The second
> > is a relatively smaller machine with 251GiB memory and 176 processors. The
> > third is the smallest of the three, with 62GiB memory and 36 processors.
> > 
> > On all machines, I kick off a kernel build with -j$(nproc).
> > Negative delta is better (faster compilation).
> > 
> > Large machine (754GiB memory, 316 processors)
> > make -j$(nproc)
> > +------------+---------------+-----------+
> > | Metric (s) | Variation (%) | Delta(%)  |
> > +------------+---------------+-----------+
> > | real       |        0.8070 |  - 1.4865 |
> > | user       |        0.2823 |  + 0.4081 |
> > | sys        |        5.0267 |  -11.8737 |
> > +------------+---------------+-----------+
> > 
> > Medium machine (251GiB memory, 176 processors)
> > make -j$(nproc)
> > +------------+---------------+----------+
> > | Metric (s) | Variation (%) | Delta(%) |
> > +------------+---------------+----------+
> > | real       |        0.2806 |  +0.0351 |
> > | user       |        0.0994 |  +0.3170 |
> > | sys        |        0.6229 |  -0.6277 |
> > +------------+---------------+----------+
> > 
> > Small machine (62GiB memory, 36 processors)
> > make -j$(nproc)
> > +------------+---------------+----------+
> > | Metric (s) | Variation (%) | Delta(%) |
> > +------------+---------------+----------+
> > | real       |        0.1503 |  -2.6585 |
> > | user       |        0.0431 |  -2.2984 |
> > | sys        |        0.1870 |  -3.2013 |
> > +------------+---------------+----------+
> > 
> > Here, variation is the coefficient of variation, i.e. standard deviation / mean.
> > 
> > Based on these results, it seems like the degree to which this reduces lock
> > contention varies. For the largest and smallest machines that I ran the
> > tests on, there is quite a significant reduction. There are also some
> > performance increases visible from userspace.
> > 
> > Interestingly, the performance gains don't scale with the size of the
> > machine; rather, there is a dip in the gains for the medium-sized machine.
> >
> Explaining the dip helps land this work in the next tree.

I do agree that I left this on a bit of a cliffhanger. I am also unsure why
this behavior occurs, although I have a theory. Going back to why we see zone
lock contention in the first place, I think it has to do with each machine's
memory-to-processor ratio.

The lower the memory:processor ratio, the less zone lock contention there
seems to be to begin with. If we rank these machines by their mem:proc ratio:

Large machine : 2.38
Small machine : 1.72
Medium machine: 1.42
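
(Here mem:proc is GiB of memory divided by processor count: 754/316 ≈ 2.38,
62/36 ≈ 1.72, 251/176 ≈ 1.42.)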

This is also the order in which we see the gains. I think this explanation
makes sense -- the more memory we have, the more memory each pcp will hold,
and the longer free_pcppages_bulk would have taken before this change (and
vice versa). This is the case, at least for my setup, where each machine's
memory is onlined in one node (zone), so the pcp watermarks really do scale
with the size of the system.
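
As a rough illustration of that last point, here is a paraphrase of how the
pcp batch size is derived from the zone size (based on zone_batchsize() in
mm/page_alloc.c; the exact formula, and the sizing of pcp->high, vary across
kernel versions):

	/*
	 * Paraphrase of zone_batchsize(), not verbatim kernel code:
	 * the batch scales with the zone, so a machine whose memory
	 * is onlined into one large zone gets larger pcp limits.
	 * pcp->high is likewise derived from the zone size, divided
	 * among the CPUs.
	 */
	int batch = min(zone_managed_pages(zone) >> 10,	/* ~0.1% of the zone */
			SZ_1M / PAGE_SIZE);		/* capped at 1MB of pages */

	batch /= 4;
	if (batch < 1)
		batch = 1;
	batch = rounddown_pow_of_two(batch + batch / 2) - 1;	/* clamp to 2^n - 1 */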

I didn't want to include this in the cover letter because it is purely
untested conjecture.

I hope this helps!
Joshua
