Message-ID: <20251014112946.8581-1-hdanton@sina.com>
Date: Tue, 14 Oct 2025 19:29:45 +0800
From: Hillf Danton <hdanton@...a.com>
To: Joshua Hahn <joshua.hahnjy@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>,
Vlastimil Babka <vbabka@...e.cz>,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
kernel-team@...a.com
Subject: Re: [PATCH v4 0/3] mm/page_alloc: Batch callers of free_pcppages_bulk
On Mon, 13 Oct 2025 12:08:08 -0700 Joshua Hahn wrote:
> Motivation & Approach
> =====================
>
> While testing workloads with high sustained memory pressure on large machines
> in the Meta fleet (1TB memory, 316 CPUs), we saw an unexpectedly high number
> of softlockups. Further investigation showed that the zone lock in
> free_pcppages_bulk was being held for long periods, and that the function was
> called to free 2k+ pages more than 100 times during boot alone.
>
> This starves other processes of the zone lock, which can lead to the system
> stalling as multiple threads cannot make progress without it. These issues
> manifest as warnings:
>
> [ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
> [ 4512.626401] rcu:           hardirqs   softirqs   csw/system
> [ 4512.638793] rcu:   number:        0        145            0
> [ 4512.651177] rcu:  cputime:       30      10410          174   ==> 10558(ms)
> [ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)
>
> While these warnings are benign, they do point to the underlying issue of
No fix is needed if it is benign.
> lock contention. To prevent starvation on both locks (the pcp lock and the
> zone lock), batch the freeing of pages using pcp->batch.
>
> Because free_pcppages_bulk is called with the pcp lock held and acquires the
> zone lock internally, relinquishing and reacquiring is only effective when
> both locks are broken together (unless the system was built with queued
> spinlocks). Thus, instead of modifying free_pcppages_bulk to break both
> locks, batch the freeing from its callers.
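
For readers skimming the thread, a minimal sketch of the caller-side batching
pattern described above could look roughly like the code below. The helper
name drain_zone_batched() is hypothetical and purely illustrative; it assumes
it would live in mm/page_alloc.c next to the existing callers, uses the
current free_pcppages_bulk() signature and the pcp->batch/pcp->count fields,
and the real series may cap or scale the chunk size differently.

	/*
	 * Illustrative only: free a zone's per-cpu pages in pcp->batch
	 * sized chunks, dropping the pcp lock between chunks so that other
	 * CPUs waiting on it (and, transitively, on the zone lock taken
	 * inside free_pcppages_bulk()) can make progress.
	 */
	static void drain_zone_batched(struct zone *zone,
				       struct per_cpu_pages *pcp)
	{
		int remaining;

		do {
			spin_lock(&pcp->lock);
			remaining = pcp->count;
			if (remaining) {
				int to_free = min(remaining,
						  READ_ONCE(pcp->batch));

				/* Takes and drops zone->lock internally. */
				free_pcppages_bulk(zone, to_free, pcp, 0);
				remaining -= to_free;
			}
			spin_unlock(&pcp->lock);
			/* Both locks are free here, so waiters get a turn. */
		} while (remaining);
	}

The point is only that the lock break has to happen at the call sites, where
both the pcp lock and the zone lock can be released together.
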
>
> A similar fix has been implemented in the Meta fleet, and we have seen
> significantly fewer softlockups.
>
Fine, though the softlockups are reduced rather than cured.
> Testing
> =======
> The following are a few synthetic benchmarks, run on three machines. The
> first is a large machine with 754GiB of memory and 316 processors. The second
> is a relatively smaller machine with 251GiB of memory and 176 processors. The
> third and final machine is the smallest of the three, with 62GiB of memory
> and 36 processors.
>
> On all machines, I kick off a kernel build with -j$(nproc).
> Negative delta is better (faster compilation).
>
> Large machine (754GiB memory, 316 processors)
> make -j$(nproc)
> +------------+---------------+-----------+
> | Metric (s) | Variation (%) | Delta(%) |
> +------------+---------------+-----------+
> | real | 0.8070 | - 1.4865 |
> | user | 0.2823 | + 0.4081 |
> | sys | 5.0267 | -11.8737 |
> +------------+---------------+-----------+
>
> Medium machine (251GiB memory, 176 processors)
> make -j$(nproc)
> +------------+---------------+----------+
> | Metric (s) | Variation (%) | Delta(%) |
> +------------+---------------+----------+
> | real | 0.2806 | +0.0351 |
> | user | 0.0994 | +0.3170 |
> | sys | 0.6229 | -0.6277 |
> +------------+---------------+----------+
>
> Small machine (62GiB memory, 36 processors)
> make -j$(nproc)
> +------------+---------------+----------+
> | Metric (s) | Variation (%) | Delta(%) |
> +------------+---------------+----------+
> | real | 0.1503 | -2.6585 |
> | user | 0.0431 | -2.2984 |
> | sys | 0.1870 | -3.2013 |
> +------------+---------------+----------+
>
> Here, variation is the coefficient of variation, i.e. standard deviation /
> mean, expressed as a percentage.
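
As a made-up example of how to read that column (these are not the measured
numbers): if the mean sys time were 100 s with a standard deviation of 5 s,
the coefficient of variation would be 5 / 100 = 0.05, which would appear above
as 5.0000 in the Variation (%) column.
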
>
> Based on these results, the amount of lock contention this reduces seems to
> vary by machine. For the largest and smallest machines I tested, there is a
> fairly significant reduction in sys time. There are also some performance
> gains visible from userspace.
>
> Interestingly, the performance gains don't scale with the size of the
> machine; rather, there seems to be a dip in the gain for the medium-sized
> machine.
>
Explaining that dip would help land this work in the next tree.