Message-ID: <20251001153717.2379348-1-joshua.hahnjy@gmail.com>
Date: Wed,  1 Oct 2025 08:37:16 -0700
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: Hillf Danton <hdanton@...a.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	kernel-team@...a.com
Subject: Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone

Hello Hillf, thanks for your continued interest in this series!

> > Hello Hillf,
> > 
> > Thank you for your feedback!
> > 
> > > Feel free to make it clear which lock is contended, pcp->lock or
> > > zone->lock, or both, to help understand the starvation.
> > 
> > Sorry for the late reply. I took some time to run some more tests and
> > gather numbers so that I could give an accurate representation of what
> > I was seeing in these systems.
> > 
> > Running perf lock con -abl on my system while compiling the kernel,
> > I see that the biggest lock contention comes from free_pcppages_bulk
> > and __rmqueue_pcplist on the upstream kernel (ignoring lock contention
> > on lruvec, which is actually the biggest offender on these systems;
> > this will hopefully be addressed some time in the future as well).
> > 
> > Looking deeper into where they are waiting on the lock, I found that they
> > are both waiting for the zone->lock (not the pcp lock, even for
> 
> One of the hottest locks again plays its role.

Indeed...
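
To make the batching idea concrete for anyone following along, here is a
toy userspace sketch of the pattern (my own illustration, not the actual
patch or kernel code; a pthread mutex stands in for the contended
zone->lock, and NPAGES, BATCH, drain_all_at_once() and drain_batched()
are made-up names). Draining everything under a single lock acquisition
lets one CPU monopolize the lock for the whole drain, while draining in
fixed-size batches and dropping the lock between batches bounds each
hold time:

#include <pthread.h>
#include <stdio.h>

#define NPAGES 10000
#define BATCH  63   /* stand-in for pcp->batch */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for zone->lock */
static int pcp_count = NPAGES;                           /* stand-in for pcp->count */

/* One long critical section: the lock is held for the entire drain. */
static void drain_all_at_once(void)
{
	pthread_mutex_lock(&lock);
	while (pcp_count > 0)
		pcp_count--;                     /* "free one page" */
	pthread_mutex_unlock(&lock);
}

/* Batched drain: bounded hold time, lock released between batches. */
static void drain_batched(void)
{
	int to_drain;

	do {
		pthread_mutex_lock(&lock);
		to_drain = pcp_count < BATCH ? pcp_count : BATCH;
		pcp_count -= to_drain;           /* "free a batch of pages" */
		pthread_mutex_unlock(&lock);
		/* other CPUs/threads can take the lock here */
	} while (to_drain > 0);
}

int main(void)
{
	drain_all_at_once();
	pcp_count = NPAGES;
	drain_batched();
	printf("remaining after batched drain: %d\n", pcp_count);
	return 0;
}

The tradeoff, of course, is more lock/unlock round trips per drain, which
is why the choice of batch size matters.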

> > __rmqueue_pcplist). I'll add this detail in v3 so that it is clearer
> > for the reader. I'll also emphasize why we still need to break up the
> > pcp lock hold, since this was something that wasn't immediately obvious to me.
> > 
> > > If the zone lock is hot, why did NUMA nodes fail to mitigate the contention,
> > > given workloads tested with high sustained memory pressure on large machines
> > > in the Meta fleet (1 TB memory, 316 CPUs)?
> > 
> > This is a good question. On this system, I've configured the machine to only
> > use 1 node/zone, so there is no ability to migrate the contention. Perhaps
> 
> Thanks, we know why the zone lock is hot: 300+ CPUs potentially contending for one lock.
> The single node/zone config may explain why no similar reports from large
> systems (1 TB memory, 316 CPUs) emerged a couple of years back, given
> that NUMA machines are nothing new on the market.
> 
> > another approach to this problem would be to encourage the user to
> > configure the system such that each NUMA node does not exceed N GB of memory?
> > 
> > But if so -- how many GB per node is too much? It seems like there would be
> > some sweet spot where the overhead required to maintain many nodes
> > balances against the benefits of splitting the system into
> > multiple nodes. What do you think? Personally, I think that this patchset
> > (not this patch, since it will be dropped in v3) still provides value by
> > preventing any one CPU from monopolizing the zone lock, even in a system
> > where memory is spread out across more nodes.
> > 
> > > Can the contention be observed with tight memory pressure, but not extremely tight?
> > > If not, it is due to a misconfiguration in user space, no?
> > 
> > I'm not sure I entirely follow what you mean here, but are you asking
> > whether this is a userspace issue for running a workload that isn't
> > properly sized for the system? Perhaps that could be the case, but I think
> 
> This is not complicated. Take another look at the system from another
> POV: what would you say if the same workload ran on the same
> system, but with RAM cut down to 1 GB? If, roughly, a dentist is fully
> loaded serving two patients well a day, overloading the professional
> makes no sense, I think.
> 
> In short, given that the zone lock is hot by nature, a soft lockup with a
> reproducer hints at misconfiguration.

While I definitely agree that spreading out 1TB across multiple NUMA nodes
is an option that should be considered, I am unsure if it makes sense to
dismiss this issue as simply a misconfiguration problem.

The reality is that these machines do exist, and we see zone lock contention
on them. You can also see that I ran performance evaluation tests
on relatively smaller machines (250 GB) and saw some performance gains.

The other point I wanted to mention is that simply adding more NUMA
nodes is not always strictly beneficial: it changes how the scheduler
has to work, workloads would require more NUMA-aware tuning, and so on.

Thanks for your feedback, Hillf. I hope you have a great day!
Joshua
