Message-ID: <20250930144240.2326093-1-joshua.hahnjy@gmail.com>
Date: Tue, 30 Sep 2025 07:42:40 -0700
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: Hillf Danton <hdanton@...a.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	kernel-team@...a.com
Subject: Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone

On Sat, 27 Sep 2025 08:46:15 +0800 Hillf Danton <hdanton@...a.com> wrote:

> On Wed, 24 Sep 2025 13:44:06 -0700 Joshua Hahn wrote:
> > drain_pages_zone completely drains a zone of its pcp free pages by
> > repeatedly calling free_pcppages_bulk until pcp->count reaches 0.
> > In this loop, it already performs batched calls to ensure that
> > free_pcppages_bulk isn't called to free too many pages at once, and
> > relinquishes & reacquires the lock between each call to prevent
> > lock starvation from other processes.
> > 
> > However, the current batching does not prevent lock starvation. The
> > current implementation creates batches of
> > pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX, which has been seen in
> > Meta workloads to be up to 64 << 5 == 2048 pages.
> > 
> > While it is true that CONFIG_PCP_BATCH_SCALE_MAX is a config and
> > indeed can be adjusted by the system admin to be any number from
> > 0 to 6, its default value of 5 is still too high to be reasonable for
> > any system.
> > 
> > Instead, let's create batches of pcp->batch pages, which gives a more
> > reasonable 64 pages per call to free_pcppages_bulk. This gives other
> > processes a chance to grab the lock and prevents starvation. Each

Hello Hillf,

Thank you for your feedback!

> Feel free to make it clear, which lock is contended, pcp->lock or
> zone->lock, or both, to help understand the starvation.

Sorry for the late reply. I took some time to run some more tests and
gather numbers so that I could give an accurate representation of what
I was seeing in these systems.

Running perf lock con -abl on my system while compiling the kernel, I
see that the biggest lock contention comes from free_pcppages_bulk and
__rmqueue_pcplist on the upstream kernel (ignoring contention on the
lruvec lock, which is actually the biggest offender on these systems
and will hopefully be addressed in the future as well).

Looking deeper into where they are waiting, I found that both are
waiting on the zone->lock (not the pcp lock, even for
__rmqueue_pcplist). I'll add this detail in v3 so that it is clearer
for the reader. I'll also emphasize why we still need to periodically
drop the pcp lock, since this wasn't immediately obvious to me.
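
For concreteness, here is a rough sketch of the loop shape I'm
describing -- simplified and written from memory rather than copied
from the patch, so the field names may not match the tree exactly.
The point is that the pcp lock is dropped and retaken around each
batch so other CPUs get a turn, while free_pcppages_bulk() is what
ends up taking the zone->lock internally:

/* Sketch only: simplified drain_pages_zone(), not the literal patch. */
static void drain_pages_zone(unsigned int cpu, struct zone *zone)
{
	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
	int count;

	do {
		spin_lock(&pcp->lock);
		count = pcp->count;
		if (count) {
			/*
			 * This patch: batch on pcp->batch (~64 pages)
			 * instead of pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX
			 * (up to 2048 pages).
			 */
			int to_drain = min(count, pcp->batch);

			/* Takes zone->lock internally to free the pages. */
			free_pcppages_bulk(zone, to_drain, pcp, 0);
			count -= to_drain;
		}
		/* Let other CPUs grab the pcp lock between batches. */
		spin_unlock(&pcp->lock);
	} while (count);
}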

> If the zone lock is hot, why did numa node fail to mitigate the contention,
> given workloads tested with high sustained memory pressure on large machines
> in the Meta fleet (1Tb memory, 316 CPUs)?

This is a good question. On this system, I've configured the machine
to use only one node/zone, so there is no way to spread the contention
across nodes. Perhaps another approach to this problem would be to
encourage users to configure their systems such that each NUMA node
does not exceed N GB of memory?

But if so -- how many GB per node is too much? It seems like there
would be some sweet spot where the overhead of maintaining many nodes
cancels out the benefit of splitting the system into multiple nodes.
What do you think? Personally, I think this patchset (not this patch,
since it will be dropped in v3) still provides value by preventing any
single task from monopolizing the zone lock, even on a system where
memory is spread across more nodes.

> Can the contention be observed with tight memory pressure but not highly tight?
> If not, it is due to misconfiguration in the user space, no?

I'm not sure I entirely follow what you mean here, but are you asking
whether this is a userspace issue, i.e. a workload that isn't properly
sized for the system? Perhaps that could be the case, but I think it
would not be bad for the kernel to protect against workloads that can
cause the system to stall, especially since the numbers show neutral
to positive gains from this patch. But of course I am very biased on
this :-) so I'm happy to hear other opinions on the matter.

Thanks for your questions. I hope you have a great day!
Joshua
