Message-ID: <20250925184446.200563-1-joshua.hahnjy@gmail.com>
Date: Thu, 25 Sep 2025 11:44:46 -0700
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: "Christoph Lameter (Ampere)" <cl@...two.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>,
Chris Mason <clm@...com>,
Kiryl Shutsemau <kirill@...temov.name>,
Brendan Jackman <jackmanb@...gle.com>,
Michal Hocko <mhocko@...e.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Zi Yan <ziy@...dia.com>,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
kernel-team@...a.com
Subject: Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
On Wed, 24 Sep 2025 16:09:47 -0700 (PDT) "Christoph Lameter (Ampere)" <cl@...two.org> wrote:
> On Wed, 24 Sep 2025, Joshua Hahn wrote:
>
> > drain_pages_zone completely drains a zone of its pcp free pages by
> > repeatedly calling free_pcppages_bulk until pcp->count reaches 0.
> > In this loop, it already performs batched calls to ensure that
> > free_pcppages_bulk isn't called to free too many pages at once, and
> > relinquishes & reacquires the lock between each call to prevent
> > lock starvation from other processes.
Hello Christoph,
Thank you for your thoughtful review.
> drain_pages_zone() operates on a lock in a percpu area. The lock is
> specific to a cpu and should not be contended in normal operatons unless
> there is significant remote access to the per cpu queues.
When you say "the lock", I imagine you are referring to the pcp lock.
However, there is one more lock that we have to think about: the zone lock,
which drain_pages_zone-->free_pcppages_bulk takes and releases. Perhaps I can
be more explicit about this in the commit message.
Actually, in the original version of this patch [1] the goal was to relieve
contention on the zone lock, but since it is a nested lock taken under the
pcp lock, both locks have to be released in order for the task to be
rescheduled and for other processes to come in and grab the zone lock.
So from what I can understand, it seems like there truly is lock contention,
just not on the pcp lock. Please let me know if I have an incorrect
understanding of the code here.
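For reference, here is a rough sketch of the structure I am describing. This
is paraphrased from my reading of mm/page_alloc.c, with the batch limit and
some details simplified, so please treat it as illustrative rather than the
exact upstream code:

	static void drain_pages_zone(unsigned int cpu, struct zone *zone)
	{
		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
		int count;

		do {
			spin_lock(&pcp->lock);
			count = pcp->count;
			if (count) {
				/* Cap how much a single call may free (~2048 today) */
				int to_drain = min(count, 2048);

				/*
				 * free_pcppages_bulk() takes zone->lock
				 * internally, nested under pcp->lock.
				 */
				free_pcppages_bulk(zone, to_drain, pcp, 0);
				count -= to_drain;
			}
			/* Drop pcp->lock between batches */
			spin_unlock(&pcp->lock);
		} while (count);
	}

Since zone->lock is only taken in this path while pcp->lock is held, this is
the point above: both locks need to be released between batches before the
draining task can be rescheduled and other tasks can grab the zone lock.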
> This seems to be then from __drain_all_pages() running on multiple cpus
> frequently. There is no point in concurrently draining the per cpu pages
> of all processors from multiple remote processors and we have a
> pcpu_drain_mutex to prevent that from happening.
Agreed.
> So we need an explanation as to why there is such high contention on the
> lock first before changing the logic here.
>
> The current logic seems to be designed to prevent the lock contention you
> are seeing.
This is true, but my concern was mostly with the value used for the batching
(2048 seems too high). But as I explain below, the min(2048, count) operation
seems to be a no-op anyway, since drain_pages_zone never ends up calling
free_pcppages_bulk with count > 1000 (at least in the benchmarks that I was
running, on my machine).
To show where the contention comes from, i.e. which callers try to free too
many pages at once, I built the upstream kernel and collected a histogram of
calls to free_pcppages_bulk with count > 1000, broken down by caller.
On the left is the value of count passed to the function, and on the right is
the frequency. This is on a machine with 250G memory, 179G swap, and 176 CPUs,
running a memory-intensive task for ~12 minutes.
free_frozen_page_commit
-----------------------
1000-1250| * (3732)
1250-1500| * (3329)
1500-1750| * (3102)
1750-2000| * (2878)
2000-2250| ********** (28456)
decay_pcp_high
--------------
1000-1250| ********** (837)
1250-1500| ****** (540)
1500-1750| **** (337)
1750-2000| ** (249)
2000-2250| ****** (547)
So it seems that drain_pages_zone does not even call free_pcppages_bulk with
a high count. In fact, free_frozen_page_commit is truly the biggest offender
here.
When I was writing this patch, I did so for completeness, to ensure that
every caller of free_pcppages_bulk has a limit on how many pages it can try
to free at once; a rough illustration of what I had in mind is below.
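Concretely, the kind of clamp I had in mind at a free_pcppages_bulk call site
looks roughly like this. It is only a sketch, not the exact patch; 'cap',
'to_free', and 'pindex' are placeholders for whatever the particular call
site already has:

	/*
	 * Hypothetical cap at a free_pcppages_bulk() call site: never hand
	 * more than 'cap' pages (on the order of 2048, per the discussion
	 * above) to a single call.
	 */
	while (to_free) {
		int todo = min(to_free, cap);

		free_pcppages_bulk(zone, todo, pcp, pindex);
		to_free -= todo;
	}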
I am happy to drop this patch from v3 if you feel that it is an inappropriate
change. From the numbers above, it doesn't seem like it will have much impact
on performance anyway (at least on my machine, with the same benchmark;
perhaps bigger machines or more memory-intensive benchmarks would trigger
larger drains?).
Thank you again for your review, and please feel free to correct me on anything
that I may be understanding incorrectly. Have a great day!
Joshua
[1] https://lore.kernel.org/all/20250818185804.21044-1-joshua.hahnjy@gmail.com/