Message-ID: <87ms60wzni.fsf@DESKTOP-5N7EMDA>
Date: Thu, 09 Oct 2025 10:57:05 +0800
From: "Huang, Ying" <ying.huang@...ux.alibaba.com>
To: Joshua Hahn <joshua.hahnjy@...il.com>
Cc: Dave Hansen <dave.hansen@...el.com>, Andrew Morton
<akpm@...ux-foundation.org>, Brendan Jackman <jackmanb@...gle.com>,
Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...e.com>,
Suren Baghdasaryan <surenb@...gle.com>, Vlastimil Babka
<vbabka@...e.cz>, Zi Yan <ziy@...dia.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [RFC] [PATCH] mm/page_alloc: pcp->batch tuning

Hi, Joshua,

Joshua Hahn <joshua.hahnjy@...il.com> writes:
> On Wed, 8 Oct 2025 08:34:21 -0700 Dave Hansen <dave.hansen@...el.com> wrote:
>
> Hello Dave, thank you for your feedback!
>
>> First of all, I do agree that the comment should go away or get fixed up.
>>
>> But...
>>
>> On 10/6/25 07:54, Joshua Hahn wrote:
>> > This leaves us with a /= 4 with no corresponding *= 4 anywhere, which
>> > leaves pcp->batch mistuned from the original intent when it was
>> > introduced. This is made worse by the fact that pcp lists are generally
>> > larger today than they were in 2013, meaning batch sizes should have
>> > increased, not decreased.
>>
>> pcp->batch and pcp->high do very different things. pcp->high is a limit
>> on the amount of memory that can be tied up. pcp->batch balances
>> throughput with latency. I'm not sure I buy the idea that a higher
>> pcp->high means we should necessarily do larger batches.
>
> I agree with your observation that a higher pcp->high doesn't mean we should
> do larger batches. I think what I was trying to get at here was that if
> pcp lists are bigger, some other values might want to scale.
>
> For instance, in nr_pcp_free, pcp->batch is used to determine how many
> pages should be left in the pcplist (and the rest be freed). Should this
> value scale with a bigger pcp? (This is not a rhetorical question, I really
> do want to understand what the implications are here).
>
> Another thing that I would like to note is that pcp->high is actually at
> least in part a function of pcp->batch. In decay_pcp_high, we set
>
> pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX), ...)
>
> So here, it seems like a higher batch value would actually lead to a much
> lower pcp->high instead. This actually seems actively harmful to the system.

Batch here is used to control the latency of freeing pages from the PCP
to buddy.  A larger batch leads to larger latency, but it helps to
reduce the size of the PCP more quickly when it becomes idle.  So, we
need to strike a balance here.
> So I'll do a take two of this patch and take your advice below: instead
> of getting rid of the /= 4, just fold it in (or add a better explanation
> of why we do it). Another candidate place for this seems to be where we
> do the rounddown_pow_of_two.
>
>> So I dunno... If someone wanted to alter the initial batch size, they'd
>> ideally repeat some of Ying's experiments from: 52166607ecc9 ("mm:
>> restrict the pcp batch scale factor to avoid too long latency").
>
> I ran a few very naive and quick tests on kernel builds, and it seems like
> for larger machines (1TB memory, 316 processors), this leads to a very
> significant speedup in system time during a kernel compilation (~10%).
>
> But for smaller machines (250G memory, 176 processors) and (62G memory and 36
> processors), this leads to quite a regression (~5%).
>
> So maybe the answer is that this should actually be defined by the machine's
> size. In zone_batchsize, we set the value of the batch to:
>
> min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE)
>
> But maybe it makes sense to let this value grow bigger for larger machines? If
> anything, I think that the experiment results above do show that batch size does
> have an impact on the performance, and the effect can either be positive or
> negative based on the machine's size. I can run some more experiments to
> see if there's an opportunity to better tune pcp->batch.

In fact, we already have some mechanism to scale the batch size
dynamically, via pcp->alloc_factor and pcp->free_count.  You could tune
them further.  Per my understanding, it should be a balance between
throughput and latency.
>> Better yet, just absorb the /=4 into the two existing batch assignments.
>> It will probably compile to exactly the same code and have no functional
>> changes and get rid of the comment.
>>
>> Wouldn't this compile to the same thing?
>>
>> batch = zone->managed_pages / 4096;
>> if (batch * PAGE_SIZE > 128 * 1024)
>> batch = (128 * 1024) / PAGE_SIZE;
>
> But for now, this seems good to me. I'll get rid of the confusing comment,
> try to fold in the batch value, and leave a new comment with this
> explanation.
>
> Thank you for your thoughtful review, Dave. I hope you have a great day!
> Joshua
---
Best Regards,
Huang, Ying