Message-ID: <CAA1CXcDyTR64jdhZPae2HPYOwsUxU1R1tj1hMeE=vV_ey9GXsg@mail.gmail.com>
Date: Fri, 12 Sep 2025 18:28:55 -0600
From: Nico Pache <npache@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: David Hildenbrand <david@...hat.com>, Kiryl Shutsemau <kas@...nel.org>, linux-mm@...ck.org,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, ziy@...dia.com,
baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com, ryan.roberts@....com,
dev.jain@....com, corbet@....net, rostedt@...dmis.org, mhiramat@...nel.org,
mathieu.desnoyers@...icios.com, akpm@...ux-foundation.org, baohua@...nel.org,
willy@...radead.org, peterx@...hat.com, wangkefeng.wang@...wei.com,
usamaarif642@...il.com, sunnanyong@...wei.com, vishal.moola@...il.com,
thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
aarcange@...hat.com, raquini@...hat.com, anshuman.khandual@....com,
catalin.marinas@....com, tiwai@...e.de, will@...nel.org,
dave.hansen@...ux.intel.com, jack@...e.cz, cl@...two.org, jglisse@...gle.com,
surenb@...gle.com, zokeefe@...gle.com, hannes@...xchg.org,
rientjes@...gle.com, mhocko@...e.com, rdunlap@...radead.org, hughd@...gle.com,
richard.weiyang@...il.com, lance.yang@...ux.dev, vbabka@...e.cz,
rppt@...nel.org, jannh@...gle.com, pfalcato@...e.de
Subject: Re: [PATCH v11 00/15] khugepaged: mTHP support
On Fri, Sep 12, 2025 at 12:22 PM Lorenzo Stoakes
<lorenzo.stoakes@...cle.com> wrote:
>
> On Fri, Sep 12, 2025 at 07:53:22PM +0200, David Hildenbrand wrote:
> > On 12.09.25 17:51, Lorenzo Stoakes wrote:
> > > With all this stuff said, do we have an actual plan for what we intend to do
> > > _now_?
> >
> > Oh no, no I have to use my brain and it's Friday evening.
>
> I apologise :)
>
> >
> > >
> > > As Nico has implemented a basic solution here that we all seem to agree is not
> > > what we want.
> > >
> > > Without needing special new hardware or major reworks, what would this parameter
> > > look like?
> > >
> > > What would the heuristics be? What about the eagerness scales?
> > >
> > > I'm but a simple kernel developer,
> >
> > :)
> >
> > and interested in simple pragmatic stuff :)
> > > do you have a plan right now David?
> >
> > Ehm, if you ask me that way ...
> >
> > >
> > > Maybe we can start with something simple like a rough percentage per eagerness
> > > entry that then gets scaled based on utilisation?
> >
> > ... I think we should probably:
> >
> > 1) Start with something very simple for mTHP that doesn't lock us into any particular direction.
>
> Yes.
>
> >
> > 2) Add an "eagerness" parameter with fixed scale and use that for mTHP as well
>
> Yes I think we're all pretty onboard with that it seems!
>
> >
> > 3) Improve that "eagerness" algorithm using a dynamic scale or #whatever
>
> Right, I feel like we could start with some very simple linear thing here and
> later maybe refine it?
I agree, something like 0,32,64,128,255,511 seems to map well, and it
is not too different from what I'm doing with the scaling by
(HPAGE_PMD_ORDER - order).
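To make that mapping concrete, here is a rough sketch (illustrative
only; the helper name, the table and the 0..5 eagerness range are made
up for this email, not what the series actually adds):

/*
 * Illustrative sketch, not the actual khugepaged code. The constants
 * mirror the 4K page size case (HPAGE_PMD_ORDER = 9, HPAGE_PMD_NR = 512);
 * the helper name and the 0..5 eagerness range are hypothetical.
 */
#define HPAGE_PMD_ORDER	9
#define HPAGE_PMD_NR	(1 << HPAGE_PMD_ORDER)

/* eagerness level -> max_ptes_none-style budget at PMD order */
static const int eagerness_to_none[] = { 0, 32, 64, 128, 255, 511 };

static int max_none_for_order(int eagerness, int order)
{
	int pmd_none_max = eagerness_to_none[eagerness];

	/* scale the PMD-order budget down by (HPAGE_PMD_ORDER - order) */
	return pmd_none_max >> (HPAGE_PMD_ORDER - order);
}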
>
> >
> > 4) Solve world peace and world hunger
>
> Yes! That would be pretty great ;)
This should probably be a higher priority.
>
> >
> > 5) Connect it all to memory pressure / reclaim / shrinker / heuristics / hw hotness / #whatever
>
> I think these are TODOs :)
>
> >
> >
> > I maintain my initial position that just using
> >
> > max_ptes_none == 511 -> collapse mTHP always
> > max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero
> >
> > As a starting point is probably simple and best, and likely leaves room for any
> > changes later.
>
> Yes.
>
> >
> >
> > Of course, we could do what Nico is proposing here, as 1) and change it all later.
>
> Right.
>
> But that does mean for mTHP we're limited to 256 (or 255 was it?) but I guess
> given the 'creep' issue that's sensible.
I don't think that's much different from what David is trying to
propose, given eagerness=9 would be 50%.
At 10 (or 511), no matter what, you will only ever collapse to the
largest enabled order.
The difference in my approach is that, technically, with PMD disabled
and 511, you would still need 50% utilization to collapse, which is
not ideal if you always want to collapse to some mTHP size even with 1
page occupied. With David's solution this is solved by never allowing
anything in between 255-511.
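For reference, the cap I am describing boils down to something like
this (again just a sketch with a made-up helper name, reusing the
illustrative 4K page size constants from the sketch above; the real
logic lives in the series, not here):

/*
 * Sketch of the mTHP cap under discussion, not the actual patch. With
 * 4K pages HPAGE_PMD_NR is 512, so max_ptes_none gets clamped below
 * 256 for mTHP collapses: otherwise a freshly collapsed region would
 * always re-qualify on the next scan from the zero-filled half it just
 * gained (the "creep" problem).
 */
static int mthp_max_ptes_none(int max_ptes_none, int order)
{
	if (max_ptes_none >= HPAGE_PMD_NR / 2)
		max_ptes_none = HPAGE_PMD_NR / 2 - 1;

	/* scale the PMD-order value down to this mTHP order */
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}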
>
> >
> > It's just when it comes to documenting all that stuff in patch #15 that I feel like
> > "alright, we shouldn't be doing it longterm like that, so let's not make anybody
> > depend on any weird behavior here by over-documenting it".
> >
> > I mean
> >
> > "
> > +To prevent "creeping" behavior where collapses continuously promote to larger
> > +orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
> > +capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is due to the fact
> > +that introducing more than half of the pages to be non-zero it will always
> > +satisfy the eligibility check on the next scan and the region will be collapse.
> > "
> >
> > Is just way, way too detailed.
> >
> > I would just say "The kernel might decide to use a more conservative approach
> > when collapsing smaller THPs" etc.
> >
> >
> > Thoughts?
>
> Well I've sort of reviewed oppositely there :) well at least that it needs to be
> a hell of a lot clearer (I find that comment really compressed and I just don't
> really understand it).
I think your review is still valid for improving the internal code
comment. I think David is suggesting we not be so specific in the
actual admin-guide docs as we move towards a more opaque tunable.
>
> I guess I didn't think about people reading that and relying on it, so maybe we
> could alternatively make that succinct.
>
> But I think it'd be better to say something like "mTHP collapse cannot currently
> correctly function with half or more of the PTE entries empty, so we cap at just
> below this level" in this case.
Some middle ground might be the best answer: not too specific, but
still alluding to the inner workings a little.
Cheers,
-- Nico
>
> >
> > --
> > Cheers
> >
> > David / dhildenb
> >
>
> Cheers, Lorenzo
>