Message-ID: <41d9c544-361f-4457-a53e-023b8db8c707@lucifer.local>
Date: Mon, 15 Sep 2025 11:44:50 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Nico Pache <npache@...hat.com>
Cc: David Hildenbrand <david@...hat.com>, Kiryl Shutsemau <kas@...nel.org>,
        linux-mm@...ck.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
        ziy@...dia.com, baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com,
        ryan.roberts@....com, dev.jain@....com, corbet@....net,
        rostedt@...dmis.org, mhiramat@...nel.org,
        mathieu.desnoyers@...icios.com, akpm@...ux-foundation.org,
        baohua@...nel.org, willy@...radead.org, peterx@...hat.com,
        wangkefeng.wang@...wei.com, usamaarif642@...il.com,
        sunnanyong@...wei.com, vishal.moola@...il.com,
        thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
        aarcange@...hat.com, raquini@...hat.com, anshuman.khandual@....com,
        catalin.marinas@....com, tiwai@...e.de, will@...nel.org,
        dave.hansen@...ux.intel.com, jack@...e.cz, cl@...two.org,
        jglisse@...gle.com, surenb@...gle.com, zokeefe@...gle.com,
        hannes@...xchg.org, rientjes@...gle.com, mhocko@...e.com,
        rdunlap@...radead.org, hughd@...gle.com, richard.weiyang@...il.com,
        lance.yang@...ux.dev, vbabka@...e.cz, rppt@...nel.org,
        jannh@...gle.com, pfalcato@...e.de
Subject: Re: [PATCH v11 00/15] khugepaged: mTHP support

On Fri, Sep 12, 2025 at 06:28:55PM -0600, Nico Pache wrote:
> On Fri, Sep 12, 2025 at 12:22 PM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> >
> > On Fri, Sep 12, 2025 at 07:53:22PM +0200, David Hildenbrand wrote:
> > > On 12.09.25 17:51, Lorenzo Stoakes wrote:
> > > > With all this stuff said, do we have an actual plan for what we intend to do
> > > > _now_?
> > >
> > > Oh no, no I have to use my brain and it's Friday evening.
> >
> > I apologise :)
> >
> > >
> > > >
> > > > As Nico has implemented a basic solution here that we all seem to agree is not
> > > > what we want.
> > > >
> > > > Without needing special new hardware or major reworks, what would this parameter
> > > > look like?
> > > >
> > > > What would the heuristics be? What about the eagerness scales?
> > > >
> > > > I'm but a simple kernel developer,
> > >
> > > :)
> > >
> > > and interested in simple pragmatic stuff :)
> > > > do you have a plan right now David?
> > >
> > > Ehm, if you ask me that way ...
> > >
> > > >
> > > > Maybe we can start with something simple like a rough percentage per eagerness
> > > > entry that then gets scaled based on utilisation?
> > >
> > > ... I think we should probably:
> > >
> > > 1) Start with something very simple for mTHP that doesn't lock us into any particular direction.
> >
> > Yes.
> >
> > >
> > > 2) Add an "eagerness" parameter with fixed scale and use that for mTHP as well
> >
> > Yes I think we're all pretty onboard with that it seems!
> >
> > >
> > > 3) Improve that "eagerness" algorithm using a dynamic scale or #whatever
> >
> > Right, I feel like we could start with some very simple linear thing here and
> > later maybe refine it?
>
> I agree, something like 0,32,64,128,255,511 seems to map well, and is
> not too different from what I'm doing with the scaling by
> (HPAGE_PMD_ORDER - order).

Actually, I suspect something like what David suggests in [0] is probably the
better way, but as I said there I think it should be an internal implementation
detail as to what this ultimately ends up being.

The idea is we provide an abstract thing a user can set, and the kernel figures
out how best to interpret that.

[0]: https://lore.kernel.org/linux-mm/cd8e7f1c-a563-4ae9-a0fb-b0d04a4c35b4@redhat.com/

>
> >
> > >
> > > 4) Solve world peace and world hunger
> >
> > Yes! That would be pretty great ;)
> This should probably be a larger priority

:)))

> >
> > >
> > > 5) Connect it all to memory pressure / reclaim / shrinker / heuristics / hw hotness / #whatever
> >
> > I think these are TODOs :)
> >
> > >
> > >
> > > I maintain my initial position that just using
> > >
> > > max_ptes_none == 511 -> collapse mTHP always
> > > max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero
> > >
> > > As a starting point is probably simple and best, and likely leaves room for any
> > > changes later.
> >
> > Yes.
> >
> > >
> > >
> > > Of course, we could do what Nico is proposing here, as 1) and change it all later.
> >
> > Right.
> >
> > But that does mean for mTHP we're limited to 256 (or 255 was it?) but I guess
> > given the 'creep' issue that's sensible.
>
> I don't think that's much different from what David is proposing,
> given eagerness=9 would be 50%.

I think q

> at 10 or 511, no matter what, you will only ever collapse to the
> largest enabled order.
> The difference in my approach is that, technically, with PMD disabled
> and 511, you would still need 50% utilization to collapse, which is
> not ideal if you always want to collapse to some mTHP size even with
> one page occupied. With David's solution this is solved by never
> allowing anything in between 255-511.

Right. Except we default to max eagerness (or min, I asked David about the
values there :P)

So aren't we, by default, broken on mTHP? Maybe we can change the default though...

>
> >
> > >
> > > It's just when it comes to documenting all that stuff in patch #15 that I feel like
> > > "alright, we shouldn't be doing it long-term like that, so let's not make anybody
> > > depend on any weird behavior here by over-documenting it".
> > >
> > > I mean
> > >
> > > "
> > > +To prevent "creeping" behavior where collapses continuously promote to larger
> > > +orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
> > > +capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is because once more
> > > +than half of the pages are non-none, the region will always satisfy the
> > > +eligibility check on the next scan and will be collapsed.
> > > "
> > >
> > > Is just way, way too detailed.
> > >
> > > I would just say "The kernel might decide to use a more conservative approach
> > > when collapsing smaller THPs" etc.
> > >
> > >
> > > Thoughts?
> >
> > Well, I've sort of given the opposite review there :) or at least said that it needs
> > to be a hell of a lot clearer (I find that comment really compressed and I just
> > don't really understand it).
>
> I think your review is still valid to improve the internal code
> comment. I think David is suggesting to not be so specific in the
> actual admin-guide docs as we move towards a more opaque tunable.

Yeah, thanks for pointing that out! We were talking at cross purposes.

>
> >
> > I guess I didn't think about people reading that and relying on it, so maybe we
> > could alternatively make that succinct.
> >
> > But I think it'd be better to say something like "mTHP collapse cannot currently
> > correctly function with half or more of the PTE entries empty, so we cap at just
> > below this level" in this case.
>
> Some middle ground might be the best answer, not too specific, but
> also allude to the interworking a little.

Yeah actually I agree with David re: documentation, my comments were wrt
err... comments :P only.

>
> Cheers,
> -- Nico

Cheers, Lorenzo
