Message-ID: <b1115232-01a8-4799-9ea0-2d6f8fd95a62@lucifer.local>
Date: Thu, 30 Oct 2025 18:03:41 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Nico Pache <npache@...hat.com>
Cc: David Hildenbrand <david@...hat.com>,
        Baolin Wang <baolin.wang@...ux.alibaba.com>,
        linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
        linux-mm@...ck.org, linux-doc@...r.kernel.org, ziy@...dia.com,
        Liam.Howlett@...cle.com, ryan.roberts@....com, dev.jain@....com,
        corbet@....net, rostedt@...dmis.org, mhiramat@...nel.org,
        mathieu.desnoyers@...icios.com, akpm@...ux-foundation.org,
        baohua@...nel.org, willy@...radead.org, peterx@...hat.com,
        wangkefeng.wang@...wei.com, usamaarif642@...il.com,
        sunnanyong@...wei.com, vishal.moola@...il.com,
        thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
        kas@...nel.org, aarcange@...hat.com, raquini@...hat.com,
        anshuman.khandual@....com, catalin.marinas@....com, tiwai@...e.de,
        will@...nel.org, dave.hansen@...ux.intel.com, jack@...e.cz,
        cl@...two.org, jglisse@...gle.com, surenb@...gle.com,
        zokeefe@...gle.com, hannes@...xchg.org, rientjes@...gle.com,
        mhocko@...e.com, rdunlap@...radead.org, hughd@...gle.com,
        richard.weiyang@...il.com, lance.yang@...ux.dev, vbabka@...e.cz,
        rppt@...nel.org, jannh@...gle.com, pfalcato@...e.de
Subject: Re: [PATCH v12 mm-new 06/15] khugepaged: introduce
 collapse_max_ptes_none helper function

On Wed, Oct 29, 2025 at 03:10:19PM -0600, Nico Pache wrote:
> On Wed, Oct 29, 2025 at 12:42 PM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 04:04:06PM +0100, David Hildenbrand wrote:
> > > > >
> > > > > No creep, because you'll always collapse.
> > > >
> > > > OK so in the 511 scenario, do we simply immediately collapse to the largest
> > > > possible _mTHP_ page size if based on adjacent none/zero page entries in the
> > > > PTE, and _never_ collapse to PMD on this basis even if we do have sufficient
> > > > none/zero PTE entries to do so?
> > >
> > > Right. And if we fail to allocate a PMD, we would collapse to smaller sizes,
> > > and later, once a PMD is possible, collapse to a PMD.
> > >
> > > But there is no creep, as we would have collapsed a PMD right from the start
> > > either way.
> >
> > Hmm, would this mean at 511 mTHP collapse _across zero entries_ would only
> > ever collapse to PMD, except in cases where, for instance, PTE entries
> > belong to distinct VMAs and so you have to collapse to mTHP as a result?
>
> There are a few failure cases, like exceeding thresholds, or
> allocation failures, but yes your assessment is correct.

Yeah of course being mm there are thorny edge cases :) we do love those...

>
> At 511, the PMD collapse will be satisfied by a single PTE. If the
> collapse fails we will try both sides of the PMD (1024kB, 1024kB).
> The one that contains the non-none PTE will collapse.

Right yes.

>
> This is where the (HPAGE_PMD_ORDER - order) comes from.
> Imagine the 511 case above:
> 511 >> (HPAGE_PMD_ORDER - 9) == 511 >> 0 == 511 max_ptes_none
> 511 >> (HPAGE_PMD_ORDER - 8) (1024kB) == 511 >> 1 == 255 max_ptes_none
>
> Both of these align to the order's size minus 1.

Right.
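
To make the arithmetic above concrete, here is a minimal userspace sketch of
the shift. It assumes 4 KiB base pages (so HPAGE_PMD_ORDER is 9), and
scaled_max_ptes_none() is a name made up for illustration, not the helper from
the patch.

```c
#include <assert.h>

/* Assumption: 4 KiB base pages, so one PMD maps 2^9 = 512 PTEs. */
#define HPAGE_PMD_ORDER 9

/*
 * Hypothetical helper (illustration only): scale the PMD-level
 * max_ptes_none setting down to a smaller collapse order.
 */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	/* Orders above the PMD order make no sense here. */
	assert(order <= HPAGE_PMD_ORDER);

	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}
```

With max_ptes_none = 511 this gives 511 at order 9 and 255 at order 8, i.e. the
number of PTEs in that order minus one, and with max_ptes_none = 0 it gives 0
at every order.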

>
> >
> > Or IOW 'always collapse to the largest size you can I don't care if it
> > takes up more memory'
> >
> > And at 0, we'd never collapse anything across zero entries, and only when
> > adjacent present entries can be collapsed to mTHP/PMD do we do so?
>
> Yep!
>
> max_ptes_none = 0 + all mTHP sizes enabled gives you a really good
> distribution of mTHP sizes in the system, as zero memory will be
> wasted and the optimal size (space wise) will be found. At least
> for the memory allocated through khugepaged. The Defer patchset I had
> on top of this series was exactly for that purpose: allow khugepaged
> to determine all the THP usage in the system (other than madvise), and
> allow granular control of memory waste.

Yeah, well it's a trade off really isn't it on 'eagerness' to collapse
non-present entries :)

But we'll come back to that when David has time :)

>
> >
> > >
> > > >
> > > > And only collapse to PMD size if we have sufficient adjacent PTE entries that
> > > > are populated?
> > > >
> > > > Let's really nail this down actually so we can be super clear what the issue is
> > > > here.
> > > >
> > >
> > > I hope what I wrote above made sense.
> >
> > Asking some q's still, probably more a me thing :)
> >
> > >
> > > >
> > > > >
> > > > > Creep only happens if you wouldn't collapse a PMD without prior mTHP
> > > > > collapse, but suddenly would in the same scenario simply because you had
> > > > > prior mTHP collapse.
> > > > >
> > > > > At least that's my understanding.
> > > >
> > > > OK, that makes sense, is the logic (this may be part of the bit I haven't
> > > > reviewed yet tbh) then that for khugepaged mTHP we have the system where we
> > > > always require prior mTHP collapse _first_?
> > >
> > > So I would describe creep as
> > >
> > > "we would not collapse a PMD THP because max_ptes_none is violated, but
> > > because we collapsed smaller mTHP THPs before, we essentially suddenly have
> > > more PTEs that are not none-or-zero, making us suddenly collapse a PMD THP
> > > at the same place".
> >
> > Yeah that makes sense.
> >
> > >
> > > Assume the following: max_ptes_none = 256
> > >
> > > This means we would only collapse if at most half (256/512) of the PTEs are
> > > none-or-zero.
> > >
> > > But imagine the (simplified) PTE layout with PMD = 8 entries to simplify:
> > >
> > > [ P Z P Z P Z Z Z ]
> > >
> > > 3 Present vs. 5 Zero -> do not collapse a PMD (8)
> >
> > OK I'm thinking this is more about /ratio/ than anything else.
> >
> > PMD - <=50% - ok 5/8 = 62.5% no collapse.
>
>                 < 50%*.
>
> At 50% it's 256 which is actually the worst case scenario. But I read
> further, and it seems like you grasped the issue.

Yeah this is < 50% vs. <= 50% which are fundamentally different obviously :)

>
> >
> > >
> > > But assume we collapse smaller mTHP (2 entries) first
> > >
> > > [ P P P P P P Z Z ]
> >
> > ...512 KB mTHP (2 entries) - <= 50% means we can do...
> >
> > >
> > > We collapsed 3x "P Z" into "P P" because the ratio allowed for it.
> >
> > Yes so that's:
> >
> > [ P Z P Z P Z Z Z ]
> >
> > ->
> >
> > [ P P P P P P Z Z ]
> >
> > Right?
> >
> > >
> > > Suddenly we have
> > >
> > > 6 Present vs 2 Zero and we collapse a PMD (8)
> > >
> > > [ P P P P P P P P ]
> > >
> > > That's the "creep" problem.
> >
> > I guess we try PMD collapse first then mTHP, but the worry is another pass
> > will collapse to PMD right?
> >
> >
> > Whereas < 50% ratio means we never end up 'propagating' or 'creeping' like
> > this because each collapse never provides enough reduction in zero entries
> > to allow for higher order collapse.
> >
> > Hence the idea of capping at 255
>
> Yep! We've discussed other solutions, like tracking collapsed pages,
> or the solutions brought up by David. But this seemed like the most
> logical to me, as it keeps some of the tunability. I now understand
> the concern wasn't so much the capping, but rather the silent nature of
> it, and the uAPI expectations surrounding enforcing such a limit (for
> both past and future behavioral expectations).

Yes, that's the primary concern on my side.
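
For reference, the creep scenario quoted above can be replayed with a small
userspace model. This is only a sketch under stated assumptions: it uses
David's simplified 8-entry "PMD", every name in it is invented for
illustration, and the toy limits 4 and 3 stand in for 256 and 255.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model: a "PMD" of 8 entries, as in the simplified layout above. */
#define TOY_PMD_ORDER 3
#define TOY_PMD_NR (1u << TOY_PMD_ORDER)

/* Count none/zero entries in present[start, start + nr). */
static unsigned int count_none(const bool *present, size_t start, size_t nr)
{
	unsigned int none = 0;

	for (size_t i = start; i < start + nr; i++)
		if (!present[i])
			none++;
	return none;
}

/* Collapse one naturally aligned range if the scaled limit allows it. */
static bool try_collapse(bool *present, size_t start, unsigned int order,
			 unsigned int max_ptes_none)
{
	size_t nr = (size_t)1 << order;
	unsigned int limit;

	assert(order <= TOY_PMD_ORDER);
	limit = max_ptes_none >> (TOY_PMD_ORDER - order);

	if (count_none(present, start, nr) > limit)
		return false;
	for (size_t i = start; i < start + nr; i++)
		present[i] = true;	/* the whole range is populated now */
	return true;
}

/*
 * True if a PMD collapse is rejected on the initial layout but succeeds
 * after a pass of 2-entry collapses, i.e. the "creep" described above.
 */
static bool creeps(const bool *initial, unsigned int max_ptes_none)
{
	bool ptes[TOY_PMD_NR];

	memcpy(ptes, initial, sizeof(ptes));
	if (try_collapse(ptes, 0, TOY_PMD_ORDER, max_ptes_none))
		return false;	/* collapsed right from the start: no creep */

	for (size_t start = 0; start < TOY_PMD_NR; start += 2)
		try_collapse(ptes, start, 1, max_ptes_none);

	return try_collapse(ptes, 0, TOY_PMD_ORDER, max_ptes_none);
}
```

In this model the layout [ P Z P Z P Z Z Z ] creeps with a limit of 4 (the
<= 50% case) but not with a limit of 3 (< 50%), nor at the extremes 0 and 7.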

>
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > max_ptes_none == 0 -> collapse mTHP only if all non-none/zero
> > > > > > >
> > > > > > > And for the intermediate values
> > > > > > >
> > > > > > > (1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not
> > > > > > > supported yet with other values
> > > > > >
> > > > > > It feels a bit much to issue a kernel warning every time somebody twiddles that
> > > > > > value, and it's kind of against user expectation a bit.
> > > > >
> > > > > pr_warn_once() is what I meant.
> > > >
> > > > Right, but even then it feels a bit extreme, warnings are pretty serious
> > > > things. Then again there's precedent for this, and it may be the least bad
> > > > solution.
> > > >
> > > > I just picture a cloud provider turning this on with mTHP then getting their
> > > > monitoring team reporting some urgent communication about warnings in dmesg :)
> > >
> > > I mean, one could make the states mutually exclusive, maybe?
> > >
> > > Disallow enabling mTHP with max_ptes_none set to unsupported values and the
> > > other way around.
> > >
> > > That would probably be cleanest, although the implementation might get a bit
> > > more involved (but it's solvable).
> > >
> > > But the concern could be that there are configs that could suddenly break:
> > > someone that set max_ptes_none and enabled mTHP.
> >
> > Yeah we could always return an error on setting to an unsupported value.
> >
> > I mean pr_warn() is nasty but maybe necessary.
> >
> > >
> > >
> > > I'll note that we could also consider only supporting "max_ptes_none = 511"
> > > (default) to start with.
> > >
> > > The nice thing about that value is that it is fully supported with the
> > > underused shrinker, because max_ptes_none=511 -> never shrink.
> >
> > It feels like = 0 would be useful though?
>
> I personally think the default of 511 is wrong and should be on the
> lower end of the scale. The exception being thp=always, where I
> believe the kernel should treat it as 511.

I think that'd be confusing to have different behaviour for thp=always, and I'd
rather we didn't do that.

But ultimately it's all moot I think as these are all uAPI things now.

It was a mistake to even export this IMO, but that can't be helped now :)

>
> But the second part of that would also violate the user's max_ptes_none
> setting, so it's probably much harder in practice, and also not really
> part of this series, just my opinion.

I'm confused what you mean here?

In any case I think the 511/0 solution is the way forwards.
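
As a rough illustration of what a 511/0-only policy could look like (a sketch
only, not code from this series; the function name and the -1 "unsupported"
return value are assumptions, as is HPAGE_PMD_ORDER being 9):

```c
#include <assert.h>

/* Assumption: 4 KiB base pages, so one PMD maps 512 PTEs. */
#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR (1u << HPAGE_PMD_ORDER)

/*
 * Hypothetical policy helper: return the none/zero PTE limit to apply
 * when collapsing to 'order', or -1 if mTHP collapse is not supported
 * for the current max_ptes_none setting.
 */
static int max_ptes_none_for_order(unsigned int max_ptes_none,
				   unsigned int order)
{
	assert(order <= HPAGE_PMD_ORDER);
	assert(max_ptes_none < HPAGE_PMD_NR);

	/* PMD collapse keeps honouring the tunable as-is. */
	if (order == HPAGE_PMD_ORDER)
		return (int)max_ptes_none;

	/* 0: never collapse across none/zero entries. */
	if (max_ptes_none == 0)
		return 0;

	/* 511: always collapse, so allow all but one entry to be none. */
	if (max_ptes_none == HPAGE_PMD_NR - 1)
		return (int)((1u << order) - 1);

	/* Intermediate values: mTHP collapse unsupported in this sketch. */
	return -1;
}
```

A caller would skip mTHP collapse (and perhaps warn once) when this returns -1,
while PMD-sized collapse behaves as it does today.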

>
> Cheers.
> -- Nico
>
> >
> > >
> > > --
> > > Cheers
> > >
> > > David / dhildenb
> > >
> >
> > Thanks, Lorenzo
> >
>

Cheers, Lorenzo
