Message-ID: <3d6c013c-5592-4bb8-b438-e29787b1ab48@lucifer.local>
Date: Wed, 29 Oct 2025 18:41:28 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>,
        Nico Pache <npache@...hat.com>, linux-kernel@...r.kernel.org,
        linux-trace-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-doc@...r.kernel.org, ziy@...dia.com, Liam.Howlett@...cle.com,
        ryan.roberts@....com, dev.jain@....com, corbet@....net,
        rostedt@...dmis.org, mhiramat@...nel.org,
        mathieu.desnoyers@...icios.com, akpm@...ux-foundation.org,
        baohua@...nel.org, willy@...radead.org, peterx@...hat.com,
        wangkefeng.wang@...wei.com, usamaarif642@...il.com,
        sunnanyong@...wei.com, vishal.moola@...il.com,
        thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
        kas@...nel.org, aarcange@...hat.com, raquini@...hat.com,
        anshuman.khandual@....com, catalin.marinas@....com, tiwai@...e.de,
        will@...nel.org, dave.hansen@...ux.intel.com, jack@...e.cz,
        cl@...two.org, jglisse@...gle.com, surenb@...gle.com,
        zokeefe@...gle.com, hannes@...xchg.org, rientjes@...gle.com,
        mhocko@...e.com, rdunlap@...radead.org, hughd@...gle.com,
        richard.weiyang@...il.com, lance.yang@...ux.dev, vbabka@...e.cz,
        rppt@...nel.org, jannh@...gle.com, pfalcato@...e.de
Subject: Re: [PATCH v12 mm-new 06/15] khugepaged: introduce
 collapse_max_ptes_none helper function

On Wed, Oct 29, 2025 at 04:04:06PM +0100, David Hildenbrand wrote:
> > >
> > > No creep, because you'll always collapse.
> >
> > OK so in the 511 scenario, do we simply immediately collapse to the largest
> > possible _mTHP_ page size if based on adjacent none/zero page entries in the
> > PTE, and _never_ collapse to PMD on this basis even if we do have sufficient
> > none/zero PTE entries to do so?
>
> Right. And if we fail to allocate a PMD, we would collapse to smaller sizes,
> and later, once a PMD is possible, collapse to a PMD.
>
> But there is no creep, as we would have collapsed a PMD right from the start
> either way.

Hmm, would this mean at 511 mTHP collapse _across zero entries_ would only
ever collapse to PMD, except in cases where, for instance, PTE entries
belong to distinct VMAs and so you have to collapse to mTHP as a result?

Or IOW 'always collapse to the largest size you can I don't care if it
takes up more memory'

And at 0, we'd never collapse anything across zero entries, and only when
adjacent present entries can be collapsed to mTHP/PMD do we do so?
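
To make the two extremes concrete, here is a minimal sketch of how a PMD-level max_ptes_none could translate into a per-order limit. It assumes 4 KiB base pages with a 512-entry PMD (order 9) and a simple right-shift scaling; the function name and the scaling rule are illustrative assumptions, not taken from the patch under review.

```python
PMD_ORDER = 9  # assumption: 4 KiB base pages, 2 MiB PMD, 512 PTEs


def scaled_max_ptes_none(max_ptes_none: int, order: int) -> int:
    """Scale a PMD-level none/zero limit down to a smaller collapse order.

    Hypothetical helper: shifts the limit right by the difference
    between the PMD order and the target order.
    """
    return max_ptes_none >> (PMD_ORDER - order)


if __name__ == "__main__":
    for value in (511, 0):
        for order in (9, 4, 2, 1):
            limit = scaled_max_ptes_none(value, order)
            print(f"max_ptes_none={value} order={order} "
                  f"pages={1 << order} allowed none/zero={limit}")
```

Under these assumptions, 511 allows all but one entry of any candidate range to be none/zero at every order, while 0 allows none at any order.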

>
> >
> > And only collapse to PMD size if we have sufficient adjacent PTE entries that
> > are populated?
> >
> > Let's really nail this down actually so we can be super clear what the issue is
> > here.
> >
>
> I hope what I wrote above made sense.

Asking some q's still, probably more a me thing :)

>
> >
> > >
> > > Creep only happens if you wouldn't collapse a PMD without prior mTHP
> > > collapse, but suddenly would in the same scenario simply because you had
> > > prior mTHP collapse.
> > >
> > > At least that's my understanding.
> >
> > OK, that makes sense, is the logic (this may be part of the bit I haven't
> > reviewed yet tbh) then that for khugepaged mTHP we have the system where we
> > always require prior mTHP collapse _first_?
>
> So I would describe creep as
>
> "we would not collapse a PMD THP because max_ptes_none is violated, but
> because we collapsed smaller mTHP THPs before, we essentially suddenly have
> more PTEs that are not none-or-zero, making us suddenly collapse a PMD THP
> at the same place".

Yeah that makes sense.

>
> Assume the following: max_ptes_none = 256
>
> This means we would only collapse if at most half (256/512) of the PTEs are
> none-or-zero.
>
> But imagine the (simplified) PTE layout with PMD = 8 entries to simplify:
>
> [ P Z P Z P Z Z Z ]
>
> 3 Present vs. 5 Zero -> do not collapse a PMD (8)

OK I'm thinking this is more about /ratio/ than anything else.

PMD - <=50% - ok 5/8 = 62.5% no collapse.

>
> But assume we collapse smaller mTHP (2 entries) first
>
> [ P P P P P P Z Z ]

...512 KB mTHP (2 entries) - <= 50% means we can do...

>
> We collapsed 3x "P Z" into "P P" because the ratio allowed for it.

Yes so that's:

[ P Z P Z P Z Z Z ]

->

[ P P P P P P Z Z ]

Right?

>
> Suddenly we have
>
> 6 Present vs 2 Zero and we collapse a PMD (8)
>
> [ P P P P P P P P ]
>
> That's the "creep" problem.

I guess we try PMD collapse first then mTHP, but the worry is another pass
will collapse to PMD right?


Whereas < 50% ratio means we never end up 'propagating' or 'creeping' like
this because each collapse never provides enough reduction in zero entries
to allow for higher order collapse.

Hence the idea of capping at 255
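
The 8-entry example above can be replayed with a small toy model. This is only a sketch: it is not kernel code, the function name is made up, and it assumes the limit is scaled proportionally to the window size and that a window with no present entry is never collapsed.

```python
def collapse_pass(layout: str, size: int, max_none_pmd: int) -> str:
    """Run one collapse pass over aligned windows of `size` entries.

    `layout` is a string of 'P' (present) and 'Z' (none/zero) covering
    one simplified PMD. `max_none_pmd` is the none/zero limit for the
    whole PMD; it is scaled proportionally for smaller windows.
    """
    pmd_entries = len(layout)
    limit = max_none_pmd * size // pmd_entries
    out = []
    for start in range(0, pmd_entries, size):
        window = layout[start:start + size]
        # Assumption: need at least one present entry to collapse.
        if "P" in window and window.count("Z") <= limit:
            out.append("P" * len(window))
        else:
            out.append(window)
    return "".join(out)


if __name__ == "__main__":
    start = "PZPZPZZZ"
    # Limit of half the entries (the 256/512 case): PMD pass is refused,
    # 2-entry pass fills three windows, then the PMD pass succeeds.
    print(collapse_pass(start, 8, 4))
    after_small = collapse_pass(start, 2, 4)
    print(after_small)
    print(collapse_pass(after_small, 8, 4))
    # Limit just below half (the 255/512 analogue): the 2-entry limit
    # rounds down to 0, so this layout is left alone.
    print(collapse_pass(start, 2, 3))
```

For this particular layout, the limit just below half leaves everything untouched because the scaled limit for 2-entry windows becomes 0. The sketch does not show that every layout is free of creep below 50%.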

>
> >
> > >
> > > >
> > > > > max_ptes_none == 0 -> collapse mTHP only if all non-none/zero
> > > > >
> > > > > And for the intermediate values
> > > > >
> > > > > (1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not
> > > > > supported yet with other values
> > > >
> > > > It feels a bit much to issue a kernel warning every time somebody twiddles that
> > > > value, and it's kind of against user expectation a bit.
> > >
> > > pr_warn_once() is what I meant.
> >
> > Right, but even then it feels a bit extreme, warnings are pretty serious
> > things. Then again there's precedent for this, and it may be the least bad
> > solution.
> >
> > I just picture a cloud provider turning this on with mTHP then getting their
> > monitoring team reporting some urgent communication about warnings in dmesg :)
>
> I mean, one could make the states mutually exclusive, maybe?
>
> Disallow enabling mTHP with max_ptes_none set to unsupported values and the
> other way around.
>
> That would probably be cleanest, although the implementation might get a bit
> more involved (but it's solvable).
>
> But the concern could be that there are configs that could suddenly break:
> someone that set max_ptes_none and enabled mTHP.

Yeah we could always return an error on setting to an unsupported value.

I mean pr_warn() is nasty but maybe necessary.

>
>
> I'll note that we could also consider only supporting "max_ptes_none = 511"
> (default) to start with.
>
> The nice thing about that value is that it is fully supported with the
> underused shrinker, because max_ptes_none=511 -> never shrink.

It feels like = 0 would be useful though?

>
> --
> Cheers
>
> David / dhildenb
>

Thanks, Lorenzo
