Message-ID: <e66b671f-c6df-48c1-8045-903631a8eb85@lucifer.local>
Date: Tue, 28 Oct 2025 17:29:59 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Nico Pache <npache@...hat.com>, linux-kernel@...r.kernel.org,
        linux-trace-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-doc@...r.kernel.org, ziy@...dia.com,
        baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com,
        ryan.roberts@....com, dev.jain@....com, corbet@....net,
        rostedt@...dmis.org, mhiramat@...nel.org,
        mathieu.desnoyers@...icios.com, akpm@...ux-foundation.org,
        baohua@...nel.org, willy@...radead.org, peterx@...hat.com,
        wangkefeng.wang@...wei.com, usamaarif642@...il.com,
        sunnanyong@...wei.com, vishal.moola@...il.com,
        thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
        kas@...nel.org, aarcange@...hat.com, raquini@...hat.com,
        anshuman.khandual@....com, catalin.marinas@....com, tiwai@...e.de,
        will@...nel.org, dave.hansen@...ux.intel.com, jack@...e.cz,
        cl@...two.org, jglisse@...gle.com, surenb@...gle.com,
        zokeefe@...gle.com, hannes@...xchg.org, rientjes@...gle.com,
        mhocko@...e.com, rdunlap@...radead.org, hughd@...gle.com,
        richard.weiyang@...il.com, lance.yang@...ux.dev, vbabka@...e.cz,
        rppt@...nel.org, jannh@...gle.com, pfalcato@...e.de
Subject: Re: [PATCH v12 mm-new 06/15] khugepaged: introduce
 collapse_max_ptes_none helper function

On Tue, Oct 28, 2025 at 03:15:26PM +0100, David Hildenbrand wrote:
> On 28.10.25 14:36, Nico Pache wrote:
> > On Mon, Oct 27, 2025 at 11:54 AM Lorenzo Stoakes
> > <lorenzo.stoakes@...cle.com> wrote:
> > >
> > > On Wed, Oct 22, 2025 at 12:37:08PM -0600, Nico Pache wrote:
> > > > The current mechanism for determining mTHP collapse scales the
> > > > khugepaged_max_ptes_none value based on the target order. This
> > > > introduces an undesirable feedback loop, or "creep", when max_ptes_none
> > > > is set to a value greater than HPAGE_PMD_NR / 2.
> > > >
> > > > With this configuration, a successful collapse to order N will populate
> > > > enough pages to satisfy the collapse condition on order N+1 on the next
> > > > scan. This leads to unnecessary work and memory churn.
> > > >
> > > > To fix this issue introduce a helper function that caps the max_ptes_none
> > > > to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
> > > > the max_ptes_none number by the (PMD_ORDER - target collapse order).
> > > >
> > > > The limits can be ignored by passing full_scan=true, this is useful for
> > > > madvise_collapse (which ignores limits), or in the case of
> > > > collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
> > > > collapse is available.
> > > >
> > > > Signed-off-by: Nico Pache <npache@...hat.com>
> > > > ---
> > > >   mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
> > > >   1 file changed, 34 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 4ccebf5dda97..286c3a7afdee 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -459,6 +459,39 @@ void __khugepaged_enter(struct mm_struct *mm)
> > > >                wake_up_interruptible(&khugepaged_wait);
> > > >   }
> > > >
> > > > +/**
> > > > + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> > > > + * @order: The folio order being collapsed to
> > > > + * @full_scan: Whether this is a full scan (ignore limits)
> > > > + *
> > > > + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
> > > > + * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
> > > > + *
> > > > + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> > > > + * khugepaged_max_ptes_none value.
> > > > + *
> > > > + * For mTHP collapses, scale down the max_ptes_none proportionally to the folio
> > > > + * order, but caps it at HPAGE_PMD_NR/2-1 to prevent a collapse feedback loop.
> > > > + *
> > > > + * Return: Maximum number of empty PTEs allowed for the collapse operation
> > > > + */
> > > > +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> > > > +{
> > > > +     unsigned int max_ptes_none;
> > > > +
> > > > +     /* ignore max_ptes_none limits */
> > > > +     if (full_scan)
> > > > +             return HPAGE_PMD_NR - 1;
> > > > +
> > > > +     if (order == HPAGE_PMD_ORDER)
> > > > +             return khugepaged_max_ptes_none;
> > > > +
> > > > +     max_ptes_none = min(khugepaged_max_ptes_none, HPAGE_PMD_NR/2 - 1);
> > >
> >
> > Hey Lorenzo,
> >
> > > I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> > > to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> > > implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> >
> > I spoke to David and he said to continue forward with this series; the
> > "eagerness" tunable will take some time, and may require further
> > considerations/discussion.
>
> Right, after talking to Johannes it got clearer that what we envisioned with

I'm not sure that you meant to say go ahead with the series as-is with this
silent capping?

Either way we need better communication of this, because I wasn't aware that was
the plan for one, and it means this patch directly ignores review from 2
versions ago, which needs to be documented _somewhere_ so people aren't confused.

And it would maybe have allowed us to have this conversation ahead of time rather
than now.

> "eagerness" would not be like swappiness, and we will really have to be
> careful here. I don't know yet when I will have time to look into that.

I guess I missed this part of the conversation, what do you mean?

The whole concept is that we have a parameter whose value is _abstracted_ and
whose meaning we control.

I'm not sure exactly why that would now be problematic? The fundamental concept
seems sound, no? Last I remember of the conversation this was the case.

>
> If we want to avoid the implicit capping, I think there are the following
> possible approaches
>
> (1) Tolerate creep for now, maybe warning if the user configures it.

I mean this seems a viable option if there is pressure to land this series
before we have a viable uAPI for configuring this.

A part of me thinks we shouldn't rush series in for that reason though and
should require that we have a proper control here.

But I guess this approach is the least-worst as it leaves us with the most
options moving forwards.
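
To make the creep condition concrete, here is a minimal userspace sketch, not
the kernel code. It assumes 4k pages (PMD order 9, 512 PTEs), assumes the
scaling is a right shift by (PMD order - target order), and assumes a range
may collapse when its empty-PTE count is <= the scaled limit. The names
scaled_max_ptes_none() and creeps_to_next_order() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for 4k base pages; illustrative only. */
#define SKETCH_PMD_ORDER 9u
#define SKETCH_PMD_NR (1u << SKETCH_PMD_ORDER) /* 512 PTEs per PMD */

/* Scale a PMD-level max_ptes_none down to a smaller collapse order. */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	assert(order <= SKETCH_PMD_ORDER);
	return max_ptes_none >> (SKETCH_PMD_ORDER - order);
}

/*
 * After a collapse to 'order', the enclosing order + 1 range has at most
 * 2^order empty PTEs (worst case: the sibling half is entirely empty).
 * If even that worst case passes the order + 1 limit, the next scan can
 * always collapse one order higher, which is the creep.
 */
static bool creeps_to_next_order(unsigned int max_ptes_none,
				 unsigned int order)
{
	unsigned int empty_after_collapse = 1u << order;

	assert(order < SKETCH_PMD_ORDER);
	return empty_after_collapse <=
	       scaled_max_ptes_none(max_ptes_none, order + 1);
}
```

Under these assumptions creeps_to_next_order(256, 4) is true while
creeps_to_next_order(255, 4) is false, which matches the choice of
HPAGE_PMD_NR / 2 - 1 as the cap.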

> (2) Avoid creep by counting zero-filled pages towards none_or_zero.

Would this really make all that much difference?

> (3) Have separate toggles for each THP size. Doesn't quite solve the
>     problem, only shifts it.

Yeah I did wonder about this as an alternative solution. But of course it then
makes it vague what the parent value means in respect of the individual levels,
unless we have an 'inherit' mode there too (possible).

It's going to be confusing though as max_ptes_none sits at the root khugepaged/
level and I don't think any other parameter from khugepaged/ is exposed at
individual page size levels.

And of course doing this means we

>
> Anything else?

Err... I mean I'm not sure if you missed it but I suggested an approach in the
sub-thread - exposing mthp_max_ptes_none as a _READ-ONLY_ field at:

/sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none

Then we allow the capping, but simply document that we specify what the capped
value will be here for mTHP.
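
Roughly what I have in mind, as a userspace sketch rather than real sysfs
code: the read-only field would simply report the derived, capped value. The
function names are hypothetical, 512 PTEs per PMD (4k pages) is assumed, and
snprintf() stands in for whatever the kernel show() callback would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed: 512 PTEs per PMD (4k base pages). */
#define SKETCH_HPAGE_PMD_NR 512u

/* The value mTHP collapse would actually use: capped at PMD_NR / 2 - 1. */
static unsigned int mthp_capped_max_ptes_none(unsigned int max_ptes_none)
{
	unsigned int cap = SKETCH_HPAGE_PMD_NR / 2 - 1;

	return max_ptes_none < cap ? max_ptes_none : cap;
}

/*
 * Mimics a read-only show() callback: formats the capped value into buf
 * and returns the number of characters written (excluding the NUL).
 */
static int mthp_max_ptes_none_show(unsigned int max_ptes_none,
				   char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%u\n",
			mthp_capped_max_ptes_none(max_ptes_none));
}
```

With max_ptes_none at its default of 511, a read would return "255\n", so the
cap is visible to the user rather than silent.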

That struck me as the simplest way of getting this series landed without
necessarily violating any future eagerness which:

a. Must still support khugepaged/max_ptes_none - we aren't getting away from
   this, it's uAPI.

b. Surely must want to do different things for mTHP in eagerness, so if we're
   exposing some PTE value in max_ptes_none doing so in
   khugepaged/mthp_max_ptes_none wouldn't be problematic (note again - it's
   readonly so unlike max_ptes_none we don't have to worry about the other
   direction).

HOWEVER, eagerness might want to change this behaviour per-mTHP size, in
which case perhaps mthp_max_ptes_none would be problematic in that it is some
kind of average.

Then again we could always revert to putting this parameter as in (3) in that
case, ugly but kinda viable.

>
> IIUC, creep is less of a problem when we have the underused shrinker
> enabled: whatever we over-allocated can (unless longterm-pinned etc) get
> reclaimed again.
>
> So maybe having underused-shrinker support for mTHP as well would be a
> solution to tackle (1) later?

How viable is this in the short term?

>
> --
> Cheers
>
> David / dhildenb
>

Another possible solution:

If mthp_max_ptes_none is not workable, we could have a toggle at, e.g.:

/sys/kernel/mm/transparent_hugepage/khugepaged/mthp_cap_collapse_none

As a simple boolean. If switched on then we document that it caps mTHP as
per Nico's suggestion.

That way we avoid the 'silent' issue I have with all this and it's an
explicit setting.
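
A rough userspace sketch of how such a toggle could gate the cap, under the
same assumptions as the patch description (4k pages, PMD order 9, scaling by
right shift). The cap_enabled parameter stands in for the proposed boolean
and the function name is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for 4k base pages; illustrative only. */
#define SKETCH_ORDER 9u
#define SKETCH_NR (1u << SKETCH_ORDER)

/*
 * Limit on empty PTEs for a collapse to 'order'. PMD-sized collapses use
 * the configured value untouched; mTHP collapses are capped only when the
 * (hypothetical) toggle is on, then scaled down to the target order.
 */
static unsigned int sketch_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order,
					 bool cap_enabled)
{
	assert(order <= SKETCH_ORDER);

	if (order == SKETCH_ORDER)
		return max_ptes_none;

	if (cap_enabled && max_ptes_none > SKETCH_NR / 2 - 1)
		max_ptes_none = SKETCH_NR / 2 - 1;

	return max_ptes_none >> (SKETCH_ORDER - order);
}
```

With the toggle off, the user gets the uncapped scaling (and the creep that
comes with it), but that is then a documented choice rather than a surprise.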

Cheers, Lorenzo
