Message-ID: <CAA1CXcD1YDAbYzdYfchOWbmUasa3tN55AYroOLJb2EqoQfibvw@mail.gmail.com>
Date: Tue, 28 Oct 2025 20:47:12 -0600
From: Nico Pache <npache@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: David Hildenbrand <david@...hat.com>, linux-kernel@...r.kernel.org, 
	linux-trace-kernel@...r.kernel.org, linux-mm@...ck.org, 
	linux-doc@...r.kernel.org, ziy@...dia.com, baolin.wang@...ux.alibaba.com, 
	Liam.Howlett@...cle.com, ryan.roberts@....com, dev.jain@....com, 
	corbet@....net, rostedt@...dmis.org, mhiramat@...nel.org, 
	mathieu.desnoyers@...icios.com, akpm@...ux-foundation.org, baohua@...nel.org, 
	willy@...radead.org, peterx@...hat.com, wangkefeng.wang@...wei.com, 
	usamaarif642@...il.com, sunnanyong@...wei.com, vishal.moola@...il.com, 
	thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com, kas@...nel.org, 
	aarcange@...hat.com, raquini@...hat.com, anshuman.khandual@....com, 
	catalin.marinas@....com, tiwai@...e.de, will@...nel.org, 
	dave.hansen@...ux.intel.com, jack@...e.cz, cl@...two.org, jglisse@...gle.com, 
	surenb@...gle.com, zokeefe@...gle.com, hannes@...xchg.org, 
	rientjes@...gle.com, mhocko@...e.com, rdunlap@...radead.org, hughd@...gle.com, 
	richard.weiyang@...il.com, lance.yang@...ux.dev, vbabka@...e.cz, 
	rppt@...nel.org, jannh@...gle.com, pfalcato@...e.de
Subject: Re: [PATCH v12 mm-new 06/15] khugepaged: introduce
 collapse_max_ptes_none helper function

On Tue, Oct 28, 2025 at 1:00 PM Lorenzo Stoakes
<lorenzo.stoakes@...cle.com> wrote:
>
> On Tue, Oct 28, 2025 at 07:08:38PM +0100, David Hildenbrand wrote:
> >
> > > > > Hey Lorenzo,
> > > > >
> > > > > > I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> > > > > > to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> > > > > > implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> > > > >
> > > > > I spoke to David and he said to continue forward with this series; the
> > > > > "eagerness" tunable will take some time, and may require further
> > > > > considerations/discussion.
> > > >
> > > > Right, after talking to Johannes it got clearer that what we envisioned with
> > >
> > > I'm not sure that you meant to say go ahead with the series as-is with this
> > > silent capping?
> >
> > No, "go ahead" as in "let's find some way forward that works for all and is
> > not too crazy".
>
> Right we clearly needed to discuss that further at the time but that's moot now,
> we're figuring it out now :)
>
> >
> > [...]
> >
> > > > "eagerness" would not be like swappiness, and we will really have to be
> > > > careful here. I don't know yet when I will have time to look into that.
> > >
> > > I guess I missed this part of the conversation, what do you mean?
> >
> > Johannes raised issues with that on the list and afterwards we had an
> > offline discussion about some of the details and why something unpredictable
> > is not good.
>
> Could we get these details on-list so we can discuss them? This doesn't have to
> be urgent, but I would like to have a say in this or at least be part of the
> conversation please.
>
> >
> > >
> > > The whole concept is that we have a parameter whose value is _abstracted_ and
> > > whose meaning we control.
> > >
> > > I'm not sure exactly why that would now be problematic? The fundamental concept
> > > seems sound no? Last I remember of the conversation this was the case.
> >
> > The basic idea was to do something abstracted as swappiness. Turns out
> > "swappiness" is really something predictable, not something we can randomly
> > change how it behaves under the hood.
> >
> > So we'd have to find something similar for "eagerness", and that's where it
> > stops being easy.
>
> I think we shouldn't be too stuck on
>
> >
> > >
> > > >
> > > > If we want to avoid the implicit capping, I think there are the following
> > > > possible approaches
> > > >
> > > > (1) Tolerate creep for now, maybe warning if the user configures it.
> > >
> > > I mean this seems a viable option if there is pressure to land this series
> > > before we have a viable uAPI for configuring this.
> > >
> > > A part of me thinks we shouldn't rush series in for that reason though and
> > > should require that we have a proper control here.
> > >
> > > But I guess this approach is the least-worst as it leaves us with the most
> > > options moving forwards.
> >
> > Yes. There is also the alternative of respecting only 0 / 511 for mTHP
> > collapse for now as discussed in the other thread.
>
> Yes I guess let's carry that on over there.
>
> I mean this is why I said it's better to try to keep things in one thread :) but
> anyway, we've forked and can't be helped now.
>
> To be clear that was a criticism of - email development - not you.
>
> It's _extremely easy_ to have this happen because one thread naturally leads to
> a broader discussion of a given topic, whereas another has questions from
> somebody else about the same topic, to which people reply and then... you have a
> fork and it can't be helped.
>
> I guess I'm saying it'd be good if we could say 'ok let's move this to X'.
>
> But that's also broken in its own way, you can't stop people from replying in
> the other thread still and yeah. It's a limitation of this model :)
>
> >
> > >
> > > > (2) Avoid creep by counting zero-filled pages towards none_or_zero.
> > >
> > > Would this really make all that much difference?
> >
> > It solves the creep problem I think, but it's a bit nasty IMHO.
>
> Ah because you'd end up with a bunch of zeroed pages from the prior mTHP
> collapses, interesting...
>
> Scanning for that does seem a bit nasty though yes...
>
> >
> > >
> > > > (3) Have separate toggles for each THP size. Doesn't quite solve the
> > > >      problem, only shifts it.
> > >
> > > Yeah I did wonder about this as an alternative solution. But of course it then
> > > makes it vague what the parent value means in respect of the individual levels,
> > > unless we have an 'inherit' mode there too (possible).
> > >
> > > It's going to be confusing though as max_ptes_none sits at the root khugepaged/
> > > level and I don't think any other parameter from khugepaged/ is exposed at
> > > individual page size levels.
> > >
> > > And of course doing this means we
> > >
> > > >
> > > > Anything else?
> > >
> > > Err... I mean I'm not sure if you missed it but I suggested an approach in the
> > > sub-thread - exposing mthp_max_ptes_none as a _READ-ONLY_ field at:
> > >
> > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
> > >
> > > Then we allow the capping, but simply document that we specify what the capped
> > > value will be here for mTHP.
> >
> > I did not have time to read the details on that so far.
>
> OK. It is a bit nasty, yes. The idea is to find something that allows the
> capping to work.
>
> >
> > It would be one solution forward. I dislike it because I think the whole
> > capping is an intermediate thing that can be (and likely must be, when
> > considering mTHP underused shrinking I think) solved in the future
> > differently. That's why I would prefer adding this only if there is no
> > other, simpler, way forward.
>
> Yes I agree that if we could avoid it it'd be great.
>
> Really I proposed this solution on the basis that we were somehow ok with the
> capping.
>
> If we can avoid it, that'd be ideal as it reduces complexity and 'unexpected'
> behaviour.
>
> We'll clarify on the other thread, but the 511/0 was compelling to me before as
> a simplification, and if we can have a straightforward model of how mTHP
> collapse across none/zero page PTEs behaves this is ideal.
>
> The only question is w.r.t. warnings etc. but we can handle details there.
>
> >
> > >
> > > That struck me as the simplest way of getting this series landed without
> > > necessarily violating any future eagerness which:
> > >
> > > a. Must still support khugepaged/max_ptes_none - we aren't getting away from
> > >     this, it's uAPI.
> > >
> > > b. Surely must want to do different things for mTHP in eagerness, so if we're
> > >     exposing some PTE value in max_ptes_none doing so in
> > >     khugepaged/mthp_max_ptes_none wouldn't be problematic (note again - it's
> > >     readonly so unlike max_ptes_none we don't have to worry about the other
> > >     direction).
> > >
> > > HOWEVER, eagerness might want to change this behaviour per-mTHP size, in
> > > which case perhaps mthp_max_ptes_none would be problematic in that it is some
> > > kind of average.
> > >
> > > Then again we could always revert to putting this parameter as in (3) in that
> > > case, ugly but kinda viable.
> > >
> > > >
> > > > IIUC, creep is less of a problem when we have the underused shrinker
> > > > enabled: whatever we over-allocated can (unless longterm-pinned etc) get
> > > > reclaimed again.
> > > >
> > > > So maybe having underused-shrinker support for mTHP as well would be a
> > > > solution to tackle (1) later?
> > >
> > > How viable is this in the short term?
> >
> > I once started looking into it, but it will require quite some work, because
> > the lists will essentially include each and every (m)THP in the system ...
> > so I think we will need some redesign.
>
> Ack.
>
> This aligns with non-0/511 settings being non-functional for mTHP atm anyway.
>
> >
> > >
> > > Another possible solution:
> > >
> > > If mthp_max_ptes_none is not workable, we could have a toggle at, e.g.:
> > >
> > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_cap_collapse_none
> > >
> > > As a simple boolean. If switched on then we document that it caps mTHP as
> > > per Nico's suggestion.
> > >
> > > That way we avoid the 'silent' issue I have with all this and it's an
> > > explicit setting.
> >
> > Right, but it's another toggle I wish we wouldn't need. We could of course
> > also make it some compile-time option, but not sure if that's really any
> > better.
> >
> > I'd hope we find an easy way forward that doesn't require new toggles, at
> > least for now ...
>
> Right, well I agree if we can make this 0/511 thing work, let's do that.

Ok, great, some consensus! I will go ahead with that solution.

Just to make sure we are all on the same page:

For anything other than PMD collapse, will the max_ptes_none value be
treated as 0 unless it is set to 511? Or will max_ptes_none only be
honored for mTHP collapse when it is 0?

static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
{
	/* ignore max_ptes_none limits */
	if (full_scan)
		return HPAGE_PMD_NR - 1;

	/* PMD collapse keeps honoring the sysfs value as-is */
	if (order == HPAGE_PMD_ORDER)
		return khugepaged_max_ptes_none;

	/* mTHP: anything other than 511 is treated as 0 */
	if (khugepaged_max_ptes_none != HPAGE_PMD_NR - 1)
		return 0;

	/* mTHP with 511: scale the limit down to the collapse order */
	return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

Here's the implementation for the first approach, looks like Baolin
was able to catch up and beat me to the other solution while I was
mulling over the thread lol
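
To make the resulting values concrete, here is a rough userspace sketch of
the same logic. It is not kernel code: HPAGE_PMD_ORDER is hard-coded to 9
(an assumption that matches x86-64 with 4 KiB base pages), the global is
only a stand-in for the sysfs tunable, and dump_limits() is a hypothetical
helper that exists purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed values for x86-64 with 4 KiB base pages; other configs differ. */
#define HPAGE_PMD_ORDER	9
#define HPAGE_PMD_NR	(1u << HPAGE_PMD_ORDER)

/* Userspace stand-in for the khugepaged/max_ptes_none sysfs tunable. */
unsigned int khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;

unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
{
	/* ignore max_ptes_none limits */
	if (full_scan)
		return HPAGE_PMD_NR - 1;

	/* PMD collapse honors the tunable as-is */
	if (order == HPAGE_PMD_ORDER)
		return khugepaged_max_ptes_none;

	/* mTHP: anything other than 511 is treated as 0 */
	if (khugepaged_max_ptes_none != HPAGE_PMD_NR - 1)
		return 0;

	/* mTHP with 511: scale the limit down to the collapse order */
	return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

/* Hypothetical helper: print the effective limit for orders 2..9. */
void dump_limits(unsigned int tunable)
{
	/* The tunable is assumed to stay within 0..HPAGE_PMD_NR - 1. */
	assert(tunable < HPAGE_PMD_NR);

	khugepaged_max_ptes_none = tunable;
	for (unsigned int order = 2; order <= HPAGE_PMD_ORDER; order++)
		printf("max_ptes_none=%u order=%u -> %u\n", tunable, order,
		       collapse_max_ptes_none(order, false));
}
```

Calling dump_limits(511) and dump_limits(255) from a small main() shows
the split: with 511 an order-4 collapse tolerates 15 empty PTEs out of 16,
while with 255 every mTHP order gets 0 and only the PMD order keeps 255.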

Cheers,
-- Nico


>
> Toggles are just 'least worst' workarounds on assumption of the need for capping.
>
> >
> > --
> > Cheers
> >
> > David / dhildenb
> >
>
> Thanks, Lorenzo
>

