Message-ID: <cip5baipge3u2tb2ysds6qeoq7qjqmtkk2x7uefamcwpgh42be@24bwdor4jskq>
Date: Fri, 12 Sep 2025 16:35:44 +0100
From: Pedro Falcato <pfalcato@...e.de>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: David Hildenbrand <david@...hat.com>,
Johannes Weiner <hannes@...xchg.org>, Kiryl Shutsemau <kas@...nel.org>, Nico Pache <npache@...hat.com>,
linux-mm@...ck.org, linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, ziy@...dia.com, baolin.wang@...ux.alibaba.com,
Liam.Howlett@...cle.com, ryan.roberts@....com, dev.jain@....com, corbet@....net,
rostedt@...dmis.org, mhiramat@...nel.org, mathieu.desnoyers@...icios.com,
akpm@...ux-foundation.org, baohua@...nel.org, willy@...radead.org, peterx@...hat.com,
wangkefeng.wang@...wei.com, usamaarif642@...il.com, sunnanyong@...wei.com,
vishal.moola@...il.com, thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
aarcange@...hat.com, raquini@...hat.com, anshuman.khandual@....com,
catalin.marinas@....com, tiwai@...e.de, will@...nel.org, dave.hansen@...ux.intel.com,
jack@...e.cz, cl@...two.org, jglisse@...gle.com, surenb@...gle.com,
zokeefe@...gle.com, rientjes@...gle.com, mhocko@...e.com, rdunlap@...radead.org,
hughd@...gle.com, richard.weiyang@...il.com, lance.yang@...ux.dev, vbabka@...e.cz,
rppt@...nel.org, jannh@...gle.com
Subject: Re: [PATCH v11 00/15] khugepaged: mTHP support
On Fri, Sep 12, 2025 at 03:01:02PM +0100, Lorenzo Stoakes wrote:
> On Fri, Sep 12, 2025 at 03:46:36PM +0200, David Hildenbrand wrote:
> > <snip>
> > Exactly.
> >
> > And willy suggested something like "eagerness" similar to "swappiness" that
> > gives us more flexibility when implementing it, including dynamically
> > adjusting the values in the future.
>
> I like the idea of abstracting it like this, and - in a rare case of kernel
> developer agreement (esp. around naming :) - both Matthew, David and I rather
> loved referring to this as 'eagerness' here :)
>
> The great benefit in relation to dynamic state is that we can simply treat this
> as an _abstract_ thing. I.e. 'how eager are we to establish THPs, trading off
> against memory pressure and higher order folio resource consumption'.
>
> And then we can decide how precisely that is implemented in practice - and a
> sensible approach would indeed be to differentiate between scenarios where we
> might be more willing to chomp up memory vs. those we are not.
>
> This also aligns nicely with the 'grand glorious future' we all dream of (don't
> we??) in THP where things are automated as much as possible and the _kernel
> decides_ what's best as far as is possible.
>
> As with swappiness, it is essentially a 'hint' to us in abstract terms rather
> than simply exposing an internal kernel parameter.
>
> (Credit to Matthew for making this abstraction suggestion in the THP cabal
> meeting by the way!)
>
> >
> > >
> > > An extreme example: if all your THPs have 2/512 pages populated,
> > > that's still cutting TLB pressure in half!
> >
> > IIRC, you create more pressure on the huge entries, where you might have
> > less TLB entries :) But yes, there can be cases where it is beneficial, if
> > there is absolutely no memory pressure.
> >
> > >
> > > So in the absence of memory pressure, allocating and collapsing should
> > > optimally be aggressive even on very sparse regions.
> >
> > Yes, we discussed that as well in the THP cabal.
> >
> > It's very similar to max_ptes_swap: that parameter should not exist.
> > If there is no memory pressure we can just swap it in. If there is memory
> > pressure we probably would not want to swap in much.
>
> Yes, but at least an eagerness parameter gets us closer to this ideal.
>
> Of course, I agree that max_ptes_none should simply never have been exposed like
> this. It is emblematic of a 'just shove a parameter into a tunable/sysfs and let
> the user decide' approach you see in the kernel sometimes.
>
> This is problematic as users have no earthly idea how to set the parameter (most
> likely never touch it), and only start fiddling should issues arise and it looks
> like a viable solution of some kind.
>
> The problem is users usually lack a great deal of context the kernel has, and
> may make incorrect decisions that work in one situation but not another.
Note that in this case we really don't have much in the way of context. We can
trivially do "check what number of ptes are mapped", but not anything much
fancier. You can also attempt to look at A bits (and/or check PG_referenced or
PG_active). But currently there's really nothing set up to collect this
information on a timely basis, and for anon memory (AFAIK) you only gauge this
on reclaim, _if_ you find the page itself.
The good news is that there are 3 or 4 separate movements for getting page
"temperature" information with their own special infra and daemons, for their
own special little features.
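To make the "check what number of ptes are mapped" point concrete, here is a
minimal userspace sketch of that kind of check. It is not the khugepaged code:
the struct, the helper names and the 512-entry table (4 KiB base pages under a
2 MiB PMD) are all assumptions made for illustration. Only the idea of
comparing the count of empty PTEs against a max_ptes_none-style limit comes
from the discussion above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB base pages under a 2 MiB PMD, so 512 PTE slots. */
#define PTES_PER_PMD 512

/* Hypothetical stand-in for a PTE table: true means the slot is populated. */
struct pte_table_sketch {
	bool present[PTES_PER_PMD];
};

/* Count the empty ("none") slots in the table. */
static size_t count_none_ptes(const struct pte_table_sketch *t)
{
	size_t none = 0;

	for (size_t i = 0; i < PTES_PER_PMD; i++)
		if (!t->present[i])
			none++;
	return none;
}

/*
 * Allow a collapse only if the number of empty slots does not exceed the
 * limit. The limit plays the role of max_ptes_none; this is the only
 * signal the sketch looks at, with no notion of page temperature.
 */
static bool collapse_allowed(const struct pte_table_sketch *t,
			     size_t max_ptes_none)
{
	return count_none_ptes(t) <= max_ptes_none;
}
```

With only 2 of the 512 slots populated (the extreme example quoted earlier),
count_none_ptes() gives 510, so a limit of 511 permits the collapse and a
limit of 0 refuses it. That is about as much as this signal can tell you.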
>
> TL;DR - this kind of interface is just lazy and we have to assess these kinds of
> tunables based on the actual RoI + understanding from the user's perspective.
Fully agreed.
--
Pedro