Message-ID: <2d8ed924-6d06-42e4-a876-381fb331f926@redhat.com>
Date: Wed, 29 Oct 2025 16:04:06 +0100
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>,
 Nico Pache <npache@...hat.com>, linux-kernel@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org, linux-mm@...ck.org,
 linux-doc@...r.kernel.org, ziy@...dia.com, Liam.Howlett@...cle.com,
 ryan.roberts@....com, dev.jain@....com, corbet@....net, rostedt@...dmis.org,
 mhiramat@...nel.org, mathieu.desnoyers@...icios.com,
 akpm@...ux-foundation.org, baohua@...nel.org, willy@...radead.org,
 peterx@...hat.com, wangkefeng.wang@...wei.com, usamaarif642@...il.com,
 sunnanyong@...wei.com, vishal.moola@...il.com,
 thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
 kas@...nel.org, aarcange@...hat.com, raquini@...hat.com,
 anshuman.khandual@....com, catalin.marinas@....com, tiwai@...e.de,
 will@...nel.org, dave.hansen@...ux.intel.com, jack@...e.cz, cl@...two.org,
 jglisse@...gle.com, surenb@...gle.com, zokeefe@...gle.com,
 hannes@...xchg.org, rientjes@...gle.com, mhocko@...e.com,
 rdunlap@...radead.org, hughd@...gle.com, richard.weiyang@...il.com,
 lance.yang@...ux.dev, vbabka@...e.cz, rppt@...nel.org, jannh@...gle.com,
 pfalcato@...e.de
Subject: Re: [PATCH v12 mm-new 06/15] khugepaged: introduce
 collapse_max_ptes_none helper function
>>
>> No creep, because you'll always collapse.
> 
> OK so in the 511 scenario, do we simply immediately collapse to the largest
> possible _mTHP_ page size if based on adjacent none/zero page entries in the
> PTE, and _never_ collapse to PMD on this basis even if we do have sufficient
> none/zero PTE entries to do so?
Right. And if we fail to allocate a PMD, we would collapse to smaller 
sizes, and later, once a PMD is possible, collapse to a PMD.
But there is no creep, as we would have collapsed a PMD right from the 
start either way.
> 
> And only collapse to PMD size if we have sufficient adjacent PTE entries that
> are populated?
> 
> Let's really nail this down actually so we can be super clear what the issue is
> here.
> 
I hope what I wrote above made sense.
> 
>>
>> Creep only happens if you wouldn't collapse a PMD without prior mTHP
>> collapse, but suddenly would in the same scenario simply because you had
>> prior mTHP collapse.
>>
>> At least that's my understanding.
> 
> OK, that makes sense, is the logic (this may be part of the bit I haven't
> reviewed yet tbh) then that for khugepaged mTHP we have the system where we
> always require prior mTHP collapse _first_?
So I would describe creep as

"we would not collapse a PMD THP because max_ptes_none is violated, but 
because we collapsed smaller mTHPs before, we essentially suddenly 
have more PTEs that are not none-or-zero, making us suddenly collapse a 
PMD THP at the same place".

Assume the following: max_ptes_none = 256

This means we would only collapse if at most half (256/512) of the PTEs 
are none-or-zero.

But imagine the following PTE layout, simplified to a PMD of 8 entries:

[ P Z P Z P Z Z Z ]

3 Present vs. 5 Zero -> do not collapse a PMD (8)

But assume we collapse smaller mTHP (2 entries) first

[ P P P P P P Z Z ]

We collapsed 3x "P Z" into "P P" because the ratio allowed for it.

Suddenly we have

6 Present vs. 2 Zero and we collapse a PMD (8)

[ P P P P P P P P ]

That's the "creep" problem.
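
To make the walk-through above easy to replay, here is a small Python toy
model of it. It is only a sketch: PMD_SIZE, MAX_PTES_NONE and all function
names are made up for illustration, this is not the khugepaged code, and it
assumes the none-or-zero limit scales linearly with the collapse size
(4 of 8 entries for the PMD, 1 of 2 for the 2-entry mTHP).

```python
PMD_SIZE = 8        # toy PMD from the example (a real PMD covers 512 PTEs)
MAX_PTES_NONE = 4   # "at most half": 256/512 scaled down to 8 entries


def scaled_limit(size):
    # Assumption: the allowed number of none-or-zero PTEs scales
    # linearly with the size of the range we try to collapse.
    return MAX_PTES_NONE * size // PMD_SIZE


def try_collapse(ptes, start, size):
    # Collapse ptes[start:start+size] in place if the number of
    # none-or-zero entries ("Z") does not exceed the scaled limit.
    window = ptes[start:start + size]
    if window.count("Z") > scaled_limit(size):
        return False
    ptes[start:start + size] = ["P"] * size
    return True


def mthp_pass(ptes, size):
    # Try every naturally aligned range of the given size once.
    for start in range(0, len(ptes), size):
        try_collapse(ptes, start, size)


layout = "P Z P Z P Z Z Z".split()

# Direct PMD attempt on a copy: 5 zero entries exceed the limit of 4.
pmd_before = try_collapse(list(layout), 0, PMD_SIZE)

# Collapse 2-entry mTHPs first, then retry the PMD on the result.
mthp_pass(layout, 2)
after_mthp = " ".join(layout)
pmd_after = try_collapse(layout, 0, PMD_SIZE)

print(pmd_before, "|", after_mthp, "|", pmd_after)
```

The PMD attempt fails on the original layout but succeeds once the 2-entry
pass has run, which is the creep described above.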
> 
>>
>>>
>>>> max_ptes_none == 0 -> collapse mTHP only if all non-none/zero
>>>>
>>>> And for the intermediate values
>>>>
>>>> (1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not
>>>> supported yet with other values
>>>
>>> It feels a bit much to issue a kernel warning every time somebody twiddles that
>>> value, and it's kind of against user expectation a bit.
>>
>> pr_warn_once() is what I meant.
> 
> Right, but even then it feels a bit extreme, warnings are pretty serious
> things. Then again there's precedent for this, and it may be the least worse
> solution.
> 
> I just picture a cloud provider turning this on with mTHP then getting their
> monitoring team reporting some urgent communication about warnings in dmesg :)
I mean, one could make the states mutually exclusive, maybe?

Disallow enabling mTHP with max_ptes_none set to unsupported values and 
the other way around.

That would probably be cleanest, although the implementation might get a 
bit more involved (but it's solvable).

But the concern could be that there are configs that could suddenly 
break: someone who set max_ptes_none and enabled mTHP.

I'll note that we could also consider only supporting "max_ptes_none = 
511" (default) to start with.

The nice thing about that value is that it is fully supported with the 
underused shrinker, because max_ptes_none=511 -> never shrink.
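
As a sanity check on why the two extreme values are the safe ones, the
following Python sketch enumerates every layout of a toy 8-entry PMD and
asks whether a prior mTHP pass can flip a "do not collapse" PMD decision
into "collapse". Everything in it is hypothetical (names, the 8-entry PMD,
the linear scaling of the limit, the mTHP sizes 2 and 4); it models the
argument in this thread, not the kernel implementation. In this toy model,
7 plays the role of 511 and 4 the role of 256.

```python
from itertools import product

PMD_SIZE = 8  # toy PMD; a real PMD covers 512 PTEs


def would_collapse(ptes, max_none):
    # Collapse is allowed if at most max_none entries are none-or-zero.
    return ptes.count("Z") <= max_none


def after_mthp_pass(ptes, size, max_none_pmd):
    # Assumption: the limit scales linearly with the collapse size.
    limit = max_none_pmd * size // PMD_SIZE
    out = list(ptes)
    for start in range(0, PMD_SIZE, size):
        if would_collapse(out[start:start + size], limit):
            out[start:start + size] = ["P"] * size
    return out


def creeps(max_none_pmd, mthp_sizes=(2, 4)):
    # True if some layout is rejected for PMD collapse as-is, but is
    # accepted after one pass of smaller collapses with the same setting.
    for layout in product("PZ", repeat=PMD_SIZE):
        ptes = list(layout)
        if would_collapse(ptes, max_none_pmd):
            continue  # PMD collapse happens right away: no creep possible
        for size in mthp_sizes:
            collapsed = after_mthp_pass(ptes, size, max_none_pmd)
            if would_collapse(collapsed, max_none_pmd):
                return True
    return False


for value in (0, 4, 7):
    print(value, creeps(value))
```

Under these assumptions only the intermediate value reports creep; with 0
nothing ever changes, and with 7 the only rejected layout is the all-zero
one, which no smaller collapse touches either.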
-- 
Cheers
David / dhildenb
