Message-ID: <71d11a43-e9ff-46e5-988d-b39905e10f61@gmail.com>
Date: Fri, 5 Sep 2025 13:31:21 +0100
From: Usama Arif <usamaarif642@...il.com>
To: David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Nico Pache <npache@...hat.com>
Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>, Dev Jain <dev.jain@....com>,
linux-mm@...ck.org, linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, ziy@...dia.com, Liam.Howlett@...cle.com,
ryan.roberts@....com, corbet@....net, rostedt@...dmis.org,
mhiramat@...nel.org, mathieu.desnoyers@...icios.com,
akpm@...ux-foundation.org, baohua@...nel.org, willy@...radead.org,
peterx@...hat.com, wangkefeng.wang@...wei.com, sunnanyong@...wei.com,
vishal.moola@...il.com, thomas.hellstrom@...ux.intel.com,
yang@...amperecomputing.com, kirill.shutemov@...ux.intel.com,
aarcange@...hat.com, raquini@...hat.com, anshuman.khandual@....com,
catalin.marinas@....com, tiwai@...e.de, will@...nel.org,
dave.hansen@...ux.intel.com, jack@...e.cz, cl@...two.org,
jglisse@...gle.com, surenb@...gle.com, zokeefe@...gle.com,
hannes@...xchg.org, rientjes@...gle.com, mhocko@...e.com,
rdunlap@...radead.org, hughd@...gle.com
Subject: Re: [PATCH v10 00/13] khugepaged: mTHP support
On 05/09/2025 12:55, David Hildenbrand wrote:
> On 05.09.25 13:48, Lorenzo Stoakes wrote:
>> On Wed, Sep 03, 2025 at 08:54:39PM -0600, Nico Pache wrote:
>>> On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif642@...il.com> wrote:
>>>>>>> So I question the utility of max_ptes_none. If you can't tame page faults, then there is only
>>>>>>> limited sense in taming khugepaged. I think there is value in setting max_ptes_none=0 for some
>>>>>>> corner cases, but I am yet to learn why max_ptes_none=123 would make any sense.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> For PMD mapped THPs with THP shrinker, this has changed. You can basically tame pagefaults, as when you encounter
>>>>>> memory pressure, the shrinker kicks in if the value is less than HPAGE_PMD_NR -1 (i.e. 511 for x86), and
>>>>>> will break down those hugepages and free up zero-filled memory.
>>>>>
>>>>> You are not really taming page faults, though, you are undoing what page faults might have messed up :)
>>>>>
>>>>>> I have seen in our prod workloads where
>>>>>> the memory usage and THP usage can spike (usually when the workload starts), but with memory pressure,
>>>>>> the memory usage is lower compared to with max_ptes_none = 511, while still keeping the benefits
>>>>>> of THPs like lower TLB misses.
>>>>>
>>>>> Thanks for raising that: I think the current behavior is in place such that you don't bounce back-and-forth between khugepaged collapse and shrinker-split.
>>>>>
>>>>
>>>> Yes, both collapse and shrinker split hinge on max_ptes_none to prevent one of these things thrashing the effect of the other.
>>> I believe with mTHP support in khugepaged, the max_ptes_none value in
>>> the shrinker must also leverage the 'order' scaling to properly
>>> prevent thrashing.
>>
>> No please do not extend this 'scaling' stuff somewhere else, it's really horrid.
>>
>> We have to find an alternative to that, it's extremely confusing in what is
>> already extremely confusing THP code.
>>
>> As I said before, if we can't have a boolean we need another interface, which
>> makes most sense to be a ratio or in practice, a percentage sysctl.
>>
>> Speaking with David off-list, maybe the answer - if we must have this - is to
>> add a new percentage interface and have this in lock-step with the existing
>> max_ptes_none interface. One updates the other, but internally we're just using
>> the percentage value.
>
> Yes, I'll try hacking something up and sending it as an RFC.
>
>>
>>> I've been testing a patch for this that I might include in the V11.
>>>>
>>>>> There are likely other ways to achieve that, when we have in mind that the thp shrinker will install zero pages and max_ptes_none includes
>>>>> zero pages.
>>>>>
>>>>>>
>>>>>> I do agree that the value of max_ptes_none is magical and different workloads can react very differently
>>>>>> to it. The relationship is definitely not linear. i.e. if I use max_ptes_none = 256, it does not mean
>>>>>> that the memory regression of using THP=always vs THP=madvise is halved.
>>>>>
>>>>> To which value would you set it? Just 510? 0?
>
> Sorry, I missed Usama's reply. Thanks Usama!
>
>>>>>
>>>>
>>>> There are some very large workloads in the meta fleet that I experimented with and found that having
>>>> a small value works out. I experimented with 0, 51 (10%) and 256 (50%). 51 was found to be an optimal
>>>> compromise in terms of application metrics improving, having an acceptable amount of memory regression and
>>>> improved system level metrics (lower TLB misses, lower page faults). I am sure there was a better value out
>>>> there for these workloads, but not possible to experiment with every value.
>>
>> (->Usama) It's a pity that such workloads exist. But then the percentage solution should work.
>
> Good. So if there is no strong case for > 255, that's already valuable for mTHP.
>
tbh the default value of 511 is horrible. I have thought about sending a patch to change it to 0 as default
in upstream for some time, but it might mean that people who upgrade their kernel might suddenly see
their memory not getting hugified and it could be confusing for them?
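
To make the numbers in this thread concrete, here is a minimal userspace sketch of
how a percentage could map onto max_ptes_none. It is not code from the series:
pct_to_max_ptes_none() and max_none_for_order() are hypothetical helper names, and
the constants assume x86-64 with 4 KiB base pages (HPAGE_PMD_NR = 512). The rounding
is chosen so that 10% gives 51 and 50% gives 256, matching the values quoted above.

```c
#include <assert.h>

/* Assumption: x86-64 with 4 KiB base pages, so a PMD-sized THP spans 512 PTEs. */
#define HPAGE_PMD_ORDER 9u
#define HPAGE_PMD_NR    (1u << HPAGE_PMD_ORDER)

/*
 * Hypothetical helper: translate a percentage (0..100) of a PMD range into a
 * max_ptes_none value. The existing knob accepts at most HPAGE_PMD_NR - 1,
 * so 100% is clamped to 511 here.
 */
static unsigned int pct_to_max_ptes_none(unsigned int pct)
{
	unsigned int v;

	if (pct > 100)
		pct = 100;
	v = HPAGE_PMD_NR * pct / 100;	/* integer division rounds down */
	return v < HPAGE_PMD_NR ? v : HPAGE_PMD_NR - 1;
}

/*
 * Hypothetical helper: apply the same percentage to an order-N range
 * (1 << order PTEs) instead of scaling the PMD-level value.
 */
static unsigned int max_none_for_order(unsigned int pct, unsigned int order)
{
	unsigned int nr, v;

	assert(order <= HPAGE_PMD_ORDER);
	if (pct > 100)
		pct = 100;
	nr = 1u << order;
	v = nr * pct / 100;
	return v < nr ? v : nr - 1;
}
```

With a single percentage as the internal value, the PMD-level limit and any
per-order limit come from the same number, which is roughly what the "lock-step"
idea quoted above would need.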