Message-ID: <bf03c2e2-66fc-4745-952a-de3fbf65c4ab@redhat.com>
Date: Mon, 1 Sep 2025 19:06:21 +0200
From: David Hildenbrand <david@...hat.com>
To: Nico Pache <npache@...hat.com>, linux-mm@...ck.org,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org
Cc: ziy@...dia.com, baolin.wang@...ux.alibaba.com,
lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com, ryan.roberts@....com,
dev.jain@....com, corbet@....net, rostedt@...dmis.org, mhiramat@...nel.org,
mathieu.desnoyers@...icios.com, akpm@...ux-foundation.org,
baohua@...nel.org, willy@...radead.org, peterx@...hat.com,
wangkefeng.wang@...wei.com, usamaarif642@...il.com, sunnanyong@...wei.com,
vishal.moola@...il.com, thomas.hellstrom@...ux.intel.com,
yang@...amperecomputing.com, kirill.shutemov@...ux.intel.com,
aarcange@...hat.com, raquini@...hat.com, anshuman.khandual@....com,
catalin.marinas@....com, tiwai@...e.de, will@...nel.org,
dave.hansen@...ux.intel.com, jack@...e.cz, cl@...two.org,
jglisse@...gle.com, surenb@...gle.com, zokeefe@...gle.com,
hannes@...xchg.org, rientjes@...gle.com, mhocko@...e.com,
rdunlap@...radead.org, hughd@...gle.com
Subject: Re: [PATCH v10 00/13] khugepaged: mTHP support
On 01.09.25 18:21, David Hildenbrand wrote:
> On 19.08.25 15:41, Nico Pache wrote:
>> The following series provides khugepaged with the capability to collapse
>> anonymous memory regions to mTHPs.
>>
>> To achieve this we generalize the khugepaged functions to no longer depend
>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track chunks of
>> pages (defined by KHUGEPAGED_MTHP_MIN_ORDER) that are utilized. After the
>> PMD scan is done, we do binary recursion on the bitmap to find the optimal
>> mTHP sizes for the PMD range. The restriction on max_ptes_none is removed
>> during the scan, to make sure we account for the whole PMD range. When no
>> mTHP size is enabled, the legacy behavior of khugepaged is maintained.
>> max_ptes_none will be scaled by the attempted collapse order to determine
>> how full an mTHP must be to be eligible for the collapse to occur. If an
>> mTHP collapse is attempted but the range contains swapped-out or shared
>> pages, we don't perform the collapse. It is now also possible to collapse
>> to mTHPs without requiring the PMD THP size to be enabled.
>>
>> With the default max_ptes_none=511, the code should keep most of its
>> original behavior. When enabling multiple adjacent (m)THP sizes we need to
>> set max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will
>> experience collapse "creep" and constantly promote mTHPs to the next
>> available size. This is due to the fact that a collapse will introduce at
>> least 2x the number of pages, and on a future scan will satisfy the
>> promotion condition once again.
>>
>> Patch 1: Refactor/rename hpage_collapse
>> Patch 2: Some refactoring to combine madvise_collapse and khugepaged
>> Patch 3-5: Generalize khugepaged functions for arbitrary orders
>> Patch 6-8: The mTHP patches
>> Patch 9-10: Allow khugepaged to operate without PMD enabled
>> Patch 11-12: Tracing/stats
>> Patch 13: Documentation
>
> Would it be feasible to start with simply not supporting the
> max_ptes_none parameter in the first version, just like we won't support
> max_ptes_swap/max_ptes_shared in the first version?
>
> That gives us more time to think about how to use/modify the old interface.
>
> For example, I could envision a ratio-based interface, or as discussed
> with Lorenzo a simple boolean. We could make the existing max_ptes*
> interface backwards compatible then.
>
> That also gives us the opportunity to think about the creep problem
> separately.
>
> I'm sure initial mTHP collapse will be valuable even without support for
> that weird set of parameters.
>
> Would that be a problem implementation-wise?
>
> But let me think further about the creep problem ... :/
FWIW, I just looked around and there is documented usage of setting
max_ptes_none to 0 [1, 2, 3].
In essence, I think it can make sense to set it to 0 when an application
wants to manage THP on its own (MADV_COLLAPSE), and avoid khugepaged
interfering. Now, using a system-wide toggle for such a use case is
rather questionable, but it's all we have.
I did not find anything recommending values other than 0 or 511 -- so
far.
So *likely* focusing on 0 vs. 511 initially would cover most use cases
out there. Ignoring the parameter initially (require all to be !none)
could of course also work.
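To make the 0 vs. 511 distinction and the creep concern a bit more concrete, here is a rough sketch in Python. It assumes 4 KiB base pages (so HPAGE_PMD_ORDER = 9) and that max_ptes_none is scaled for a smaller order by right-shifting it by (HPAGE_PMD_ORDER - order). That scaling rule and all helper names below are assumptions made for illustration, not the kernel's actual functions.

```python
# Assumption: 4 KiB base pages, 2 MiB PMD-sized THP.
HPAGE_PMD_ORDER = 9
HPAGE_PMD_NR = 1 << HPAGE_PMD_ORDER  # 512 PTEs per PMD


def scaled_max_ptes_none(max_ptes_none, order):
    """Hypothetical scaling: shrink the PMD-level limit to the given order."""
    return max_ptes_none >> (HPAGE_PMD_ORDER - order)


def can_collapse(present, order, max_ptes_none):
    """True if a range of 2**order PTEs with `present` populated PTEs
    has few enough empty (none) PTEs to be collapsed at that order."""
    nr_ptes = 1 << order
    nr_none = nr_ptes - present
    return nr_none <= scaled_max_ptes_none(max_ptes_none, order)


def creep(order, max_ptes_none):
    """Start from a fully populated mTHP of `order` whose neighbouring
    range is empty, and return the order reached after repeated scans."""
    present = 1 << order
    while order < HPAGE_PMD_ORDER and can_collapse(present, order + 1, max_ptes_none):
        order += 1
        present = 1 << order  # the collapse populates the whole range
    return order


if __name__ == "__main__":
    for limit in (0, 255, 511):
        print(f"max_ptes_none={limit}: order 4 creeps to order {creep(4, limit)}")
```

Under these assumptions, 511 lets a single populated order-4 mTHP be promoted step by step all the way to PMD size, 255 stops after the first collapse, and 0 only ever collapses fully populated ranges.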
[1] https://www.mongodb.com/docs/manual/administration/tcmalloc-performance/
[2] https://google.github.io/tcmalloc/tuning.html
[3] https://support.yugabyte.com/hc/en-us/articles/36558155921165-Mitigating-Excessive-RSS-Memory-Usage-Due-to-THP-Transparent-Huge-Pages
--
Cheers
David / dhildenb