[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAA1CXcCvOM+ddk1-d=8d7aFG0ZZ+shKhNCa6ppewSxS8BHW0OA@mail.gmail.com>
Date: Tue, 28 Oct 2025 07:57:13 -0600
From: Nico Pache <npache@...hat.com>
To: Baolin Wang <baolin.wang@...ux.alibaba.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-doc@...r.kernel.org, david@...hat.com, ziy@...dia.com,
Liam.Howlett@...cle.com, ryan.roberts@....com, dev.jain@....com,
corbet@....net, rostedt@...dmis.org, mhiramat@...nel.org,
mathieu.desnoyers@...icios.com, akpm@...ux-foundation.org, baohua@...nel.org,
willy@...radead.org, peterx@...hat.com, wangkefeng.wang@...wei.com,
usamaarif642@...il.com, sunnanyong@...wei.com, vishal.moola@...il.com,
thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com, kas@...nel.org,
aarcange@...hat.com, raquini@...hat.com, anshuman.khandual@....com,
catalin.marinas@....com, tiwai@...e.de, will@...nel.org,
dave.hansen@...ux.intel.com, jack@...e.cz, cl@...two.org, jglisse@...gle.com,
surenb@...gle.com, zokeefe@...gle.com, hannes@...xchg.org,
rientjes@...gle.com, mhocko@...e.com, rdunlap@...radead.org, hughd@...gle.com,
richard.weiyang@...il.com, lance.yang@...ux.dev, vbabka@...e.cz,
rppt@...nel.org, jannh@...gle.com, pfalcato@...e.de
Subject: Re: [PATCH v12 mm-new 06/15] khugepaged: introduce
collapse_max_ptes_none helper function
On Tue, Oct 28, 2025 at 4:10 AM Baolin Wang
<baolin.wang@...ux.alibaba.com> wrote:
>
>
>
> On 2025/10/28 01:53, Lorenzo Stoakes wrote:
> > On Wed, Oct 22, 2025 at 12:37:08PM -0600, Nico Pache wrote:
> >> The current mechanism for determining mTHP collapse scales the
> >> khugepaged_max_ptes_none value based on the target order. This
> >> introduces an undesirable feedback loop, or "creep", when max_ptes_none
> >> is set to a value greater than HPAGE_PMD_NR / 2.
> >>
> >> With this configuration, a successful collapse to order N will populate
> >> enough pages to satisfy the collapse condition on order N+1 on the next
> >> scan. This leads to unnecessary work and memory churn.
> >>
> >> To fix this issue introduce a helper function that caps the max_ptes_none
> >> to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
> >> the max_ptes_none number by the (PMD_ORDER - target collapse order).
> >>
> >> The limits can be ignored by passing full_scan=true, this is useful for
> >> madvise_collapse (which ignores limits), or in the case of
> >> collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
> >> collapse is available.
> >>
> >> Signed-off-by: Nico Pache <npache@...hat.com>
> >> ---
> >> mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
> >> 1 file changed, 34 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 4ccebf5dda97..286c3a7afdee 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -459,6 +459,39 @@ void __khugepaged_enter(struct mm_struct *mm)
> >> wake_up_interruptible(&khugepaged_wait);
> >> }
> >>
> >> +/**
> >> + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> >> + * @order: The folio order being collapsed to
> >> + * @full_scan: Whether this is a full scan (ignore limits)
> >> + *
> >> + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
> >> + * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
> >> + *
> >> + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> >> + * khugepaged_max_ptes_none value.
> >> + *
> >> + * For mTHP collapses, scale down the max_ptes_none proportionally to the folio
> >> + * order, but caps it at HPAGE_PMD_NR/2-1 to prevent a collapse feedback loop.
> >> + *
> >> + * Return: Maximum number of empty PTEs allowed for the collapse operation
> >> + */
> >> +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> >> +{
> >> + unsigned int max_ptes_none;
> >> +
> >> + /* ignore max_ptes_none limits */
> >> + if (full_scan)
> >> + return HPAGE_PMD_NR - 1;
> >> +
> >> + if (order == HPAGE_PMD_ORDER)
> >> + return khugepaged_max_ptes_none;
> >> +
> >> + max_ptes_none = min(khugepaged_max_ptes_none, HPAGE_PMD_NR/2 - 1);
> >
> > I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> > to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> > implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> >
> > I'm still really quite uncomfortable with us silently capping this value.
> >
> > If we're putting forward theoretical ideas that are to be later built upon, this
> > series should be an RFC.
> >
> > But if we really intend to silently ignore user input the problem is that then
> > becomes established uAPI.
> >
> > I think it's _sensible_ to avoid this mTHP escalation problem, but the issue is
> > visibility I think.
> >
> > I think people are going to find it odd that you set it to something, but then
> > get something else.
> >
> > As an alternative we could have a new sysfs field:
> >
> > /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
> >
> > That shows the cap clearly.
> >
> > In fact, it could be read-only... and just expose it to the user. That reduces
> > complexity.
> >
> > We can then bring in eagerness later and have the same situation of
> > max_ptes_none being a parameter that exists (plus this additional read-only
> > parameter).
>
Hey Baolin,
> We all know that ultimately using David's suggestion to add the
> 'eagerness' tunable parameter is the best approach, but for now, we need
> an initial version to support mTHP collapse (as we've already discussed
> extensively here:)).
>
> I don't like the idea of adding another and potentially confusing
> 'max_mthp_ptes_none' interface, which might make it more difficult to
> accommodate the 'eagerness' parameter in the future.
>
> If Nico's current proposal still doesn't satisfy everyone, I personally
> lean towards David's earlier simplified approach:
> max_ptes_none == 511 -> collapse mTHP always
> max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero
>
> Let's first have an initial approach in place, which will also simplify
> the following addition of the 'eagerness' tunable parameter.
>
> Nico, Lorenzo, and David, what do you think?
I still believe capping it at PMD_NR/2 provides the right mix between
preventing the undesired behavior, and keeping some degree of
tunability, as the admin guides suggests max_ptes_none should be used.
I would be willing to compromise and take this other approach until
the "eagerness" is in place. However, I do believe David's idea for
eagerness is to also cap the max_ptes_none at PMD_NR/2 for the second
to highest eagerness level (ie, 511, 255, ...). So in practice, we
won't see any behavioral changes when that series comes around;
whereas setting max_ptes_none=0 for mTHP initially, then adding
eagerness will result in a change in behavior from the initial
implementation.
With that said, Lorenzo, David, What's the final verdict?
-- Nico
>
> Code should be:
> static unsigned int collapse_max_ptes_none(unsigned int order, bool
> full_scan)
> {
> unsigned int max_ptes_none;
>
> /* ignore max_ptes_none limits */
> if (full_scan)
> return HPAGE_PMD_NR - 1;
>
> if (order == HPAGE_PMD_ORDER)
> return khugepaged_max_ptes_none;
>
> /*
> * For mTHP collapse, we can simplify the logic:
> * max_ptes_none == 511 -> collapse mTHP always
> * max_ptes_none != 511 -> collapse mTHP only if we all PTEs
> are non-none/zero
> */
> if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER -
> order);
>
> return 0;
> }
Side note: Thank you Baolin for your review/testing of the V12 :)
>
Powered by blists - more mailing lists