linux-kernel - Re: [PATCH v12 mm-new 09/15] khugepaged: add per-order mTHP collapse failure statistics

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAA1CXcDT19rV_08pVP7CLuUZiVHW_1rSOv2oMXUHyRxh5sGCcA@mail.gmail.com>
Date: Fri, 7 Nov 2025 10:14:10 -0700
From: Nico Pache <npache@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org, 
	linux-mm@...ck.org, linux-doc@...r.kernel.org, david@...hat.com, 
	ziy@...dia.com, baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com, 
	ryan.roberts@....com, dev.jain@....com, corbet@....net, rostedt@...dmis.org, 
	mhiramat@...nel.org, mathieu.desnoyers@...icios.com, 
	akpm@...ux-foundation.org, baohua@...nel.org, willy@...radead.org, 
	peterx@...hat.com, wangkefeng.wang@...wei.com, usamaarif642@...il.com, 
	sunnanyong@...wei.com, vishal.moola@...il.com, 
	thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com, kas@...nel.org, 
	aarcange@...hat.com, raquini@...hat.com, anshuman.khandual@....com, 
	catalin.marinas@....com, tiwai@...e.de, will@...nel.org, 
	dave.hansen@...ux.intel.com, jack@...e.cz, cl@...two.org, jglisse@...gle.com, 
	surenb@...gle.com, zokeefe@...gle.com, hannes@...xchg.org, 
	rientjes@...gle.com, mhocko@...e.com, rdunlap@...radead.org, hughd@...gle.com, 
	richard.weiyang@...il.com, lance.yang@...ux.dev, vbabka@...e.cz, 
	rppt@...nel.org, jannh@...gle.com, pfalcato@...e.de
Subject: Re: [PATCH v12 mm-new 09/15] khugepaged: add per-order mTHP collapse
 failure statistics

On Thu, Nov 6, 2025 at 11:47 AM Lorenzo Stoakes
<lorenzo.stoakes@...cle.com> wrote:
>
> On Wed, Oct 22, 2025 at 12:37:11PM -0600, Nico Pache wrote:
> > Add three new mTHP statistics to track collapse failures for different
> > orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
> >
> > - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to swap
> >       PTEs
> >
> > - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
> >       exceeding the none PTE threshold for the given order
> >
> > - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared
> >       PTEs
> >
> > These statistics complement the existing THP_SCAN_EXCEED_* events by
> > providing per-order granularity for mTHP collapse attempts. The stats are
> > exposed via sysfs under
> > `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> > supported hugepage size.
> >
> > As we currently dont support collapsing mTHPs that contain a swap or
> > shared entry, those statistics keep track of how often we are
> > encountering failed mTHP collapses due to these restrictions.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@...ux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@...hat.com>
> > ---
> >  Documentation/admin-guide/mm/transhuge.rst | 23 ++++++++++++++++++++++
> >  include/linux/huge_mm.h                    |  3 +++
> >  mm/huge_memory.c                           |  7 +++++++
> >  mm/khugepaged.c                            | 16 ++++++++++++---
> >  4 files changed, 46 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> > index 13269a0074d4..7c71cda8aea1 100644
> > --- a/Documentation/admin-guide/mm/transhuge.rst
> > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > @@ -709,6 +709,29 @@ nr_anon_partially_mapped
> >         an anonymous THP as "partially mapped" and count it here, even though it
> >         is not actually partially mapped anymore.
> >
> > +collapse_exceed_none_pte
> > +       The number of anonymous mTHP pte ranges where the number of none PTEs
>
> Ranges? Is the count per-mTHP folio? Or per PTE entry? Let's clarify.

I dont know the proper terminology. But what we have here is a range
of PTEs that is being considered for mTHP folio collapse; however, it
is still not a mTHP folio which is why I hesitated to call it that.

Given this counter is per mTHP size I think the proper way to say this would be:

The number of collapse attempts that failed due to exceeding the
max_ptes_none threshold.

>
> > +       exceeded the max_ptes_none threshold. For mTHP collapse, khugepaged
> > +       checks a PMD region and tracks which PTEs are present. It then tries
> > +       to collapse to the largest enabled mTHP size. The allowed number of empty
>
> Well and then tries to collapse to the next and etc. right? So maybe worth
> mentioning?
>
> > +       PTEs is the max_ptes_none threshold scaled by the collapse order. This
>
> I think this needs clarification, scaled how? Also obviously with the proposed
> new approach we will need to correct this to reflect the 511/0 situation.
>
> > +       counter records the number of times a collapse attempt was skipped for
> > +       this reason, and khugepaged moved on to try the next available mTHP size.
>
> OK you mention the moving on here, so for each attempted mTHP size which exeeds
> max_none_pte we increment this stat correct? Probably worth clarifying that.
>
> > +
> > +collapse_exceed_swap_pte
> > +       The number of anonymous mTHP pte ranges which contain at least one swap
> > +       PTE. Currently khugepaged does not support collapsing mTHP regions
> > +       that contain a swap PTE. This counter can be used to monitor the
> > +       number of khugepaged mTHP collapses that failed due to the presence
> > +       of a swap PTE.
>
> OK so as soon as we encounter a swap PTE we abort and this counts each instance
> of that?
>
> I guess worth spelling that out? Given we don't support it, surely the opening
> description should be 'The number of anonymous mTHP PTE ranges which were unable
> to be collapsed due to containing one or more swap PTEs'.
>
> > +
> > +collapse_exceed_shared_pte
> > +       The number of anonymous mTHP pte ranges which contain at least one shared
> > +       PTE. Currently khugepaged does not support collapsing mTHP pte ranges
> > +       that contain a shared PTE. This counter can be used to monitor the
> > +       number of khugepaged mTHP collapses that failed due to the presence
> > +       of a shared PTE.
>
> Same comments as above.
>
> > +
> >  As the system ages, allocating huge pages may be expensive as the
> >  system uses memory compaction to copy data around memory to free a
> >  huge page for use. There are some counters in ``/proc/vmstat`` to help
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 3d29624c4f3f..4b2773235041 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -144,6 +144,9 @@ enum mthp_stat_item {
> >       MTHP_STAT_SPLIT_DEFERRED,
> >       MTHP_STAT_NR_ANON,
> >       MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> > +     MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> > +     MTHP_STAT_COLLAPSE_EXCEED_NONE,
> > +     MTHP_STAT_COLLAPSE_EXCEED_SHARED,
> >       __MTHP_STAT_COUNT
> >  };
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 0063d1ba926e..7335b92969d6 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -638,6 +638,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
> >  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
> >  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
> >  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> > +
> >
> >  static struct attribute *anon_stats_attrs[] = {
> >       &anon_fault_alloc_attr.attr,
> > @@ -654,6 +658,9 @@ static struct attribute *anon_stats_attrs[] = {
> >       &split_deferred_attr.attr,
> >       &nr_anon_attr.attr,
> >       &nr_anon_partially_mapped_attr.attr,
> > +     &collapse_exceed_swap_pte_attr.attr,
> > +     &collapse_exceed_none_pte_attr.attr,
> > +     &collapse_exceed_shared_pte_attr.attr,
> >       NULL,
> >  };
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index d741af15e18c..053202141ea3 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -592,7 +592,9 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                               continue;
> >                       } else {
> >                               result = SCAN_EXCEED_NONE_PTE;
> > -                             count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > +                             if (order == HPAGE_PMD_ORDER)
> > +                                     count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> >                               goto out;
> >                       }
> >               }
> > @@ -622,10 +624,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                        * shared may cause a future higher order collapse on a
> >                        * rescan of the same range.
> >                        */
> > -                     if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> > -                         shared > khugepaged_max_ptes_shared)) {
> > +                     if (order != HPAGE_PMD_ORDER) {
>

Thanks for the review! I'll go clean these up for the next version

> A little nit/idea in general for series - since we do this order !=
> HPAGE_PMD_ORDER check all over, maybe have a predict function like:
>
> static bool is_mthp_order(unsigned int order)
> {
>         return order != HPAGE_PMD_ORDER;
> }

sure!

>
> > +                             result = SCAN_EXCEED_SHARED_PTE;
> > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> > +                             goto out;
> > +                     }
> > +
> > +                     if (cc->is_khugepaged &&
> > +                         shared > khugepaged_max_ptes_shared) {
> >                               result = SCAN_EXCEED_SHARED_PTE;
> >                               count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> > +                             count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>
> OK I _think_ I mentioned this in a previous revision so forgive me for being
> repetitious but we also count PMD orders here?
>
> But in the MTHP_STAT_COLLAPSE_EXCEED_NONE and MTP_STAT_COLLAPSE_EXCEED_SWAP
> cases we don't? Why's that?

Hmm I could have sworn I fixed that... perhaps I reintroduced the
missing stat update when I had to rebase/undo the cleanup series by
Lance. I will fix this.


Cheers.
-- Nico
>
>
> >                               goto out;
> >                       }
> >               }
> > @@ -1073,6 +1082,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >                * range.
> >                */
> >               if (order != HPAGE_PMD_ORDER) {
> > +                     count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> >                       pte_unmap(pte);
> >                       mmap_read_unlock(mm);
> >                       result = SCAN_EXCEED_SWAP_PTE;
> > --
> > 2.51.0
> >
>
> Thanks, Lorenzo
>