[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAA1CXcD+E8kSA2P5JchayQb5wrwmUVPRCK+W-qiD=Xtz03ti+Q@mail.gmail.com>
Date: Tue, 9 Sep 2025 00:36:54 -0600
From: Nico Pache <npache@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: linux-mm@...ck.org, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
david@...hat.com, ziy@...dia.com, baolin.wang@...ux.alibaba.com,
Liam.Howlett@...cle.com, ryan.roberts@....com, dev.jain@....com,
corbet@....net, rostedt@...dmis.org, mhiramat@...nel.org,
mathieu.desnoyers@...icios.com, akpm@...ux-foundation.org, baohua@...nel.org,
willy@...radead.org, peterx@...hat.com, wangkefeng.wang@...wei.com,
usamaarif642@...il.com, sunnanyong@...wei.com, vishal.moola@...il.com,
thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
kirill.shutemov@...ux.intel.com, aarcange@...hat.com, raquini@...hat.com,
anshuman.khandual@....com, catalin.marinas@....com, tiwai@...e.de,
will@...nel.org, dave.hansen@...ux.intel.com, jack@...e.cz, cl@...two.org,
jglisse@...gle.com, surenb@...gle.com, zokeefe@...gle.com, hannes@...xchg.org,
rientjes@...gle.com, mhocko@...e.com, rdunlap@...radead.org, hughd@...gle.com
Subject: Re: [PATCH v10 12/13] khugepaged: add per-order mTHP khugepaged stats
On Thu, Aug 21, 2025 at 8:49 AM Lorenzo Stoakes
<lorenzo.stoakes@...cle.com> wrote:
>
> On Tue, Aug 19, 2025 at 08:16:10AM -0600, Nico Pache wrote:
> > With mTHP support inplace, let add the per-order mTHP stats for
> > exceeding NONE, SWAP, and SHARED.
> >
>
> This is really not enough of a commit message. Exceeding what, where, why,
> how? What does 'exceeding' mean here, etc. etc. More words please :)
Ok I will add more in the next version
>
> > Signed-off-by: Nico Pache <npache@...hat.com>
> > ---
> > Documentation/admin-guide/mm/transhuge.rst | 17 +++++++++++++++++
> > include/linux/huge_mm.h | 3 +++
> > mm/huge_memory.c | 7 +++++++
> > mm/khugepaged.c | 16 +++++++++++++---
> > 4 files changed, 40 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> > index 7ccb93e22852..b85547ac4fe9 100644
> > --- a/Documentation/admin-guide/mm/transhuge.rst
> > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > @@ -705,6 +705,23 @@ nr_anon_partially_mapped
> > an anonymous THP as "partially mapped" and count it here, even though it
> > is not actually partially mapped anymore.
> >
> > +collapse_exceed_swap_pte
> > + The number of anonymous THP which contain at least one swap PTE.
>
> The number of anonymous THP what? Pages? Let's be specific.
ack
>
> > + Currently khugepaged does not support collapsing mTHP regions that
> > + contain a swap PTE.
>
> Wait what? So we have a counter for something that's unsupported? That
> seems not so useful?
The current implementation does not support swapped out or shared
pages. However these counters allow us to monitor when a mTHP collapse
fails due to exceeding the threshold (ie 0, hitting any swapped out or
shared page)
>
> > +
> > +collapse_exceed_none_pte
> > + The number of anonymous THP which have exceeded the none PTE threshold.
>
> THP pages. What's the 'none PTE threshold'? Do you mean
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none ?
ack, I will expand these descriptions
>
> Let's spell that out please, this is far too vague.
>
> > + With mTHP collapse, a bitmap is used to gather the state of a PMD region
> > + and is then recursively checked from largest to smallest order against
> > + the scaled max_ptes_none count. This counter indicates that the next
> > + enabled order will be checked.
>
> I think you really need to expand upon this as this is confusing and vague.
>
> I also don't think saying 'recursive' here really benefits anything, Just
> saying that we try to collapse the largest mTHP size we can in each
> instance, and then give a more 'words-y' explanation as to how
> max_ptes_none is (in effect) converted to a ratio of a PMD, and then that
> ratio is applied to the mTHP sizes.
>
> You can then go on to say that this counter measures the number of
> occasions in which this occurred.
ack I will clean it up
>
> > +
> > +collapse_exceed_shared_pte
> > + The number of anonymous THP which contain at least one shared PTE.
>
> anonymous THP pages right? :)
regions?
>
> > + Currently khugepaged does not support collapsing mTHP regions that
> > + contain a shared PTE.
>
> Again I don't really understand the purpose of creating a counter for
> something we don't support.
see above
>
> Let's add it when we support it.
>
> I also in this case and the exceed swap case don't understand what you mean
> by exceed here, you need to spell this out clearly.
>
> Perhaps the context missing here is that you _also_ count THP events in
> these counters.
>
> But again, given we have THP_... counters for the stats mTHP doesn't do
> yet, I'd say adding these is pointless.
>
> > +
> > As the system ages, allocating huge pages may be expensive as the
> > system uses memory compaction to copy data around memory to free a
> > huge page for use. There are some counters in ``/proc/vmstat`` to help
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 4ada5d1f7297..6f1593d0b4b5 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -144,6 +144,9 @@ enum mthp_stat_item {
> > MTHP_STAT_SPLIT_DEFERRED,
> > MTHP_STAT_NR_ANON,
> > MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> > + MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> > + MTHP_STAT_COLLAPSE_EXCEED_NONE,
> > + MTHP_STAT_COLLAPSE_EXCEED_SHARED,
>
> Wh do we put 'collapse' here but not in the THP equivalents?
to indicate they come from the collapse functionality. I can shorten
it by removing COLLAPSE if youd like
>
> > __MTHP_STAT_COUNT
> > };
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 20d005c2c61f..9f0470c3e983 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -639,6 +639,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
> > DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
> > DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
> > DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> > +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> > +
> >
> > static struct attribute *anon_stats_attrs[] = {
> > &anon_fault_alloc_attr.attr,
> > @@ -655,6 +659,9 @@ static struct attribute *anon_stats_attrs[] = {
> > &split_deferred_attr.attr,
> > &nr_anon_attr.attr,
> > &nr_anon_partially_mapped_attr.attr,
> > + &collapse_exceed_swap_pte_attr.attr,
> > + &collapse_exceed_none_pte_attr.attr,
> > + &collapse_exceed_shared_pte_attr.attr,
> > NULL,
> > };
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index c13bc583a368..5a3386043f39 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -594,7 +594,9 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > continue;
> > } else {
> > result = SCAN_EXCEED_NONE_PTE;
> > - count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>
> Hm so wait you were miscounting statistics in patch 10/13 when you turned
> all this one? That's not good.
>
> This should be in place _first_ before enabling the feature.
Ok I can move them around.
>
> > + if (order == HPAGE_PMD_ORDER)
> > + count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> > goto out;
> > }
> > }
> > @@ -633,10 +635,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > * shared may cause a future higher order collapse on a
> > * rescan of the same range.
> > */
> > - if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> > - shared > khugepaged_max_ptes_shared)) {
> > + if (order != HPAGE_PMD_ORDER) {
>
> Hm wait what? I dont understand what's going on here? You're no longer
> actually doing any check except order != HPAGE_PMD_ORDER?... am I missnig
> something?
>
> Again why we are bothering to maintain a counter that doesn't mean anything
> I don't know? I may be misinterpreting somehow however.
>
> > + result = SCAN_EXCEED_SHARED_PTE;
> > + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> > + goto out;
> > + }
> > +
> > + if (cc->is_khugepaged &&
> > + shared > khugepaged_max_ptes_shared) {
> > result = SCAN_EXCEED_SHARED_PTE;
> > count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> > + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> > goto out;
> > }
> > }
> > @@ -1084,6 +1093,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> > * range.
> > */
> > if (order != HPAGE_PMD_ORDER) {
> > + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
>
> This again seems surely to not be testing for what it claims to be
> tracking? I may again be missing context here.
We are bailing out of the mTHP collapse due to it having a SWAP page.
In turn exceeding our threshold of 0.
Cheers,
-- Nico
>
> > pte_unmap(pte);
> > mmap_read_unlock(mm);
> > result = SCAN_EXCEED_SWAP_PTE;
> > --
> > 2.50.1
> >
>
Powered by blists - more mailing lists