linux-kernel - Re: [PATCH v10 12/13] khugepaged: add per-order mTHP khugepaged stats

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <69e9c0e9-25bb-4ff6-8469-d9137a5e5a75@lucifer.local>
Date: Thu, 21 Aug 2025 15:47:52 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Nico Pache <npache@...hat.com>
Cc: linux-mm@...ck.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
        david@...hat.com, ziy@...dia.com, baolin.wang@...ux.alibaba.com,
        Liam.Howlett@...cle.com, ryan.roberts@....com, dev.jain@....com,
        corbet@....net, rostedt@...dmis.org, mhiramat@...nel.org,
        mathieu.desnoyers@...icios.com, akpm@...ux-foundation.org,
        baohua@...nel.org, willy@...radead.org, peterx@...hat.com,
        wangkefeng.wang@...wei.com, usamaarif642@...il.com,
        sunnanyong@...wei.com, vishal.moola@...il.com,
        thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
        kirill.shutemov@...ux.intel.com, aarcange@...hat.com,
        raquini@...hat.com, anshuman.khandual@....com, catalin.marinas@....com,
        tiwai@...e.de, will@...nel.org, dave.hansen@...ux.intel.com,
        jack@...e.cz, cl@...two.org, jglisse@...gle.com, surenb@...gle.com,
        zokeefe@...gle.com, hannes@...xchg.org, rientjes@...gle.com,
        mhocko@...e.com, rdunlap@...radead.org, hughd@...gle.com
Subject: Re: [PATCH v10 12/13] khugepaged: add per-order mTHP khugepaged stats

On Tue, Aug 19, 2025 at 08:16:10AM -0600, Nico Pache wrote:
> With mTHP support inplace, let add the per-order mTHP stats for
> exceeding NONE, SWAP, and SHARED.
>

This is really not enough of a commit message. Exceeding what, where, why,
how? What does 'exceeding' mean here, etc. etc. More words please :)

> Signed-off-by: Nico Pache <npache@...hat.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 17 +++++++++++++++++
>  include/linux/huge_mm.h                    |  3 +++
>  mm/huge_memory.c                           |  7 +++++++
>  mm/khugepaged.c                            | 16 +++++++++++++---
>  4 files changed, 40 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 7ccb93e22852..b85547ac4fe9 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -705,6 +705,23 @@ nr_anon_partially_mapped
>         an anonymous THP as "partially mapped" and count it here, even though it
>         is not actually partially mapped anymore.
>
> +collapse_exceed_swap_pte
> +       The number of anonymous THP which contain at least one swap PTE.

The number of anonymous THP what? Pages? Let's be specific.

> +       Currently khugepaged does not support collapsing mTHP regions that
> +       contain a swap PTE.

Wait what? So we have a counter for something that's unsupported? That
seems not so useful?

> +
> +collapse_exceed_none_pte
> +       The number of anonymous THP which have exceeded the none PTE threshold.

THP pages. What's the 'none PTE threshold'? Do you mean
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none ?

Let's spell that out please, this is far too vague.

> +       With mTHP collapse, a bitmap is used to gather the state of a PMD region
> +       and is then recursively checked from largest to smallest order against
> +       the scaled max_ptes_none count. This counter indicates that the next
> +       enabled order will be checked.

I think you really need to expand upon this as this is confusing and vague.

I also don't think saying 'recursive' here really benefits anything, Just
saying that we try to collapse the largest mTHP size we can in each
instance, and then give a more 'words-y' explanation as to how
max_ptes_none is (in effect) converted to a ratio of a PMD, and then that
ratio is applied to the mTHP sizes.

You can then go on to say that this counter measures the number of
occasions in which this occurred.

> +
> +collapse_exceed_shared_pte
> +       The number of anonymous THP which contain at least one shared PTE.

anonymous THP pages right? :)

> +       Currently khugepaged does not support collapsing mTHP regions that
> +       contain a shared PTE.

Again I don't really understand the purpose of creating a counter for
something we don't support.

Let's add it when we support it.

I also in this case and the exceed swap case don't understand what you mean
by exceed here, you need to spell this out clearly.

Perhaps the context missing here is that you _also_ count THP events in
these counters.

But again, given we have THP_... counters for the stats mTHP doesn't do
yet, I'd say adding these is pointless.

> +
>  As the system ages, allocating huge pages may be expensive as the
>  system uses memory compaction to copy data around memory to free a
>  huge page for use. There are some counters in ``/proc/vmstat`` to help
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4ada5d1f7297..6f1593d0b4b5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -144,6 +144,9 @@ enum mthp_stat_item {
>  	MTHP_STAT_SPLIT_DEFERRED,
>  	MTHP_STAT_NR_ANON,
>  	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> +	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> +	MTHP_STAT_COLLAPSE_EXCEED_NONE,
> +	MTHP_STAT_COLLAPSE_EXCEED_SHARED,

Wh do we put 'collapse' here but not in the THP equivalents?

>  	__MTHP_STAT_COUNT
>  };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 20d005c2c61f..9f0470c3e983 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -639,6 +639,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
>  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
>  DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
>  DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> +
>
>  static struct attribute *anon_stats_attrs[] = {
>  	&anon_fault_alloc_attr.attr,
> @@ -655,6 +659,9 @@ static struct attribute *anon_stats_attrs[] = {
>  	&split_deferred_attr.attr,
>  	&nr_anon_attr.attr,
>  	&nr_anon_partially_mapped_attr.attr,
> +	&collapse_exceed_swap_pte_attr.attr,
> +	&collapse_exceed_none_pte_attr.attr,
> +	&collapse_exceed_shared_pte_attr.attr,
>  	NULL,
>  };
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c13bc583a368..5a3386043f39 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -594,7 +594,9 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);

Hm so wait you were miscounting statistics in patch 10/13 when you turned
all this one? That's not good.

This should be in place _first_ before enabling the feature.

> +				if (order == HPAGE_PMD_ORDER)
> +					count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
>  				goto out;
>  			}
>  		}
> @@ -633,10 +635,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  			 * shared may cause a future higher order collapse on a
>  			 * rescan of the same range.
>  			 */
> -			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared)) {
> +			if (order != HPAGE_PMD_ORDER) {

Hm wait what? I dont understand what's going on here? You're no longer
actually doing any check except order != HPAGE_PMD_ORDER?... am I missnig
something?

Again why we are bothering to maintain a counter that doesn't mean anything
I don't know? I may be misinterpreting somehow however.

> +				result = SCAN_EXCEED_SHARED_PTE;
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> +				goto out;
> +			}
> +
> +			if (cc->is_khugepaged &&
> +			    shared > khugepaged_max_ptes_shared) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +				count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
>  				goto out;
>  			}
>  		}
> @@ -1084,6 +1093,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>  		 * range.
>  		 */
>  		if (order != HPAGE_PMD_ORDER) {
> +			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);

This again seems surely to not be testing for what it claims to be
tracking? I may again be missing context here.

>  			pte_unmap(pte);
>  			mmap_read_unlock(mm);
>  			result = SCAN_EXCEED_SWAP_PTE;
> --
> 2.50.1
>