Message-ID: <063273aa-a852-492a-93da-ba5229f544ca@lucifer.local>
Date: Sat, 7 Jun 2025 09:18:13 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, david@...hat.com,
linux-mm@...ck.org, hannes@...xchg.org, shakeel.butt@...ux.dev,
riel@...riel.com, ziy@...dia.com, baolin.wang@...ux.alibaba.com,
Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
dev.jain@....com, hughd@...gle.com, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, kernel-team@...a.com
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for
min_free_kbytes
It's important to base against mm-new for new mm work - PAGE_BLOCK_ORDER was
renamed to PAGE_BLOCK_MAX_ORDER in Zi's series at [0], so this doesn't
compile. Please always do a quick rebase + compile check before sending.
[0]: https://lkml.kernel.org/r/20250604211427.1590859-1-ziy@nvidia.com
Overall this seems to me to be implemented at the wrong level of
abstraction - we implement set_recommended_min_free_kbytes() to interact
with the page block mechanism.
While the problem you describe is absolutely a problem and we need to figure
out a way to avoid reserving ridiculous amounts of memory for higher-order
pages, surely we need to figure this out at page block granularity, don't we?
On Fri, Jun 06, 2025 at 03:37:00PM +0100, Usama Arif wrote:
> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
> watermarks are evaluated to extremely high values, for e.g. a server with
> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
> of the sizes set to never, the min, low and high watermarks evaluate to
> 11.2G, 14G and 16.8G respectively.
> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
> and 1G respectively.
> This is because set_recommended_min_free_kbytes is designed for PMD
> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
Right, it is - but it's not really this line: the _pageblock order_ is set to
the minimum of the PMD huge page order and PAGE_BLOCK_MAX_ORDER, as it makes
sense to use page block heuristics to reduce the odds of fragmentation so we
can grab a PMD huge page at a time.
Obviously if the user wants to set a _smaller_ page block order they can,
but if it's larger we want to heuristically avoid fragmentation of
physically contiguous huge page aligned ranges (the whole page block
mechanism).
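From memory the relationship is roughly the below (paraphrased, ignoring the
CONFIG_HUGETLB_PAGE_SIZE_VARIABLE case, so double-check against mm-new rather
than taking this as the literal definition):

	/* Defragment in PMD-huge-page-sized chunks, but never exceed the
	 * configured maximum page block order. */
	pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_MAX_ORDER);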
I absolutely hate how set_recommended_min_free_kbytes() has basically
hacked in some THP considerations but otherwise invokes
calculate_min_free_kbytes()... ugh. But an existing issue.
> Such high watermark values can cause performance and latency issues in
> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
> most of them would never actually use a 512M PMD THP.
512MB, yeah crazy. We've not thought this through, and this is a very real
issue.
Again, it strikes me that we should be changing the page block order for 64K
arm64 rather than this calculation, though.
Keep in mind pageblocks are a heuristic mechanism designed to reduce
fragmentation, the decision could be made to cap how far we're willing to
go with that...
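Spelling out where the 64K numbers come from (back-of-the-envelope, assuming
two populated zones on that 480G box, MIGRATE_PCPTYPES == 3, and that
PAGE_BLOCK_MAX_ORDER doesn't cap things below order-13):

	HPAGE_PMD_ORDER    = 13 with 64K pages, so
	pageblock_nr_pages = 1 << 13 = 8192 pages = 512MB
	recommended_min    = 8192 * nr_zones * 2 +
			     8192 * nr_zones * MIGRATE_PCPTYPES * MIGRATE_PCPTYPES
			   = 8192 * 2 * 11 = 180224 pages ~= 11G

vs. 4K pages, where an order-9 pageblock gives 512 * 11 = 5632 pages, i.e.
roughly 22MB per populated zone.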
>
> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
> folio order enabled in set_recommended_min_free_kbytes.
> With this patch, when only 2M THP hugepage size is set to madvise for the
> same machine with 64K page size, with the rest of the sizes set to never,
> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
> respectively. When 512M THP hugepage size is set to madvise for the same
> machine with 64K page size, the min, low and high watermarks evaluate to
> 11.2G, 14G and 16.8G respectively, the same as without this patch.
Hmm, but what happens if a user changes this live - does this get updated?
OK I see it does via:
sysfs stuff -> enabled_store() -> start_stop_khugepaged() -> set_recommended_min_free_kbytes()
But don't we want to change this in general? Does somebody happening to
have 512MB THP at madvise or always suggest we want insane watermark
numbers?
I'm not really convinced by this 'dynamic' aspect, you're changing global
watermark numbers and reserves _massively_ based on a 'maybe' use of
something that's meant to be transparent + best-effort...
>
> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
> is not dynamic with hugepage size, will need different kernel builds for
> different hugepage sizes and most users won't know that this needs to be
> done as it can be difficult to determine that the performance and latency
> issues are coming from the high watermark values.
Or, we could adjust pageblock_order accordingly in this instance, no?
>
> All watermark numbers are for zones of nodes that had the highest number
> of pages, i.e. the value for min size for 4K is obtained using:
> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
> and for 64K using:
> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>
> An arbitrary min of 128 pages is used when no hugepage sizes are
> enabled.
I don't think it's really okay to out-and-out add an arbitrary value like this
without explanation. This is grounds for rejecting the patch already.
That seems a little low too, no?
IMPORTANT: I'd really like to see some before/after numbers for 4k, 16k,
64k with THP enabled/disabled so you can prove your patch isn't
fundamentally changing these values unexpectedly for users that aren't
using crazy page sizes.
>
> Signed-off-by: Usama Arif <usamaarif642@...il.com>
> ---
> include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
> mm/khugepaged.c | 32 ++++++++++++++++++++++++++++----
> mm/shmem.c | 29 +++++------------------------
> 3 files changed, 58 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..fb4e51ef0acb 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
> }
> #endif
>
> +/*
> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
> + *
> + * SHMEM_HUGE_NEVER:
> + * disables huge pages for the mount;
> + * SHMEM_HUGE_ALWAYS:
> + * enables huge pages for the mount;
> + * SHMEM_HUGE_WITHIN_SIZE:
> + * only allocate huge pages if the page will be fully within i_size,
> + * also respect madvise() hints;
> + * SHMEM_HUGE_ADVISE:
> + * only allocate huge pages if requested with madvise();
> + */
> +
> + #define SHMEM_HUGE_NEVER 0
> + #define SHMEM_HUGE_ALWAYS 1
> + #define SHMEM_HUGE_WITHIN_SIZE 2
> + #define SHMEM_HUGE_ADVISE 3
> +
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
> extern unsigned long transparent_hugepage_flags;
> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
> extern unsigned long huge_anon_orders_madvise;
> extern unsigned long huge_anon_orders_inherit;
>
> +extern int shmem_huge __read_mostly;
> +extern unsigned long huge_shmem_orders_always;
> +extern unsigned long huge_shmem_orders_madvise;
> +extern unsigned long huge_shmem_orders_inherit;
> +extern unsigned long huge_shmem_orders_within_size;
> +
Rather than exposing all of this shmem state as globals, can we not just have
shmem provide a function that grabs this information?
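Something along the lines of the below would do it - hypothetical name,
untested, declared from shmem and kept next to the rest of the shmem state so
nothing needs to become global:

/*
 * Hypothetical helper (name made up): report the union of shmem huge orders
 * that could currently be used, so callers like khugepaged don't need to see
 * shmem internals.
 */
unsigned long shmem_allowable_huge_orders_any(void)
{
	unsigned long orders = READ_ONCE(huge_shmem_orders_always) |
			       READ_ONCE(huge_shmem_orders_madvise) |
			       READ_ONCE(huge_shmem_orders_within_size);

	if (READ_ONCE(shmem_huge) != SHMEM_HUGE_NEVER)
		orders |= READ_ONCE(huge_shmem_orders_inherit);

	return orders;
}

Then thp_highest_allowable_order() just ORs that into the anon orders.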
> static inline bool hugepage_global_enabled(void)
> {
> return transparent_hugepage_flags &
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 15203ea7d007..e64cba74eb2a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
> return 0;
> }
>
> +static int thp_highest_allowable_order(void)
This absolutely needs a comment.
> +{
> + unsigned long orders = READ_ONCE(huge_anon_orders_always)
> + | READ_ONCE(huge_anon_orders_madvise)
> + | READ_ONCE(huge_shmem_orders_always)
> + | READ_ONCE(huge_shmem_orders_madvise)
> + | READ_ONCE(huge_shmem_orders_within_size);
Same comment as above, have shmem export this.
> + if (hugepage_global_enabled())
> + orders |= READ_ONCE(huge_anon_orders_inherit);
> + if (shmem_huge != SHMEM_HUGE_NEVER)
> + orders |= READ_ONCE(huge_shmem_orders_inherit);
> +
> + return orders == 0 ? 0 : fls(orders) - 1;
> +}
> +
> +static unsigned long min_thp_pageblock_nr_pages(void)
I really, really hate this name. This isn't the number of pageblock pages any
more, it's something else? You're not changing the page block size, right?
> +{
> + return (1UL << min(thp_highest_allowable_order(), PAGE_BLOCK_ORDER));
> +}
> +
> static void set_recommended_min_free_kbytes(void)
> {
> struct zone *zone;
> @@ -2638,12 +2658,16 @@ static void set_recommended_min_free_kbytes(void)
You provide a 'patchlet' in
https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
That also does:
/* Ensure 2 pageblocks are free to assist fragmentation avoidance */
- recommended_min = pageblock_nr_pages * nr_zones * 2;
+ recommended_min = min_thp_pageblock_nr_pages() * nr_zones * 2;
So, on the comment here - this comment is now incorrect: this isn't 2 page
blocks, it's 2 of 'a sub-pageblock size as if page blocks were dynamically
altered by the always/madvise THP size'.
Again, this whole thing strikes me as we're doing things at the wrong level
of abstraction.
And you're definitely now not helping avoid pageblock-sized
fragmentation. You're accepting that you need less so... why not reduce
pageblock size? :)
/*
* Make sure that on average at least two pageblocks are almost free
* of another type, one for a migratetype to fall back to and a
^ remainder of comment
> * second to avoid subsequent fallbacks of other types There are 3
> * MIGRATE_TYPES we care about.
> */
> - recommended_min += pageblock_nr_pages * nr_zones *
> + recommended_min += min_thp_pageblock_nr_pages() * nr_zones *
> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
This just seems wrong now and contradicts the comment - you're setting
minimum pages based on migrate PCP types that operate at pageblock order
but without reference to the actual number of page block pages?
So the comment is just wrong now? 'make sure there are at least two
pageblocks', well this isn't what you're doing, is it? So why are we making
reference to PCP counts etc. there?
This seems like we're essentially just tuning these numbers somewhat
arbitrarily to reduce them?
>
> - /* don't ever allow to reserve more than 5% of the lowmem */
> - recommended_min = min(recommended_min,
> - (unsigned long) nr_free_buffer_pages() / 20);
> + /*
> + * Don't ever allow to reserve more than 5% of the lowmem.
> + * Use a min of 128 pages when all THP orders are set to never.
Why? Did you just choose this number out of the blue?
Previously, on x86-64 with THP set to never for everything and an order-9
pageblock, wouldn't this be a much higher value?
I mean just putting '128' here is not acceptable. It needs to be justified
(even if empirically with data to back it) and defined as a named thing.
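Rough arithmetic to illustrate the gap, as far as I can tell (assuming order-9
pageblocks on x86-64 and MIGRATE_PCPTYPES == 3):

	old formula: 512 * nr_zones * (2 + 3 * 3) = 5632 pages per populated
		     zone, i.e. ~22MB each
	new floor with everything at never: 128 pages = 512KB total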
> + */
> + recommended_min = clamp(recommended_min, 128,
> + (unsigned long) nr_free_buffer_pages() / 20);
> +
> recommended_min <<= (PAGE_SHIFT-10);
>
> if (recommended_min > min_free_kbytes) {
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0c5fb4ffa03a..8e92678d1175 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -136,10 +136,10 @@ struct shmem_options {
> };
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -static unsigned long huge_shmem_orders_always __read_mostly;
> -static unsigned long huge_shmem_orders_madvise __read_mostly;
> -static unsigned long huge_shmem_orders_inherit __read_mostly;
> -static unsigned long huge_shmem_orders_within_size __read_mostly;
> +unsigned long huge_shmem_orders_always __read_mostly;
> +unsigned long huge_shmem_orders_madvise __read_mostly;
> +unsigned long huge_shmem_orders_inherit __read_mostly;
> +unsigned long huge_shmem_orders_within_size __read_mostly;
Again, we really shouldn't need to do this.
> static bool shmem_orders_configured __initdata;
> #endif
>
> @@ -516,25 +516,6 @@ static bool shmem_confirm_swap(struct address_space *mapping,
> return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
> }
>
> -/*
> - * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
> - *
> - * SHMEM_HUGE_NEVER:
> - * disables huge pages for the mount;
> - * SHMEM_HUGE_ALWAYS:
> - * enables huge pages for the mount;
> - * SHMEM_HUGE_WITHIN_SIZE:
> - * only allocate huge pages if the page will be fully within i_size,
> - * also respect madvise() hints;
> - * SHMEM_HUGE_ADVISE:
> - * only allocate huge pages if requested with madvise();
> - */
> -
> -#define SHMEM_HUGE_NEVER 0
> -#define SHMEM_HUGE_ALWAYS 1
> -#define SHMEM_HUGE_WITHIN_SIZE 2
> -#define SHMEM_HUGE_ADVISE 3
> -
Again, we really shouldn't need to do this - just provide some function from
shmem that gives you what you need.
> /*
> * Special values.
> * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
> @@ -551,7 +532,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> /* ifdef here to avoid bloating shmem.o when not necessary */
>
> -static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
> +int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
Same comment.
> static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
>
> /**
> --
> 2.47.1
>