Message-ID: <5da7d1a7-1d89-4efc-a50e-bdb66c3e3659@lucifer.local>
Date: Mon, 9 Jun 2025 15:57:59 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: ziy@...dia.com, Andrew Morton <akpm@...ux-foundation.org>,
david@...hat.com, linux-mm@...ck.org, hannes@...xchg.org,
shakeel.butt@...ux.dev, riel@...riel.com,
baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com,
npache@...hat.com, ryan.roberts@....com, dev.jain@....com,
hughd@...gle.com, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, kernel-team@...a.com
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for
min_free_kbytes
On Mon, Jun 09, 2025 at 01:07:42PM +0100, Usama Arif wrote:
>
>
> On 07/06/2025 09:18, Lorenzo Stoakes wrote:
> > It's important to base against mm-new for new mm stuff - PAGE_BLOCK_ORDER got
> > renamed to PAGE_BLOCK_MAX_ORDER in Zi's series at [0], and this doesn't compile.
> >
> > Please always do a quick rebase + compile check before sending.
> >
> > [0]: https://lkml.kernel.org/r/20250604211427.1590859-1-ziy@nvidia.com
> >
> > Overall this seems to me to be implemented at the wrong level of
> > abstraction - we implement set_recommended_min_free_kbytes() to interact
> > with the page block mechanism.
> >
> > While the problem you describe is absolutely a problem and we need to
> > figure out a way to avoid reserving ridiculous amounts of memory for higher
> > page tables, we surely need to figure this out at a page block granularity
> > don't we?
> >
>
> Yes, agreed. Zi raised a good point in [1], and I think there is no reason to do
> this just to lower the watermarks; it should be done at pageblock order, so that
> defrag also happens at 2M and not 512M in the example given in the commit message.
>
> [1] https://lore.kernel.org/all/c600a6c0-aa59-4896-9e0d-3649a32d1771@gmail.com/
>
Yeah right exactly.
These things are heuristic anyway, and so I think it's fine to add an
additional heuristic of 'well people are unlikely to use 512 MB PMD sized
huge pages in practice on 64 KB page size systems'.
And obviously you have rightly raised that - in practice - this is really
just causing an issue rather than serving the needs of users.
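For anyone following along, the back-of-the-envelope arithmetic that bites
here is (from set_recommended_min_free_kbytes(), assuming MIGRATE_PCPTYPES
== 3 and roughly two populated zones on the box described below):

	recommended_min  = pageblock_nr_pages * nr_zones * 2;
	recommended_min += pageblock_nr_pages * nr_zones *
			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;

On 64 KB arm64, HPAGE_PMD_ORDER is 13, so pageblock_nr_pages is 1 << 13
(512 MB) and that works out to roughly 8192 * 2 * 11 pages ~= 11 GB - the
same ballpark as the 11.2G min in the commit message below. On 4 KB the
pageblock is order-9 (2 MB), giving ~45 MB, which on a box this size never
beats the sqrt-based calculate_min_free_kbytes() value and so is invisible.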
> > On Fri, Jun 06, 2025 at 03:37:00PM +0100, Usama Arif wrote:
> >> On arm64 machines with 64K PAGE_SIZE, min_free_kbytes and hence the
> >> watermarks are evaluated to extremely high values. For example, on a server
> >> with 480G of memory, with only the 2M mTHP hugepage size set to madvise and
> >> the rest of the sizes set to never, the min, low and high watermarks evaluate
> >> to 11.2G, 14G and 16.8G respectively.
> >> In contrast, for 4K PAGE_SIZE on the same machine, with only the 2M THP
> >> hugepage size set to madvise, the min, low and high watermarks evaluate to
> >> 86M, 566M and 1G respectively.
> >> This is because set_recommended_min_free_kbytes is designed for PMD
> >> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
> >
> > Right, it is, but it's not really this line - the _pageblock order_ is set to
> > the minimum of the huge page PMD order and PAGE_BLOCK_MAX_ORDER, as it makes
> > sense to use page block heuristics to reduce the odds of fragmentation, and
> > so that we can grab a PMD huge page at a time.
> >
> > Obviously if the user wants to set a _smaller_ page block order they can,
> > but if it's larger we want to heuristically avoid fragmentation of
> > physically contiguous huge page aligned ranges (the whole page block
> > mechanism).
> >
> > I absolutely hate how set_recommended_min_free_kbytes() has basically
> > hacked in some THP considerations but otherwise invokes
> > calculate_min_free_kbytes()... ugh. But an existing issue.
> >
> >> Such high watermark values can cause performance and latency issues in
> >> memory-bound applications on arm servers that use 64K PAGE_SIZE, even though
> >> most of them would never actually use a 512M PMD THP.
> >
> > 512MB, yeah crazy. We've not thought this through, and this is a very real
> > issue.
> >
> > Again, it strikes me that we should be changing the page block order for 64
> > KB arm64 rather than this calculation though.
> >
>
> yes agreed. I think changing pageblock_order is the right approach.
Thanks, yes!
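Not saying this is the final form, but just to sketch the direction I mean
(entirely untested, the macro shape is made up, and whether 2 MB is the
right cap is precisely the discussion to have):

	/* sketch: cap the pageblock at 2 MB worth of base pages, not PMD size */
	#define pageblock_order						\
		min_t(unsigned int, PAGE_BLOCK_MAX_ORDER,		\
		      (unsigned int)ilog2(SZ_2M >> PAGE_SHIFT))

On 4 KB that should still come out at order 9, so nothing changes there,
while on 64 KB it drops to order 5 (2 MB), which is the '2M not 512M'
defrag granularity you mention above. The open question is what that does
to the PMD THP allocation success rate on 64 KB systems.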
>
> > Keep in mind pageblocks are a heuristic mechanism designed to reduce
> > fragmentation, the decision could be made to cap how far we're willing to
> > go with that...
> >
> >>
> >> Instead of using HPAGE_PMD_ORDER for pageblock_order, use the highest enabled
> >> large folio order in set_recommended_min_free_kbytes.
> >> With this patch, when only 2M THP hugepage size is set to madvise for the
> >> same machine with 64K page size, with the rest of the sizes set to never,
> >> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
> >> respectively. When 512M THP hugepage size is set to madvise for the same
> >> machine with 64K page size, the min, low and high watermarks evaluate to
> >> 11.2G, 14G and 16.8G respectively, the same as without this patch.
> >
> > Hmm, but what happens if a user changes this live, does this get updated?
> >
> > OK I see it does via:
> >
> > sysfs stuff -> enabled_store() -> start_stop_khugepaged() -> set_recommended_min_free_kbytes()
> >
> > But don't we want to change this in general? Does somebody happening to
> > have 512MB THP at madvise or always suggest we want insane watermark
> > numbers?
>
> Unfortunately the answer to this is that probably a lot of servers that use 64K
> page size do. You can see in [1] that if someone hasn't actually configured
> the hugepage sizes via the kernel command line, and the global policy is set
> to madvise or always, then 512M inherits madvise/always and they would
> have a very high watermark set. I don't think this behaviour is what most
> people are expecting.
Right, they will by default have this. They are silly to do so, and we are
silly to let them, but I think we've all agreed that the THP interface is
flawed, so I won't belabour the point :)
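Concretely (sysfs name from memory, so take with a pinch of salt): on a
64 KB box /sys/kernel/mm/transparent_hugepage/hugepages-524288kB/enabled
defaults to 'inherit', so the moment the global enabled knob is set to
madvise or always, the 512 MB order is armed too, and the watermarks jump
with it.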
But again, the 'T' in THP is transparent :) It's _trying_ to provide
PMD-sized folios, if it can.
But if it can't, then it can't, and that's OK.
>
> I actually think [1] should be wrapped in ifndef CONFIG_PAGE_SIZE_64KB,
> but it's always been the case that PMD is set to inherit, so it probably
> shouldn't be wrapped.
>
> [1] https://elixir.bootlin.com/linux/v6.15.1/source/mm/huge_memory.c#L782
IDEALLY I'd rather not - we should be figuring out how to do this scalably,
not relying on 'well, if this page size, or if that page size'; it's kind
of a slippery slope.
Plus I think users will naturally assume the 'PMD sized' behaviour is
consistent regardless of base page size.
I mean, I don't LOVE (understatement) this 'watermark calculation is
different if PMD-sized' stuff in general. That rather smacks of unscalable
hard-coding on its own.
>
> >
> > I'm not really convinced by this 'dynamic' aspect, you're changing global
> > watermark numbers and reserves _massively_ based on a 'maybe' use of
> > something that's meant to be transparent + best-effort...
> >
>
> If someone sets 512M to madvise/always, it brings back the watermarks to
> the levels without this patch.
Yeah, sorry - see my follow-up on this; I was just mistaken, things are
already dynamic like this.
Yuck, this interface.
>
> >>
> >> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
> >> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
> >> is not dynamic with hugepage size, would need different kernel builds for
> >> different hugepage sizes, and most users won't know that this needs to be
> >> done, as it can be difficult to determine that the performance and latency
> >> issues are coming from the high watermark values.
> >
> > Or, we could adjust pageblock_order accordingly in this instance no?
> >
> >>
> >> All watermark numbers are for zones of nodes that had the highest number
> >> of pages, i.e. the value for min size for 4K is obtained using:
> >> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
> >> and for 64K using:
> >> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
> >>
> >> An arbitrary min of 128 pages is used when no hugepage sizes are
> >> enabled.
> >
> > I don't think it's really okay to out-and-out add an arbitrary value like this
> > without explanation. This alone is grounds for rejecting the patch.
> >
>
> I just took 128 from calculate_min_free_kbytes, although I realize now that over
> there it's 128 kB, but here it will mean 128 pages = 128*64kB.
>
> I think maybe a better number is sqrt(lowmem_kbytes * 16) from calculate_min_free_kbytes.
>
> I can't see in the git history how 128 and the sqrt number were chosen in calculate_min_free_kbytes.
Yeah, the key thing is to provide a justification - hard-coded numbers like
this are scary, not least because people don't know why they are what they
are but are scared to change them :P
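FWIW the sqrt heuristic lives in calculate_min_free_kbytes() in
mm/page_alloc.c and does, roughly (paraphrasing, so double-check the exact
clamp bounds):

	lowmem_kbytes   = nr_free_buffer_pages() * (PAGE_SIZE >> 10);
	min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
	min_free_kbytes = clamp(min_free_kbytes, 128, 262144);

So the 128 there really is 128 *kilobytes*, and for ~480G of lowmem the
sqrt gives something like 90 MB, which is where the 86M figure in your
commit message comes from. Any new constant in khugepaged needs its own
story - it can't just borrow that one.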
>
> > That seems a little low too no?
> >
> > IMPORTANT: I'd really like to see some before/after numbers for 4k, 16k,
> > 64k with THP enabled/disabled so you can prove your patch isn't
> > fundamentally changing these values unexpectedly for users that aren't
> > using crazy page sizes.
> >
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@...il.com>
> >> ---
> >> include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
> >> mm/khugepaged.c | 32 ++++++++++++++++++++++++++++----
> >> mm/shmem.c | 29 +++++------------------------
> >> 3 files changed, 58 insertions(+), 28 deletions(-)
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 2f190c90192d..fb4e51ef0acb 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
> >> }
> >> #endif
> >>
> >> +/*
> >> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
> >> + *
> >> + * SHMEM_HUGE_NEVER:
> >> + * disables huge pages for the mount;
> >> + * SHMEM_HUGE_ALWAYS:
> >> + * enables huge pages for the mount;
> >> + * SHMEM_HUGE_WITHIN_SIZE:
> >> + * only allocate huge pages if the page will be fully within i_size,
> >> + * also respect madvise() hints;
> >> + * SHMEM_HUGE_ADVISE:
> >> + * only allocate huge pages if requested with madvise();
> >> + */
> >> +
> >> + #define SHMEM_HUGE_NEVER 0
> >> + #define SHMEM_HUGE_ALWAYS 1
> >> + #define SHMEM_HUGE_WITHIN_SIZE 2
> >> + #define SHMEM_HUGE_ADVISE 3
> >> +
> >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>
> >> extern unsigned long transparent_hugepage_flags;
> >> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
> >> extern unsigned long huge_anon_orders_madvise;
> >> extern unsigned long huge_anon_orders_inherit;
> >>
> >> +extern int shmem_huge __read_mostly;
> >> +extern unsigned long huge_shmem_orders_always;
> >> +extern unsigned long huge_shmem_orders_madvise;
> >> +extern unsigned long huge_shmem_orders_inherit;
> >> +extern unsigned long huge_shmem_orders_within_size;
> >> +
> >
> > Rather than exposing all of this shmem state as globals, can we not just have
> > shmem provide a function that grabs this information?
> >
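To expand on what I mean here - something like the below (untested sketch,
name invented), keeping all the shmem state static and letting khugepaged
ask for the combined shmem order mask instead:

	/* mm/shmem.c */
	unsigned long shmem_allowable_huge_orders_global(void)
	{
		unsigned long orders = READ_ONCE(huge_shmem_orders_always) |
				       READ_ONCE(huge_shmem_orders_madvise) |
				       READ_ONCE(huge_shmem_orders_within_size);

		if (READ_ONCE(shmem_huge) != SHMEM_HUGE_NEVER)
			orders |= READ_ONCE(huge_shmem_orders_inherit);

		return orders;
	}

Then khugepaged only needs a declaration (plus a stub returning 0 for
builds without shmem/THP) rather than five new globals and the
SHMEM_HUGE_* definitions moving out of mm/shmem.c.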
> >> static inline bool hugepage_global_enabled(void)
> >> {
> >> return transparent_hugepage_flags &
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 15203ea7d007..e64cba74eb2a 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
> >> return 0;
> >> }
> >>
> >
> >> +static int thp_highest_allowable_order(void)
> >
> > This absolutely needs a comment.
> >
> >> +{
> >> + unsigned long orders = READ_ONCE(huge_anon_orders_always)
> >> + | READ_ONCE(huge_anon_orders_madvise)
> >> + | READ_ONCE(huge_shmem_orders_always)
> >> + | READ_ONCE(huge_shmem_orders_madvise)
> >> + | READ_ONCE(huge_shmem_orders_within_size);
> >
> > Same comment as above, have shmem export this.
> >
> >> + if (hugepage_global_enabled())
> >> + orders |= READ_ONCE(huge_anon_orders_inherit);
> >> + if (shmem_huge != SHMEM_HUGE_NEVER)
> >> + orders |= READ_ONCE(huge_shmem_orders_inherit);
> >> +
> >> + return orders == 0 ? 0 : fls(orders) - 1;
> >> +}
> >> +
> >> +static unsigned long min_thp_pageblock_nr_pages(void)
> >
> > I really, really hate this name. This isn't the number of pageblock pages any
> > more, it's something else? You're not changing the page block size, right?
> >
>
> I dont like it either :)
>
> As I mentioned in reply to David now in [1], pageblock_nr_pages is not really
> 1 << PAGE_BLOCK_ORDER but is 1 << min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) when
> THP is enabled.
Yuuuuck.
>
> It needs a better name, but I think the right approach is just to change
> pageblock_order as recommended in [2]
>
> [1] https://lore.kernel.org/all/4adf1f8b-781d-4ab0-b82e-49795ad712cb@gmail.com/
Right yeah (I assume you typo'd 2 and you mean the 1 above? :P).
>
> >> +{
> >> + return (1UL << min(thp_highest_allowable_order(), PAGE_BLOCK_ORDER));
> >> +}
> >> +
> >> static void set_recommended_min_free_kbytes(void)
> >> {
> >> struct zone *zone;
> >> @@ -2638,12 +2658,16 @@ static void set_recommended_min_free_kbytes(void)
> >
> > You provide a 'patchlet' in
> > https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
> >
> > That also does:
> >
> > /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
> > - recommended_min = pageblock_nr_pages * nr_zones * 2;
> > + recommended_min = min_thp_pageblock_nr_pages() * nr_zones * 2;
> >
> > So comment here - this comment is now incorrect, this isn't 2 page blocks,
> > it's 2 of 'sub-pageblock size as if page blocks were dynamically altered by
> > always/madvise THP size'.
> >
> > Again, this whole thing strikes me as we're doing things at the wrong level
> > of abstraction.
> >
> > And you're definitely now not helping avoid pageblock-sized
> > fragmentation. You're accepting that you need less so... why not reduce
> > pageblock size? :)
> >
> > /*
> > * Make sure that on average at least two pageblocks are almost free
> > * of another type, one for a migratetype to fall back to and a
> >
> > ^ remainder of comment
> >
> >> * second to avoid subsequent fallbacks of other types There are 3
> >> * MIGRATE_TYPES we care about.
> >> */
> >> - recommended_min += pageblock_nr_pages * nr_zones *
> >> + recommended_min += min_thp_pageblock_nr_pages() * nr_zones *
> >> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
> >
> > This just seems wrong now and contradicts the comment - you're setting
> > minimum pages based on migrate PCP types that operate at pageblock order
> > but without reference to the actual number of page block pages?
> >
> > So the comment is just wrong now? 'Make sure there are at least two
> > pageblocks' - well, this isn't what you're doing, is it? So why are we
> > making reference to PCP counts etc.?
> >
> > This seems like we're essentially just tuning these numbers somewhat
> > arbitrarily to reduce them?
> >
> >>
> >> - /* don't ever allow to reserve more than 5% of the lowmem */
> >> - recommended_min = min(recommended_min,
> >> - (unsigned long) nr_free_buffer_pages() / 20);
> >> + /*
> >> + * Don't ever allow to reserve more than 5% of the lowmem.
> >> + * Use a min of 128 pages when all THP orders are set to never.
> >
> > Why? Did you just choose this number out of the blue?
> >
> > Previously, on x86-64 with thp -> never on everything and an order-9
> > pageblock, wouldn't this be a much higher value?
> >
> > I mean just putting '128' here is not acceptable. It needs to be justified
> > (even if empirically with data to back it) and defined as a named thing.
> >
> >
> >> + */
> >> + recommended_min = clamp(recommended_min, 128,
> >> + (unsigned long) nr_free_buffer_pages() / 20);
> >> +
> >> recommended_min <<= (PAGE_SHIFT-10);
> >>
> >> if (recommended_min > min_free_kbytes) {
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index 0c5fb4ffa03a..8e92678d1175 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -136,10 +136,10 @@ struct shmem_options {
> >> };
> >>
> >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> -static unsigned long huge_shmem_orders_always __read_mostly;
> >> -static unsigned long huge_shmem_orders_madvise __read_mostly;
> >> -static unsigned long huge_shmem_orders_inherit __read_mostly;
> >> -static unsigned long huge_shmem_orders_within_size __read_mostly;
> >> +unsigned long huge_shmem_orders_always __read_mostly;
> >> +unsigned long huge_shmem_orders_madvise __read_mostly;
> >> +unsigned long huge_shmem_orders_inherit __read_mostly;
> >> +unsigned long huge_shmem_orders_within_size __read_mostly;
> >
> > Again, we really shouldn't need to do this.
> >
> >> static bool shmem_orders_configured __initdata;
> >> #endif
> >>
> >> @@ -516,25 +516,6 @@ static bool shmem_confirm_swap(struct address_space *mapping,
> >> return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
> >> }
> >>
> >> -/*
> >> - * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
> >> - *
> >> - * SHMEM_HUGE_NEVER:
> >> - * disables huge pages for the mount;
> >> - * SHMEM_HUGE_ALWAYS:
> >> - * enables huge pages for the mount;
> >> - * SHMEM_HUGE_WITHIN_SIZE:
> >> - * only allocate huge pages if the page will be fully within i_size,
> >> - * also respect madvise() hints;
> >> - * SHMEM_HUGE_ADVISE:
> >> - * only allocate huge pages if requested with madvise();
> >> - */
> >> -
> >> -#define SHMEM_HUGE_NEVER 0
> >> -#define SHMEM_HUGE_ALWAYS 1
> >> -#define SHMEM_HUGE_WITHIN_SIZE 2
> >> -#define SHMEM_HUGE_ADVISE 3
> >> -
> >
> > Again we really shouldn't need to do this, just provide some function from
> > shmem that gives you what you need.
> >
> >> /*
> >> * Special values.
> >> * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
> >> @@ -551,7 +532,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
> >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> /* ifdef here to avoid bloating shmem.o when not necessary */
> >>
> >> -static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
> >> +int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
> >
> > Same comment.
> >
> >> static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
> >>
> >> /**
> >> --
> >> 2.47.1
> >>
>