Message-ID: <4fd6ac70-a2a6-4876-835f-73ba4475ee17@lucifer.local>
Date: Mon, 9 Jun 2025 15:58:54 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: ziy@...dia.com, Andrew Morton <akpm@...ux-foundation.org>,
        david@...hat.com, linux-mm@...ck.org, hannes@...xchg.org,
        shakeel.butt@...ux.dev, riel@...riel.com,
        baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com,
        npache@...hat.com, ryan.roberts@....com, dev.jain@....com,
        hughd@...gle.com, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org, kernel-team@...a.com
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for
 min_free_kbytes

On Mon, Jun 09, 2025 at 01:12:25PM +0100, Usama Arif wrote:
>
> > I don't like it either :)
> >
>
> Pressed "Ctrl+enter" instead of "enter" by mistake, which sent the email prematurely :)
> Adding replies to the rest of the comments in this email.

We've all been there :)

>
> As I mentioned in reply to David now in [1], pageblock_nr_pages is not really
> 1 << PAGE_BLOCK_ORDER but is 1 << min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) when
> THP is enabled.
>
> It needs a better name, but I think the right approach is just to change
> pageblock_order as recommended in [2]
>
> [1] https://lore.kernel.org/all/4adf1f8b-781d-4ab0-b82e-49795ad712cb@gmail.com/
> [2] https://lore.kernel.org/all/c600a6c0-aa59-4896-9e0d-3649a32d1771@gmail.com/
>
Replied there.
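
(For anyone skimming, the relationship being described above is roughly the
following - paraphrasing rather than quoting the actual header:)

	/* Not actual kernel source, just restating the point above. */
	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	#define pageblock_order		min_t(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
	#else
	#define pageblock_order		PAGE_BLOCK_ORDER
	#endif
	#define pageblock_nr_pages	(1UL << pageblock_order)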

>
> >
> >>> +{
> >>> +	return (1UL << min(thp_highest_allowable_order(), PAGE_BLOCK_ORDER));
> >>> +}
> >>> +
> >>>  static void set_recommended_min_free_kbytes(void)
> >>>  {
> >>>  	struct zone *zone;
> >>> @@ -2638,12 +2658,16 @@ static void set_recommended_min_free_kbytes(void)
> >>
> >> You provide a 'patchlet' in
> >> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
> >>
> >> That also does:
> >>
> >>         /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
> >> -       recommended_min = pageblock_nr_pages * nr_zones * 2;
> >> +       recommended_min = min_thp_pageblock_nr_pages() * nr_zones * 2;
> >>
> >> So, the comment here is now incorrect - this isn't 2 pageblocks, it's 2 of
> >> 'sub-pageblock size, as if pageblocks were dynamically altered by the
> >> always/madvise THP size'.
> >>
> >> Again, this whole thing strikes me as we're doing things at the wrong level
> >> of abstraction.
> >>
> >> And you're definitely now not helping avoid pageblock-sized
> >> fragmentation. You're accepting that you need less so... why not reduce
> >> pageblock size? :)
> >>
>
> Yes agreed.

:)
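
(To put rough numbers on it: with 4K pages and pageblock order 9,
pageblock_nr_pages is 512, so the existing code reserves 512 * nr_zones * 2
pages in the line quoted above, plus 512 * nr_zones * 3 * 3 in the
MIGRATE_PCPTYPES line further down. If the largest always/madvise THP order
were, say, 4, min_thp_pageblock_nr_pages() would shrink each 512 to 16 - which
is exactly the 'you're accepting you need less' point.)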

>
> >> 	/*
> >> 	 * Make sure that on average at least two pageblocks are almost free
> >> 	 * of another type, one for a migratetype to fall back to and a
> >>
> >> ^ remainder of comment
> >>
> >>>  	 * second to avoid subsequent fallbacks of other types There are 3
> >>>  	 * MIGRATE_TYPES we care about.
> >>>  	 */
> >>> -	recommended_min += pageblock_nr_pages * nr_zones *
> >>> +	recommended_min += min_thp_pageblock_nr_pages() * nr_zones *
> >>>  			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
> >>
> >> This just seems wrong now and contradicts the comment - you're setting
> >> minimum pages based on migrate PCP types that operate at pageblock order
> >> but without reference to the actual number of page block pages?
> >>
> >> So the comment is just wrong now? 'Make sure there are at least two
> >> pageblocks' - well, this isn't what you're doing, is it? So why then are we
> >> making reference to PCP counts etc.?
> >>
> >> This seems like we're essentially just tuning these numbers somewhat
> >> arbitrarily to reduce them?
> >>
> >>>
> >>> -	/* don't ever allow to reserve more than 5% of the lowmem */
> >>> -	recommended_min = min(recommended_min,
> >>> -			      (unsigned long) nr_free_buffer_pages() / 20);
> >>> +	/*
> >>> +	 * Don't ever allow to reserve more than 5% of the lowmem.
> >>> +	 * Use a min of 128 pages when all THP orders are set to never.
> >>
> >> Why? Did you just choose this number out of the blue?
>
>
> Mentioned this in the previous comment.

Ack
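
(And on the '128' quoted below - whatever floor is ultimately justified, it
would at least want to be a named constant rather than a magic number, e.g.,
with a purely illustrative name and placement:)

	/* Floor, in pages, applied when every THP order is set to 'never'. */
	#define RECOMMENDED_MIN_FLOOR_PAGES	128UL

	recommended_min = clamp(recommended_min, RECOMMENDED_MIN_FLOOR_PAGES,
				(unsigned long)nr_free_buffer_pages() / 20);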

> >>
> >> Previously, on x86-64 with thp -> never on everything and a pageblock order
> >> of 9, wouldn't this be a much higher value?
> >>
> >> I mean just putting '128' here is not acceptable. It needs to be justified
> >> (even if empirically with data to back it) and defined as a named thing.
> >>
> >>
> >>> +	 */
> >>> +	recommended_min = clamp(recommended_min, 128,
> >>> +				(unsigned long) nr_free_buffer_pages() / 20);
> >>> +
> >>>  	recommended_min <<= (PAGE_SHIFT-10);
> >>>
> >>>  	if (recommended_min > min_free_kbytes) {
> >>> diff --git a/mm/shmem.c b/mm/shmem.c
> >>> index 0c5fb4ffa03a..8e92678d1175 100644
> >>> --- a/mm/shmem.c
> >>> +++ b/mm/shmem.c
> >>> @@ -136,10 +136,10 @@ struct shmem_options {
> >>>  };
> >>>
> >>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> -static unsigned long huge_shmem_orders_always __read_mostly;
> >>> -static unsigned long huge_shmem_orders_madvise __read_mostly;
> >>> -static unsigned long huge_shmem_orders_inherit __read_mostly;
> >>> -static unsigned long huge_shmem_orders_within_size __read_mostly;
> >>> +unsigned long huge_shmem_orders_always __read_mostly;
> >>> +unsigned long huge_shmem_orders_madvise __read_mostly;
> >>> +unsigned long huge_shmem_orders_inherit __read_mostly;
> >>> +unsigned long huge_shmem_orders_within_size __read_mostly;
> >>
> >> Again, we really shouldn't need to do this.
>
> Agreed - for the RFC I just did it similar to the anon ones when I got the build
> error trying to use these, but yeah, a much better approach would be to just have
> a function in shmem that returns the largest allowable shmem THP order.

Ack, yeah it's fiddly but would be better this way.
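
Something like this, perhaps (entirely untested, name invented, just to sketch
the shape - the real thing would want a declaration in shmem_fs.h or an
internal header, behind the usual #ifdef):

	/*
	 * Highest shmem THP order any of the per-size sysfs knobs currently
	 * allows, or 0 if everything is set to 'never'. Treating 'inherit'
	 * as allowed here is a simplification.
	 */
	unsigned int shmem_highest_allowable_order(void)
	{
		unsigned long orders = huge_shmem_orders_always |
				       huge_shmem_orders_madvise |
				       huge_shmem_orders_within_size |
				       huge_shmem_orders_inherit;

		return orders ? ilog2(orders) : 0;
	}

Then khugepaged can call that instead of poking at shmem's internals directly.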

>
>
> >>
> >>>  static bool shmem_orders_configured __initdata;
> >>>  #endif
> >>>
> >>> @@ -516,25 +516,6 @@ static bool shmem_confirm_swap(struct address_space *mapping,
> >>>  	return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
> >>>  }
> >>>
> >>> -/*
> >>> - * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
> >>> - *
> >>> - * SHMEM_HUGE_NEVER:
> >>> - *	disables huge pages for the mount;
> >>> - * SHMEM_HUGE_ALWAYS:
> >>> - *	enables huge pages for the mount;
> >>> - * SHMEM_HUGE_WITHIN_SIZE:
> >>> - *	only allocate huge pages if the page will be fully within i_size,
> >>> - *	also respect madvise() hints;
> >>> - * SHMEM_HUGE_ADVISE:
> >>> - *	only allocate huge pages if requested with madvise();
> >>> - */
> >>> -
> >>> -#define SHMEM_HUGE_NEVER	0
> >>> -#define SHMEM_HUGE_ALWAYS	1
> >>> -#define SHMEM_HUGE_WITHIN_SIZE	2
> >>> -#define SHMEM_HUGE_ADVISE	3
> >>> -
> >>
> >> Again we really shouldn't need to do this, just provide some function from
> >> shmem that gives you what you need.
> >>
> >>>  /*
> >>>   * Special values.
> >>>   * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
> >>> @@ -551,7 +532,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
> >>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>  /* ifdef here to avoid bloating shmem.o when not necessary */
> >>>
> >>> -static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
> >>> +int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
> >>
> >> Same comment.
> >>
> >>>  static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
> >>>
> >>>  /**
> >>> --
> >>> 2.47.1
> >>>
> >
>
