linux-kernel - Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free

Message-ID: <4adf1f8b-781d-4ab0-b82e-49795ad712cb@gmail.com>
Date: Mon, 9 Jun 2025 12:34:25 +0100
From: Usama Arif <usamaarif642@...il.com>
To: David Hildenbrand <david@...hat.com>,
 Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org
Cc: hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
 ziy@...dia.com, baolin.wang@...ux.alibaba.com, lorenzo.stoakes@...cle.com,
 Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
 dev.jain@....com, hughd@...gle.com, linux-kernel@...r.kernel.org,
 linux-doc@...r.kernel.org, kernel-team@...a.com,
 Matthew Wilcox <willy@...radead.org>
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for
 min_free_kbytes



On 06/06/2025 18:37, David Hildenbrand wrote:
> On 06.06.25 16:37, Usama Arif wrote:
>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>> watermarks are evaluated to extremely high values, e.g. for a server with
>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>> of the sizes set to never, the min, low and high watermarks evaluate to
>> 11.2G, 14G and 16.8G respectively.
>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>> and 1G respectively.
>> This is because set_recommended_min_free_kbytes is designed for PMD
>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>> Such high watermark values can cause performance and latency issues in
>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
>> most of them would never actually use a 512M PMD THP.
>>
>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>> folio order enabled in set_recommended_min_free_kbytes.
>> With this patch, when only 2M THP hugepage size is set to madvise for the
>> same machine with 64K page size, with the rest of the sizes set to never,
>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>> respectively. When 512M THP hugepage size is set to madvise for the same
>> machine with 64K page size, the min, low and high watermarks evaluate to
>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>
>> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
>> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
>> is not dynamic with hugepage size, will need different kernel builds for
>> different hugepage sizes and most users won't know that this needs to be
>> done as it can be difficult to determine that the performance and latency
>> issues are coming from the high watermark values.
>>
>> All watermark numbers are for zones of nodes that had the highest number
>> of pages, i.e. the value for min size for 4K is obtained using:
>> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
>> and for 64K using:
>> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>>
>> An arbitrary min of 128 pages is used when no hugepage sizes are
>> enabled.
>>
>> Signed-off-by: Usama Arif <usamaarif642@...il.com>
>> ---
>>   include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>>   mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
>>   mm/shmem.c              | 29 +++++------------------------
>>   3 files changed, 58 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..fb4e51ef0acb 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>>   }
>>   #endif
>>   +/*
>> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>> + *
>> + * SHMEM_HUGE_NEVER:
>> + *    disables huge pages for the mount;
>> + * SHMEM_HUGE_ALWAYS:
>> + *    enables huge pages for the mount;
>> + * SHMEM_HUGE_WITHIN_SIZE:
>> + *    only allocate huge pages if the page will be fully within i_size,
>> + *    also respect madvise() hints;
>> + * SHMEM_HUGE_ADVISE:
>> + *    only allocate huge pages if requested with madvise();
>> + */
>> +
>> +#define SHMEM_HUGE_NEVER    0
>> +#define SHMEM_HUGE_ALWAYS    1
>> +#define SHMEM_HUGE_WITHIN_SIZE    2
>> +#define SHMEM_HUGE_ADVISE    3
>> +
>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>     extern unsigned long transparent_hugepage_flags;
>> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
>>   extern unsigned long huge_anon_orders_madvise;
>>   extern unsigned long huge_anon_orders_inherit;
>>   +extern int shmem_huge __read_mostly;
>> +extern unsigned long huge_shmem_orders_always;
>> +extern unsigned long huge_shmem_orders_madvise;
>> +extern unsigned long huge_shmem_orders_inherit;
>> +extern unsigned long huge_shmem_orders_within_size;
> 
> Do really all of these have to be exported?
> 

Hi David,

Thanks for the review!

For the RFC, I just exported them the same way as the anon ones when I hit a
build error trying to use them, but yes, a much better approach would be to
have a function in shmem that returns the largest allowable shmem THP order.
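
A minimal userspace sketch of what such a helper could compute, so that the
individual order masks need not be exported. The names fls_ulong() and
largest_enabled_order() are hypothetical and not taken from the patch; the
masks are passed as plain parameters instead of being read from kernel
globals, and fls_ulong() is only a stand-in for the kernel's fls().

```c
/*
 * Userspace stand-in for the kernel's fls(): returns the 1-based index
 * of the most significant set bit, or 0 if no bit is set.
 */
static int fls_ulong(unsigned long x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/*
 * Hypothetical helper: OR together the per-policy order bitmasks (bit N
 * set means order N is enabled) and return the largest enabled order.
 * The "inherit" mask only counts when the global setting is enabled.
 * Returns 0 when nothing is enabled, mirroring the RFC's
 * "orders == 0 ? 0 : fls(orders) - 1".
 */
static int largest_enabled_order(unsigned long always,
				 unsigned long madvise,
				 unsigned long within_size,
				 unsigned long inherit,
				 int global_enabled)
{
	unsigned long orders = always | madvise | within_size;

	if (global_enabled)
		orders |= inherit;

	return orders ? fls_ulong(orders) - 1 : 0;
}
```

With 64K base pages, a 2M folio is order 5, so a madvise mask of 1UL << 5
with everything else clear gives 5, while a 512M folio (order 13) gives 13.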

>> +
>>   static inline bool hugepage_global_enabled(void)
>>   {
>>       return transparent_hugepage_flags &
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 15203ea7d007..e64cba74eb2a 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
>>       return 0;
>>   }
>>   +static int thp_highest_allowable_order(void)
> 
> Did you mean "largest" ?

Yes

> 
>> +{
>> +    unsigned long orders = READ_ONCE(huge_anon_orders_always)
>> +                   | READ_ONCE(huge_anon_orders_madvise)
>> +                   | READ_ONCE(huge_shmem_orders_always)
>> +                   | READ_ONCE(huge_shmem_orders_madvise)
>> +                   | READ_ONCE(huge_shmem_orders_within_size);
>> +    if (hugepage_global_enabled())
>> +        orders |= READ_ONCE(huge_anon_orders_inherit);
>> +    if (shmem_huge != SHMEM_HUGE_NEVER)
>> +        orders |= READ_ONCE(huge_shmem_orders_inherit);
>> +
>> +    return orders == 0 ? 0 : fls(orders) - 1;
>> +}
> 
> But how does this interact with large folios / THPs in the page cache?
> 

Yes this will be a problem.

From what I see, there doesn't seem to be a max order for the page cache, only
mapping_set_folio_min_order for the min.
Does this mean that the page cache can fault in 128M, 256M, 512M large folios?

I think this could increase the OOM rate significantly when ARM64 servers
are used with filesystems that support large folios.

Should there be an upper limit for the page cache? If so, it would either be a
new sysfs entry (which I don't like :( ) or we could try to reuse the existing
entries with something like thp_highest_allowable_order?
 

>> +
>> +static unsigned long min_thp_pageblock_nr_pages(void)
> 
> Reading the function name, I have no idea what this function is supposed to do.
> 
> 
Yeah sorry about that. I knew even before sending the RFC that this was a bad name :(

I think part of the issue is that pageblock_nr_pages is not really
1 << PAGE_BLOCK_ORDER, but 1 << min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) when
THP is enabled.

With the name, I wanted to highlight that it uses the minimum of the largest
enabled THP order and PAGE_BLOCK_ORDER when calculating the number of pages.
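
The calculation described above can be sketched in userspace C as follows.
This is only an illustration under stated assumptions: the function name
thp_pageblock_nr_pages() and the FALLBACK_NR_PAGES macro are hypothetical,
the orders are passed in as parameters rather than read from kernel state,
and the way the 128-page minimum is applied (only when no hugepage size is
enabled) is an assumption based on the patch description, not on the actual
code.

```c
/* Assumed fallback: the "arbitrary min of 128 pages" from the description. */
#define FALLBACK_NR_PAGES 128UL

/*
 * Hypothetical helper: number of pages in the effective pageblock, using
 * the smaller of the largest enabled THP order and the page block order.
 * largest_thp_order == 0 is treated as "no hugepage size enabled".
 */
static unsigned long thp_pageblock_nr_pages(int largest_thp_order,
					    int page_block_order)
{
	int order;

	if (largest_thp_order == 0)
		return FALLBACK_NR_PAGES;

	order = largest_thp_order < page_block_order ?
		largest_thp_order : page_block_order;

	return 1UL << order;
}
```

Assuming for illustration a page block order of 13 on a 64K PAGE_SIZE
kernel, enabling only 2M (order 5) gives 32 pages per block instead of 8192,
while enabling 512M (order 13) still gives 8192.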