Message-ID: <76D057AA-58C1-46A0-B067-EB78FE5D2D37@nvidia.com>
Date: Mon, 09 Jun 2025 09:28:06 -0400
From: Zi Yan <ziy@...dia.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: David Hildenbrand <david@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
baolin.wang@...ux.alibaba.com, lorenzo.stoakes@...cle.com,
Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
dev.jain@....com, hughd@...gle.com, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, kernel-team@...a.com,
Matthew Wilcox <willy@...radead.org>
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for
min_free_kbytes
On 9 Jun 2025, at 7:34, Usama Arif wrote:
> On 06/06/2025 18:37, David Hildenbrand wrote:
>> On 06.06.25 16:37, Usama Arif wrote:
>>> On arm64 machines with 64K PAGE_SIZE, min_free_kbytes and hence the
>>> watermarks evaluate to extremely high values. For example, on a server
>>> with 480G of memory, with only the 2M mTHP hugepage size set to madvise
>>> and the rest of the sizes set to never, the min, low and high watermarks
>>> evaluate to 11.2G, 14G and 16.8G respectively.
>>> In contrast, with 4K PAGE_SIZE on the same machine and only the 2M THP
>>> hugepage size set to madvise, the min, low and high watermarks evaluate
>>> to 86M, 566M and 1G respectively.
>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>> Such high watermark values can cause performance and latency issues in
>>> memory-bound applications on arm64 servers that use 64K PAGE_SIZE, even
>>> though most of them would never actually use a 512M PMD THP.
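>>>
>>> (Rough sketch of the arithmetic, assuming two populated zones on a
>>> single node: set_recommended_min_free_kbytes reserves roughly
>>> pageblock_nr_pages * nr_zones * (2 + MIGRATE_PCPTYPES * MIGRATE_PCPTYPES)
>>> pages, so with a 512M pageblock and MIGRATE_PCPTYPES = 3 that is
>>> 512M * 2 * 11 = ~11.2G, matching the min watermark above.)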
>>>
>>> Instead of using HPAGE_PMD_ORDER for pageblock_order, use the largest
>>> enabled large folio order in set_recommended_min_free_kbytes.
>>> With this patch, when only the 2M THP hugepage size is set to madvise on
>>> the same machine with 64K page size, with the rest of the sizes set to
>>> never, the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>> respectively. When the 512M THP hugepage size is set to madvise on the
>>> same machine with 64K page size, the min, low and high watermarks evaluate
>>> to 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>>
>>> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
>>> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
>>> is not dynamic with hugepage size, would need different kernel builds for
>>> different hugepage sizes, and most users won't know that this needs to be
>>> done, as it can be difficult to determine that the performance and latency
>>> issues are coming from the high watermark values.
>>>
>>> All watermark numbers are for the zones of the nodes that had the highest
>>> number of pages, e.g. the min value for 4K is obtained using:
>>> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
>>> and for 64K using:
>>> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>>>
>>> An arbitrary minimum of 128 pages is used when no hugepage sizes are
>>> enabled.
>>>
>>> Signed-off-by: Usama Arif <usamaarif642@...il.com>
>>> ---
>>> include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>>> mm/khugepaged.c | 32 ++++++++++++++++++++++++++++----
>>> mm/shmem.c | 29 +++++------------------------
>>> 3 files changed, 58 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 2f190c90192d..fb4e51ef0acb 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>>> }
>>> #endif
>>> +/*
>>> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>>> + *
>>> + * SHMEM_HUGE_NEVER:
>>> + * disables huge pages for the mount;
>>> + * SHMEM_HUGE_ALWAYS:
>>> + * enables huge pages for the mount;
>>> + * SHMEM_HUGE_WITHIN_SIZE:
>>> + * only allocate huge pages if the page will be fully within i_size,
>>> + * also respect madvise() hints;
>>> + * SHMEM_HUGE_ADVISE:
>>> + * only allocate huge pages if requested with madvise();
>>> + */
>>> +
>>> + #define SHMEM_HUGE_NEVER 0
>>> + #define SHMEM_HUGE_ALWAYS 1
>>> + #define SHMEM_HUGE_WITHIN_SIZE 2
>>> + #define SHMEM_HUGE_ADVISE 3
>>> +
>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> extern unsigned long transparent_hugepage_flags;
>>> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
>>> extern unsigned long huge_anon_orders_madvise;
>>> extern unsigned long huge_anon_orders_inherit;
>>> +extern int shmem_huge __read_mostly;
>>> +extern unsigned long huge_shmem_orders_always;
>>> +extern unsigned long huge_shmem_orders_madvise;
>>> +extern unsigned long huge_shmem_orders_inherit;
>>> +extern unsigned long huge_shmem_orders_within_size;
>>
>> Do all of these really have to be exported?
>>
>
> Hi David,
>
> Thanks for the review!
>
> For the RFC, I just did it similarly to the anon ones when I got a build
> error trying to use these, but yeah, a much better approach would be to
> have a function in shmem that returns the largest allowable shmem THP order.
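>
> Untested sketch of what I mean (the name and placement are illustrative
> only; it mirrors the logic below, but lives in mm/shmem.c so none of the
> order masks need exporting):
>
> /* Largest shmem THP order allowed by the current sysfs settings. */
> int shmem_largest_allowable_order(void)
> {
> 	unsigned long orders = READ_ONCE(huge_shmem_orders_always)
> 		| READ_ONCE(huge_shmem_orders_madvise)
> 		| READ_ONCE(huge_shmem_orders_within_size);
>
> 	if (shmem_huge != SHMEM_HUGE_NEVER)
> 		orders |= READ_ONCE(huge_shmem_orders_inherit);
>
> 	return orders ? fls(orders) - 1 : 0;
> }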
>
>>> +
>>> static inline bool hugepage_global_enabled(void)
>>> {
>>> return transparent_hugepage_flags &
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 15203ea7d007..e64cba74eb2a 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
>>> return 0;
>>> }
>>> +static int thp_highest_allowable_order(void)
>>
>> Did you mean "largest" ?
>
> Yes
>
>>
>>> +{
>>> + unsigned long orders = READ_ONCE(huge_anon_orders_always)
>>> + | READ_ONCE(huge_anon_orders_madvise)
>>> + | READ_ONCE(huge_shmem_orders_always)
>>> + | READ_ONCE(huge_shmem_orders_madvise)
>>> + | READ_ONCE(huge_shmem_orders_within_size);
>>> + if (hugepage_global_enabled())
>>> + orders |= READ_ONCE(huge_anon_orders_inherit);
>>> + if (shmem_huge != SHMEM_HUGE_NEVER)
>>> + orders |= READ_ONCE(huge_shmem_orders_inherit);
>>> +
>>> + return orders == 0 ? 0 : fls(orders) - 1;
>>> +}
>>
>> But how does this interact with large folios / THPs in the page cache?
>>
>
> Yes this will be a problem.
>
> From what I see, there doesn't seem to be a max order for pagecache, only
> mapping_set_folio_min_order for the min.
Actually, there is one[1]. It is currently limited by xas_split_alloc(),
and the limit can be lifted once xas_split_alloc() is gone (which implies
READ_ONLY_THP_FOR_FS needs to go).
[1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pagemap.h#L377
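
For reference, the definition there is roughly (paraphrasing the linked
v6.15 source; see [1] for the exact form):

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define PREFERRED_MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
#else
#define PREFERRED_MAX_PAGECACHE_ORDER	8
#endif

/*
 * xas_split_alloc() does not support arbitrary orders. This implies no
 * 512MB THP on ARM64 with 64KB base page size.
 */
#define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)
#define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)

With XA_CHUNK_SHIFT = 6, that caps pagecache folios at order 11, i.e. 128M
with a 64K base page rather than the 512M PMD size.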
> Does this mean that pagecache can fault in 128M, 256M, 512M large folios?
>
> I think this could increase the OOM rate significantly when ARM64 servers
> are used with filesystems that support large folios.
>
> Should there be an upper limit for the pagecache? If so, it would either be
> a new sysfs entry (which I don't like :( ) or an attempt to reuse the
> existing entries with something like thp_highest_allowable_order?
At the moment, MAX_PAGECACHE_ORDER limits the max folio size in theory,
and IIRC the readahead code only reads in folios up to PMD order.
--
Best Regards,
Yan, Zi