linux-kernel - Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free

Message-ID: <CEC45D33-53C3-4D74-A70C-2FCE8A3911D5@nvidia.com>
Date: Sat, 07 Jun 2025 20:04:32 -0400
From: Zi Yan <ziy@...dia.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Usama Arif <usamaarif642@...il.com>,
 Andrew Morton <akpm@...ux-foundation.org>, david@...hat.com,
 linux-mm@...ck.org, hannes@...xchg.org, shakeel.butt@...ux.dev,
 riel@...riel.com, baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com,
 npache@...hat.com, ryan.roberts@....com, dev.jain@....com, hughd@...gle.com,
 linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
 kernel-team@...a.com, Juan Yescas <jyescas@...gle.com>,
 Breno Leitao <leitao@...ian.org>
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for
 min_free_kbytes

On 7 Jun 2025, at 4:35, Lorenzo Stoakes wrote:

> On Fri, Jun 06, 2025 at 12:10:43PM -0400, Zi Yan wrote:
>> On 6 Jun 2025, at 11:38, Usama Arif wrote:
>>
>>> On 06/06/2025 16:18, Zi Yan wrote:
>>>> On 6 Jun 2025, at 10:37, Usama Arif wrote:
>>>>
>>>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>>>>> watermarks are evaluated to extremely high values, for e.g. a server with
>>>>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>>>>> of the sizes set to never, the min, low and high watermarks evaluate to
>>>>> 11.2G, 14G and 16.8G respectively.
>>>>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>>>>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>>>>> and 1G respectively.
>>>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>>>> Such high watermark values can cause performance and latency issues in
>>>>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
>>>>> most of them would never actually use a 512M PMD THP.
>>>>>
>>>>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>>>>> folio order enabled in set_recommended_min_free_kbytes.
>>>>> With this patch, when only 2M THP hugepage size is set to madvise for the
>>>>> same machine with 64K page size, with the rest of the sizes set to never,
>>>>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>>>> respectively. When 512M THP hugepage size is set to madvise for the same
>>>>> machine with 64K page size, the min, low and high watermarks evaluate to
>>>>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>>>
>>>> Getting pageblock_order involved here might be confusing. I think you just
>>>> want to adjust min, low and high watermarks to reasonable values.
>>>> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
>>>> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
>>>> look reasonable to me.
>>>
>>> Hi Zi,
>>>
>>> Thanks for the review!
>>>
>>> I forgot to change it in another place, sorry about that! So can't move
>>> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
>>> Have added the additional place where min_thp_pageblock_nr_pages() is called
>>> as a fixlet here:
>>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
>>>
>>> I think at least in this context the original name pageblock_nr_pages isn't
>>> correct as it's min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
>>> The new name min_thp_pageblock_nr_pages is also not really good, so happy
>>> to change it to something appropriate.
>>
>> Got it. pageblock is the defragmentation granularity. If user only wants
>> 2MB mTHP, maybe pageblock order should be adjusted. Otherwise,
>> kernel will defragment at 512MB granularity, which might not be efficient.
>> Maybe make pageblock_order a boot time parameter?
>>
>> In addition, we are mixing two things together:
>> 1. min, low, and high watermarks: they affect when memory reclaim and compaction
>>    will be triggered;
>> 2. pageblock order: it is the granularity of defragmentation for creating
>>    mTHP/THP.
>>
>> In your use case, you want to lower watermarks, right? Considering what you
>> said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
>> like a new sysctl knob, vm.force_min_free_kbytes (yeah the suggestion
>> is lame, sorry).
>
> Hmmm :>) I really think this is something we should do automatically.
>
> I know it's becoming silly as Usama and others have clearly demonstrated the 'T'
> in THP doesn't stand for transparent, but I think providing a new sysctl for an
> apparently automated system is not the way to go, especially as we intend to
> make it more automagic in future.

Right. I think the current setting, which boosts watermarks based on THP size,
is too conservative: it implies we are so afraid of failing to provide a THP
when memory is tight that we hold memory back. But preventing the user from
using all available memory is silly. Maybe we should just get rid of the
watermark adjustment code in khugepaged. If the user wants to use all available
memory, they pay the penalty of not easily getting THPs from the system. The
kernel should not make that decision for the user.
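
For reference, a rough userspace sketch (Python, for illustration only) of the
watermark boost under discussion. It follows the arithmetic of
set_recommended_min_free_kbytes() in mm/khugepaged.c as it stands without the
patch: 2 pageblocks per zone, plus MIGRATE_PCPTYPES * MIGRATE_PCPTYPES
pageblocks per zone, capped at 5% of lowmem. The function name, the zone count
and the "all 480G is lowmem" figure are assumptions made for the example, not
values taken from Usama's machine, so the output is not expected to match the
watermarks quoted above exactly.

```python
MIGRATE_PCPTYPES = 3  # unmovable, movable, reclaimable


def recommended_min_free_kbytes(page_shift, pageblock_order, nr_zones, lowmem_pages):
    """Approximate khugepaged's recommended min_free_kbytes (hypothetical helper)."""
    pageblock_nr_pages = 1 << pageblock_order
    # Keep 2 pageblocks per zone free to assist fragmentation avoidance.
    pages = pageblock_nr_pages * nr_zones * 2
    # Keep extra pageblocks per zone so migratetype fallbacks stay rare.
    pages += pageblock_nr_pages * nr_zones * MIGRATE_PCPTYPES * MIGRATE_PCPTYPES
    # Never reserve more than 5% of lowmem.
    pages = min(pages, lowmem_pages // 20)
    # Convert pages to kilobytes.
    return pages << (page_shift - 10)


if __name__ == "__main__":
    nr_zones = 2  # assumption: two populated zones at or below ZONE_NORMAL
    for name, page_shift, order in (("64K pages, 512MB pageblock", 16, 13),
                                    ("4K pages, 2MB pageblock", 12, 9)):
        lowmem_pages = (480 << 30) >> page_shift  # assumption: all 480G counts
        kb = recommended_min_free_kbytes(page_shift, order, nr_zones, lowmem_pages)
        print(f"{name}: {kb} kB (~{kb / (1 << 20):.2f} GiB)")
```

Each zone costs 2 + 3 * 3 = 11 pageblocks before the cap: 5.5GB per zone with a
512MB pageblock, versus 22MB per zone with a 2MB pageblock. khugepaged only
raises min_free_kbytes when this recommendation exceeds the current value.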

>
>>
>> I think for 2, we might want to decouple pageblock order from defragmentation
>> granularity.
>
> Well, isn't pageblock order explicitly a heuristic for defragmenting physical
> memory for the purposes of higher order allocations?
>
> I don't think we can decouple that.

Yes, but the pageblock is also the minimal unit for memory hot-add and
hot-remove, so a bigger pageblock is not memory-hotplug friendly. And the main
use is pageblock isolation, which takes free pages away from any possible user.

In terms of defragmentation, pageblock has two purposes: 1) pageblock size
matches THP size, so memory compaction can migrate in-use pages to create
a THP-size free page; 2) avoid mixing movable and unmovable pages to avoid
wasting memory compaction effort, since a single unmovable page makes
a whole pageblock unsuitable for THP creation with the help of memory
compaction.

Now we have mTHP, whose sizes vary from order-1 (anon starts from order-2)
to PMD-order. But if the user only wants a smaller mTHP size (like in this
case, 2MB mTHP in a system with 512MB THP), having a large pageblock might not
be efficient: why defragment a 512MB range for a 2MB mTHP?
I do not have data to support my claim yet, since it is possible that
defragmenting at a range larger than the THP size can provide a better THP
creation success rate. So some study is needed to understand the impact of
defragmentation granularity on THP creation.
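
To put numbers on the size mismatch, a minimal sketch of the order arithmetic,
assuming 8-byte page table entries (as on arm64 and x86-64). The helper names
are made up for this example and are not kernel functions.

```python
def pmd_order(page_shift, pte_size_bytes=8):
    """Order of a PMD-sized THP: one page-table page holds PAGE_SIZE / pte_size entries."""
    return page_shift - (pte_size_bytes.bit_length() - 1)


def order_for_size(size_bytes, page_shift):
    """Page order of a power-of-two size, e.g. a 2MB mTHP."""
    return (size_bytes >> page_shift).bit_length() - 1


def thp_size_bytes(order, page_shift):
    """Bytes covered by a folio of the given order."""
    return 1 << (order + page_shift)


if __name__ == "__main__":
    for page_shift in (12, 16):  # 4K and 64K base pages
        pmd = pmd_order(page_shift)
        mthp = order_for_size(2 << 20, page_shift)
        print(f"PAGE_SIZE={1 << (page_shift - 10)}K: PMD order {pmd} "
              f"({thp_size_bytes(pmd, page_shift) >> 20}MB), "
              f"2MB is order {mthp}, "
              f"{1 << (pmd - mthp)} x 2MB per PMD-sized pageblock")
```

With 64K pages a 2MB mTHP is order-5 while the PMD order is 13, so a PMD-sized
pageblock spans 256 such mTHPs; with 4K pages the two sizes coincide.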

A single granularity, i.e., one pageblock size, which determines defragmentation
granularity, cannot rule all mTHP sizes. That is why I am thinking about
decoupling pageblock size from defragmentation granularity.
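
Which mTHP sizes are actually in play on a given system can be read from
/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled. Below is a
small userspace approximation of the "largest enabled order" idea from the
patch. It is not the patch's code: the kernel works on its internal order
bitmaps, and the function names here are hypothetical. The sketch takes the
sysfs strings as arguments so it can be tried without touching a real system.

```python
import re
from typing import Dict, Optional


def selected(value: str) -> str:
    """Return the bracketed choice from a string like 'always inherit madvise [never]'."""
    match = re.search(r"\[(\w+)\]", value)
    if match is None:
        raise ValueError(f"no selected value in {value!r}")
    return match.group(1)


def largest_enabled_order(per_size: Dict[int, str], top_level: str,
                          page_size_kb: int) -> Optional[int]:
    """Largest order whose setting resolves to 'always' or 'madvise', else None.

    per_size maps a hugepage size in kB to the contents of its 'enabled' file;
    top_level is the contents of transparent_hugepage/enabled, which a
    per-size setting of 'inherit' falls back to.
    """
    top = selected(top_level)
    best = None
    for size_kb, value in per_size.items():
        choice = selected(value)
        if choice == "inherit":
            choice = top
        if choice in ("always", "madvise"):
            order = (size_kb // page_size_kb).bit_length() - 1
            best = order if best is None else max(best, order)
    return best


if __name__ == "__main__":
    # The configuration from this thread: 64K pages, only 2MB set to madvise.
    sizes = {
        2048: "always inherit [madvise] never",
        524288: "always inherit madvise [never]",
    }
    print(largest_enabled_order(sizes, "always madvise [never]", page_size_kb=64))
```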


>
> But I think we can say, as the existence of PAGE_BLOCK_MAX_ORDER already sort of
> implies, 'we are fine with increasing the chances of fragmentation of
> <ridiculously huge page size> in order to improve reclaim behaviour'.

Right, especially since these huge page sizes are rarely used.

>
> And again really strikes me that the parameter to adjust here is pageblock size,
> maybe default max size for systems with very large page table size.

Short term, yes, since watermarks are tied to pageblock size. The rationale
is that pageblock size equals THP size and we want to make some guarantee
on THP creation.
>
> The THP mechanism is meant to be 'best effort' and opportunistic right? So it's
> ok if we aren't quite perfect in providing crazy huge page sizes.

Yes. And changing pageblock size to lower the watermarks, and thus give more
free memory to the user, might be better than having guarantees on creating a
rarely used THP size.

>
> I think 'on arm64 64KB we give up on page block beyond sensible mTHP size' is
> really a fine thing to do, and implementable by just... changing max pageblock
> order :>)
>
> Not having pageblocks at the crazy size doesn't mean those regions won't exist,
> it just means they're more likely not to due to fragmentation.
>
> 512MB PMD's... man haha.

Right. One caveat is that pageblock size currently can only be changed via
Kconfig, so if a user wants a different mTHP size than 2MB, they will need
to build a different kernel. Yes, we can make pageblock order a boot time
parameter (I proposed it in Juan's patch). That implies that if a user wants a
different mTHP size, they need to reboot the machine. It is slightly better
than kernel compilation. Making pageblock size changeable at runtime might be
too complicated and involve a lot of runtime cost for merging and splitting
pageblocks. That is why I want to decouple pageblock from defragmentation
granularity. Yeah, it is going to be a big project. :)

>
>>
>>
>>>>
>>>> Another concern on tying watermarks to highest THP order is that if
>>>> user enables PMD THP on such systems with 2MB mTHP enabled initially,
>>>> it could trigger unexpected memory reclaim and compaction, right?
>>>> That might surprise user, since they just want to adjust availability
>>>> of THP sizes, but the whole system suddenly begins to be busy.
>>>> Have you experimented with it?
>>>>
>>>
>>> Yes I would imagine it would trigger reclaim and compaction if the system memory
>>> is too low, but that should hopefully be expected? If the user is enabling 512M
>>> THP, they should expect changes by kernel to allow them to give hugepage of
>>> that size.
>>> Also hopefully, no one is enabling PMD THPs when the system is so low on
>>> memory that it triggers reclaim! There would be an OOM after just a few
>>> of those are faulted in.


--
Best Regards,
Yan, Zi