Message-ID: <2338896F-7F86-4F5A-A3CC-D14459B8F227@nvidia.com>
Date: Mon, 09 Jun 2025 15:49:52 -0400
From: Zi Yan <ziy@...dia.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Usama Arif <usamaarif642@...il.com>, david@...hat.com,
 Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
 hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
 baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com, npache@...hat.com,
 ryan.roberts@....com, dev.jain@....com, hughd@...gle.com,
 linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
 kernel-team@...a.com, Juan Yescas <jyescas@...gle.com>,
 Breno Leitao <leitao@...ian.org>
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for
 min_free_kbytes

On 9 Jun 2025, at 15:40, Lorenzo Stoakes wrote:

> On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
>> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>>
>>> On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
>>>> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>>>>
>>>>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>>>
>>> [snip]
>>>
>>>>>> So I guess the question is what should be the next step? The following options have been discussed:
>>>>>>
>>>>>> - Changing pageblock_order at runtime: this seems unreasonable after Zi's explanation above
>>>>>>   and might have unintended consequences if done at runtime, so a no-go?
>>>>>> - Decouple only the watermark calculation and defrag granularity from the pageblock order (also from Zi).
>>>>>>   The decoupling can be done separately. The watermark calculation can be decoupled using the
>>>>>>   approach taken in this RFC, although the max order used by the pagecache still needs to be addressed.
>>>>>>
>>>>>
>>>>> I need to catch up with the thread (workload crazy atm), but why isn't it
>>>>> feasible to simply statically adjust the pageblock size?
>>>>>
>>>>> The whole point of 'defragmentation' is to _heuristically_ make it less
>>>>> likely there'll be fragmentation when requesting page blocks.
>>>>>
>>>>> And the watermark code is explicitly about providing reserves at a
>>>>> _pageblock granularity_.
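>>>>>
>>>>> (To put numbers on it: keeping even two pageblocks free per zone is
>>>>> ~4MB with 2MB pageblocks, but a whole 1GB with 512MB ones.)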
>>>>>
>>>>> Why would we want to 'defragment' to 512MB physically contiguous chunks
>>>>> that we rarely use?
>>>>>
>>>>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
>>>>> level, no?
>>>>
>>>> What is a sensible level? 2MB is a good starting point. If we cap pageblock
>>>> size at 2MB, everyone should be happy at the moment. But if a user wants to
>>>> allocate 4MB mTHP, they will most likely fail miserably, because the pageblock
>>>> is 2MB and the kernel is OK with having a 2MB MIGRATE_MOVABLE pageblock next to a 2MB
>>>> MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
>>>>
>>>> Defragmentation has two components: 1) pageblock, which has migratetypes
>>>> to prevent mixing movable and unmovable pages, since a single unmovable page
>>>> blocks large free pages from being created; 2) memory compaction granularity,
>>>> which is the actual work of moving pages around to form large free pages.
>>>> Currently, the kernel assumes pageblock size = defragmentation granularity,
>>>> but in reality, as long as pageblock size >= defragmentation granularity,
>>>> memory compaction will still work, but not the other way around. So we
>>>> need to choose the pageblock size carefully to not break memory compaction.
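>>>>
>>>> Roughly, the invariant looks like this (an illustrative sketch, not
>>>> actual kernel code; with 4KB base pages a 2MB pageblock means
>>>> pageblock_order == 9):
>>>>
>>>> 	/* an order can be reliably defragmented only if compaction
>>>> 	 * can assemble it inside a single pageblock */
>>>> 	static inline bool order_defraggable(unsigned int order)
>>>> 	{
>>>> 		return order <= pageblock_order;
>>>> 	}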
>>>
>>> OK I get it - the issue is that compaction itself operates at a pageblock
>>> granularity, and once you get so fragmented that compaction is critical to
>>> defragmentation, you are stuck if the pageblock is not big enough.
>>
>> Right.
>>
>>>
>>> Thing is, a 512MB pageblock size for compaction seems insanely inefficient in
>>> itself, and if we're complaining about issues with unavailable reserved
>>> memory due to a crazy PMD size, surely one will also encounter the compaction
>>> process simply failing to succeed/taking forever/causing issues with
>>> reclaim/higher-order folio allocation.
>>
>> Yep. Initially, we probably never thought PMD THP would be as large as
>> 512MB.
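>>
>> (The 512MB falls straight out of the page table geometry: with a 64KB
>> base page and 8-byte page table entries, one table holds 64K / 8 = 8192
>> entries, so a PMD maps 8192 * 64KB = 512MB.)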
>
> Of course, such is the 'organic' nature of kernel development :)
>
>>
>>>
>>> I mean, I don't really know the compaction code _at all_ (ran out of time
>>> to cover it in the book ;), but is it all-or-nothing? Does it grab a pageblock or
>>> give up?
>>
>> Compaction works on one pageblock at a time, trying to migrate the in-use pages
>> within the pageblock away to create a free page for THP allocation.
>> It assumes the PMD THP size is equal to the pageblock size, and it will keep
>> working until a PMD-THP-sized free page is created. This is a very high-level
>> description, omitting a lot of details like how excessive compaction work is
>> avoided and how compaction latency is reduced.
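>>
>> At a very high level the loop is just (illustrative pseudo-C with
>> hypothetical helper names, not the actual mm/compaction.c code paths):
>>
>> 	/* walk the zone one pageblock at a time */
>> 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
>> 		isolate_movable_pages_in_block(pfn);	/* hypothetical */
>> 		migrate_isolated_pages_elsewhere();	/* hypothetical */
>> 		if (zone_has_free_page(zone, HPAGE_PMD_ORDER))	/* hypothetical */
>> 			return COMPACT_SUCCESS;
>> 	}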
>
> Yeah this matches my assumptions.
>
>>
>>>
>>> Because it strikes me that a crazy pageblock size would cause really
>>> serious system issues on that basis alone if that's the case.
>>>
>>> And again this leads me back to thinking it should just be the page block
>>> size _as a whole_ that should be adjusted.
>>>
>>> Keep in mind a user can literally reduce the page block size already via
>>> CONFIG_PAGE_BLOCK_MAX_ORDER.
>>>
>>> To me it seems that we should cap it at the highest _reasonable_ mTHP size
>>> you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
>>> system.
>>>
>>> That way, people _can still get_ super huge PMD sized huge folios up to the
>>> point of fragmentation.
>>>
>>> If we do reduce things this way we should give a config option to allow
>>> users who truly want colossal PMD sizes, with the associated
>>> watermarks/compaction, to still be able to have them.
>>>
>>> CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
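>>>
>>> i.e. something along the lines of (illustrative only):
>>>
>>> 	order = min(order,
>>> 		    get_order(CONFIG_PAGE_BLOCK_HARD_LIMIT_MB << 20));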
>>
>> I agree with capping pageblock size at a highest reasonable mTHP size.
>> In case some user is relying on these huge PMD THPs, making the
>> pageblock size a boot time variable might be a little better, since
>> they would not need to recompile the kernel for their needs, assuming
>> distros will pick something like 2MB as the default pageblock size.
>
> Right, this seems sensible, as long as we set a _default_ that limits it to
> whatever it would be, 2MB or such.
>
> I don't think it's unreasonable to make that change since this 512 MB thing
> is so entirely unexpected and unusual.
>
> I think Usama said it would be a pain for it to work this way if it had to be
> explicitly set as a boot time variable without defaulting like this.
>
>>
>>>
>>> I also question this de-coupling in general (I may be missing something
>>> however!) - the watermark code _very explicitly_ refers to providing
>>> _pageblocks_ in order to ensure _defragmentation_ right?
>>
>> Yes. Without enough free memory (more than a PMD THP's worth),
>> memory compaction will just do useless work.
>
> Yeah right, so this is a key thing and why we need to rework the current
> state of the patch.
>
>>
>>>
>>> We would need to absolutely justify why it's suddenly ok to not provide
>>> page blocks here.
>>>
>>> This is very very delicate code we have to be SO careful about.
>>>
>>> This is why I am being cautious here :)
>>
>> Understood. In theory, we can associate the watermarks with the allowed THP
>> orders the other way around too, meaning that if the user lowers
>> vm.min_free_kbytes, all THP/mTHP sizes bigger than the watermark threshold
>> are disabled automatically. This could fix the memory compaction issues,
>> but it might also drive users crazy, as they could not use the THP sizes
>> they want.
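>>
>> As a sketch (hypothetical helpers, not existing kernel functions):
>>
>> 	/* if vm.min_free_kbytes is lowered below what an order needs,
>> 	 * drop that order from the allowed (m)THP sizes */
>> 	for (order = 2; order <= HPAGE_PMD_ORDER; order++)
>> 		if (min_free_bytes() < (PAGE_SIZE << order))	/* hypothetical */
>> 			thp_forbid_order(order);		/* hypothetical */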
>
> Yeah that's interesting but I think that's just far too subtle and people will
> have no idea what's going on.
>
> I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
> go (but overrideable for people crazy enough to truly want 512 MB pages - and
> who cannot then complain about watermarks).

I agree. Basically, I am thinking:
1) use something like 2MB as the default pageblock size for all arches
(the value can be set differently if some arch wants a different pageblock
size due to other reasons); this can be done by modifying
PAGE_BLOCK_MAX_ORDER's default value;

2) make pageblock_order a boot time parameter, so that a user who wants
512MB pages can still get them by changing the pageblock order at boot time.
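
As a rough sketch of 2) (names illustrative, error handling elided, and
pageblock_order would have to stop being a compile-time macro on some
configs):

	static int __init early_pageblock_order(char *buf)
	{
		unsigned long order;

		if (kstrtoul(buf, 0, &order) || order > PAGE_BLOCK_MAX_ORDER)
			return -EINVAL;
		pageblock_order = order;
		return 0;
	}
	early_param("pageblock_order", early_pageblock_order);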

WDYT?

>
>>
>> Often, users just ask for an impossible combination: they
>> want to use all of their free memory, because they paid for it, and they
>> want THPs, because they want maximum performance. When the PMD THP is
>> small, like 2MB, the “unusable” free memory is not that noticeable,
>> but when the PMD THP is as large as 512MB, users just cannot unsee it. :)
>
> Well, users asking for crazy things then being surprised when they get them
> is nothing new :P
>
>>
>>
>> Best Regards,
>> Yan, Zi
>
> Thanks for your input!
>
> Cheers, Lorenzo


Best Regards,
Yan, Zi
