Message-ID: <AA2C4D68-B1DC-48A6-A807-56516067B9C7@nvidia.com>
Date: Mon, 09 Jun 2025 11:20:04 -0400
From: Zi Yan <ziy@...dia.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Usama Arif <usamaarif642@...il.com>, david@...hat.com,
 Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
 hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
 baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com, npache@...hat.com,
 ryan.roberts@....com, dev.jain@....com, hughd@...gle.com,
 linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
 kernel-team@...a.com, Juan Yescas <jyescas@...gle.com>,
 Breno Leitao <leitao@...ian.org>
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for
 min_free_kbytes

On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:

> On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
>> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>>
>>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>
> [snip]
>
>>>> So I guess the question is what should be the next step? The following has been discussed:
>>>>
>>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>>>>   and might have unintended consequences if done at runtime, so a no go?
>>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>>>>   The decoupling can be done separately. Watermark calculation can be decoupled using the
>>>>   approach taken in this RFC, although the max order used by the page cache still needs to be addressed.
>>>>
>>>
>>> I need to catch up with the thread (workload crazy atm), but why isn't it
>>> feasible to simply statically adjust the pageblock size?
>>>
>>> The whole point of 'defragmentation' is to _heuristically_ make it less
>>> likely there'll be fragmentation when requesting page blocks.
>>>
>>> And the watermark code is explicitly about providing reserves at a
>>> _pageblock granularity_.
>>>
>>> Why would we want to 'defragment' to 512MB physically contiguous chunks
>>> that we rarely use?
>>>
>>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
>>> level no?
>>
>> What is a sensible level? 2MB is a good starting point. If we cap pageblock
>> size at 2MB, everyone should be happy at the moment. But if one user wants to
>> allocate a 4MB mTHP, they will most likely fail miserably, because the pageblock
>> is 2MB and the kernel is fine with having a 2MB MIGRATE_MOVABLE pageblock next
>> to a 2MB MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
>>
>> Defragmentation has two components: 1) pageblocks, which have migratetypes
>> to prevent mixing movable and unmovable pages, since a single unmovable page
>> blocks large free pages from being created; 2) memory compaction granularity,
>> which is the actual work of moving pages around to form large free pages.
>> Currently, the kernel assumes pageblock size = defragmentation granularity,
>> but in reality, as long as pageblock size >= defragmentation granularity,
>> memory compaction still works, but not the other way around. So we
>> need to choose the pageblock size carefully to not break memory compaction.
>
> OK I get it - the issue is that compaction itself operates at a pageblock
> granularity, and once you get so fragmented that compaction is critical to
> defragmentation, you are stuck if the pageblock is not big enough.

Right.
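
To spell the constraint out with a toy example (plain user-space C, nothing
below is a kernel symbol): anti-fragmentation only groups pages at pageblock
granularity, so the largest allocation defragmentation can be expected to
make possible is one pageblock, which is why a 4MB mTHP request on a 2MB
pageblock system is out of luck:

#include <stdio.h>

/*
 * Toy model: defragmentation can only be relied on to produce free
 * chunks up to pageblock size, since migratetypes are tracked per
 * pageblock and compaction works within that granularity.
 */
static int defrag_can_help(unsigned int request_order,
			   unsigned int pageblock_order)
{
	return request_order <= pageblock_order;
}

int main(void)
{
	unsigned int pb_order = 9;	/* 2MB pageblock with 4KB base pages */

	printf("order  9 (2MB PMD THP): %s\n",
	       defrag_can_help(9, pb_order) ?
	       "within a pageblock, defrag can help" : "no guarantee");
	printf("order 10 (4MB mTHP):    %s\n",
	       defrag_can_help(10, pb_order) ?
	       "within a pageblock, defrag can help" : "spans pageblocks, no guarantee");
	return 0;
}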

>
> Thing is, a 512MB pageblock size for compaction seems insanely inefficient in
> itself, and if we're complaining about issues with unavailable reserved
> memory due to a crazy PMD size, surely one will also see the compaction
> process simply failing to succeed, taking forever, or causing issues with
> reclaim/higher-order folio allocation.

Yep. Initially, we probably never thought PMD THP would be as large as
512MB.

>
> I mean, I don't really know the compaction code _at all_ (ran out of time
> to cover it in the book ;), but is it all-or-nothing? Does it grab a pageblock
> or give up?

Compaction works on one pageblock at a time, trying to migrate the in-use pages
within that pageblock away to create a free page for the THP allocation.
It assumes PMD THP size is equal to pageblock size and will keep working
until a PMD-THP-sized free page is created. This is a very high-level
description, omitting a lot of details like how to avoid excessive compaction
work and how to reduce compaction latency.
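
If a sketch helps, below is a heavily simplified user-space model of that
loop (not the real code; the migrate/free scanners, page isolation, deferred
compaction and so on are all left out): walk the zone one pageblock at a
time, pretend every movable page in the current pageblock can be migrated
away, and stop as soon as one fully free, pageblock-sized (i.e. PMD-THP-sized)
chunk exists. It also shows why a single unmovable page defeats a pageblock:

#include <stdbool.h>
#include <stdio.h>

/* Toy model only: one state per base page. */
enum pstate { FREE, MOVABLE, UNMOVABLE };

#define PAGES_PER_BLOCK 8	/* stand-in for pageblock_nr_pages */
#define NR_BLOCKS       4

static enum pstate zone[NR_BLOCKS * PAGES_PER_BLOCK];

/* Is the whole pageblock free, i.e. can it back a PMD THP? */
static bool block_is_free(int blk)
{
	for (int i = 0; i < PAGES_PER_BLOCK; i++)
		if (zone[blk * PAGES_PER_BLOCK + i] != FREE)
			return false;
	return true;
}

/*
 * "Compact" one pageblock: pretend every MOVABLE page in it can be
 * migrated elsewhere.  UNMOVABLE pages stay put and defeat the block.
 */
static bool compact_block(int blk)
{
	for (int i = 0; i < PAGES_PER_BLOCK; i++) {
		enum pstate *p = &zone[blk * PAGES_PER_BLOCK + i];

		if (*p == MOVABLE)
			*p = FREE;	/* migrated away */
	}
	return block_is_free(blk);
}

int main(void)
{
	/* Scatter some movable pages; put one unmovable page in block 0. */
	zone[1] = UNMOVABLE;
	zone[3] = MOVABLE;
	zone[PAGES_PER_BLOCK + 2] = MOVABLE;

	/* One pageblock at a time, stop at the first fully free block. */
	for (int blk = 0; blk < NR_BLOCKS; blk++) {
		if (compact_block(blk)) {
			printf("pageblock %d is now free (PMD THP sized)\n", blk);
			return 0;
		}
	}
	printf("no pageblock could be freed\n");
	return 1;
}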

>
> Because it strikes me that a crazy pageblock size would cause really
> serious system issues on that basis alone if that's the case.
>
> And again this leads me back to thinking it should just be the page block
> size _as a whole_ that should be adjusted.
>
> Keep in mind a user can literally reduce the page block size already via
> CONFIG_PAGE_BLOCK_MAX_ORDER.
>
> To me it seems that we should cap it at the highest _reasonable_ mTHP size
> you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
> system.
>
> That way, people _can still get_ super huge PMD sized huge folios up to the
> point of fragmentation.
>
> If we do reduce things this way, we should give a config option to allow
> users who truly want colossal PMD sizes with the associated
> watermarks/compaction to still be able to have them.
>
> CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?

I agree with capping pageblock size at the highest reasonable mTHP size.
In case some users rely on these huge PMD THPs, making pageblock size
a boot-time variable might be a little better, since they would not
need to recompile the kernel for their needs, assuming
distros pick something like 2MB as the default pageblock size.
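
Something like the sketch below is what I have in mind. To be clear, this is
hypothetical and untested: no such parameter exists today, and pageblock_order
is currently only a variable when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set,
so a real patch would first have to make it a variable unconditionally.

/*
 * Hypothetical "pageblock_order=N" boot parameter, illustration only.
 */
static int __init setup_pageblock_order(char *str)
{
	unsigned int order;

	if (kstrtouint(str, 0, &order))
		return -EINVAL;

	if (order > MAX_PAGE_ORDER)
		order = MAX_PAGE_ORDER;

	/*
	 * A real patch would also want to keep this >= the largest
	 * THP/mTHP order that compaction is expected to serve, per the
	 * "pageblock size >= defragmentation granularity" rule above.
	 */
	pageblock_order = order;
	return 0;
}
early_param("pageblock_order", setup_pageblock_order);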

>
> I also question this de-coupling in general (I may be missing something
> however!) - the watermark code _very explicitly_ refers to providing
> _pageblocks_ in order to ensure _defragmentation_ right?

Yes. Without enough free memory (more than a PMD THP's worth),
memory compaction will just do useless work.
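
To put rough numbers on it: the recommendation scales directly with pageblock
size. The quick user-space approximation below mirrors roughly what
set_recommended_min_free_kbytes() in mm/khugepaged.c computes (the
5%-of-lowmem cap and other details are left out), and shows why a 512MB
pageblock makes the reserve explode:

#include <stdio.h>

/*
 * Rough approximation of the khugepaged recommendation: two free
 * pageblocks per zone for fragmentation avoidance, plus fallback room
 * for each pair of the three pcp migratetypes.
 */
static unsigned long recommended_min_free_kb(unsigned long pageblock_pages,
					     unsigned long page_size_kb,
					     int nr_zones)
{
	const int migrate_pcptypes = 3;	/* unmovable, movable, reclaimable */
	unsigned long pages;

	pages = pageblock_pages * nr_zones * 2;
	pages += pageblock_pages * nr_zones * migrate_pcptypes * migrate_pcptypes;

	return pages * page_size_kb;
}

int main(void)
{
	/* 4KB pages, 2MB pageblock, 1 populated zone: ~22 MB. */
	printf("2MB pageblock:   %lu kB\n",
	       recommended_min_free_kb(512, 4, 1));
	/* 64KB pages, 512MB pageblock, 1 populated zone: ~5.5 GB. */
	printf("512MB pageblock: %lu kB\n",
	       recommended_min_free_kb(8192, 64, 1));
	return 0;
}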

>
> We would need to absolutely justify why it's suddenly ok to not provide
> page blocks here.
>
> This is very very delicate code we have to be SO careful about.
>
> This is why I am being cautious here :)

Understood. In theory, we could associate watermarks with the allowed THP
orders the other way around too, meaning that if a user lowers
vm.min_free_kbytes, all THP/mTHP sizes bigger than what the watermarks can
support are disabled automatically. This could fix the memory compaction
issues, but it might also drive users crazy, as they could not use the
THP sizes they want.
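
Purely as an illustration of that inverse direction (none of this exists in
the kernel; the threshold is just the same reserve arithmetic as above,
solved for the order):

#include <stdio.h>

/*
 * Illustrative only: given the user's vm.min_free_kbytes, find the
 * largest THP/mTHP order whose per-zone reserve requirement still
 * fits, and treat larger orders as disabled.
 */
static int max_supported_thp_order(unsigned long min_free_kb,
				   unsigned long page_size_kb,
				   int nr_zones, int max_order)
{
	for (int order = max_order; order >= 0; order--) {
		unsigned long block_pages = 1UL << order;
		/* 2 free pageblocks + 3x3 migratetype fallback room. */
		unsigned long need_kb =
			block_pages * nr_zones * (2 + 3 * 3) * page_size_kb;

		if (need_kb <= min_free_kb)
			return order;
	}
	return -1;
}

int main(void)
{
	/* 64KB base pages, one zone, min_free_kbytes lowered to 128 MB. */
	int order = max_supported_thp_order(128 * 1024, 64, 1, 13);

	if (order < 0) {
		printf("no THP order fits under that min_free_kbytes\n");
		return 1;
	}
	printf("largest enabled THP order: %d (%lu kB)\n",
	       order, (1UL << order) * 64);
	return 0;
}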

Often, users just ask for an impossible combination: they
want to use all free memory, because they paid for it, and they
want THPs, because they want maximum performance. When PMD THP is
small, like 2MB, the “unusable” free memory is not that noticeable,
but when PMD THP is as large as 512MB, users just cannot unsee it. :)


Best Regards,
Yan, Zi
