Message-ID: <D3992D01-63C9-42A2-A0C4-0A119AAC1BE2@nvidia.com>
Date: Mon, 09 Jun 2025 16:24:33 -0400
From: Zi Yan <ziy@...dia.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, david@...hat.com,
 Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
 hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
 baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com, npache@...hat.com,
 ryan.roberts@....com, dev.jain@....com, hughd@...gle.com,
 linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
 kernel-team@...a.com, Juan Yescas <jyescas@...gle.com>,
 Breno Leitao <leitao@...ian.org>
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for
 min_free_kbytes

On 9 Jun 2025, at 16:03, Usama Arif wrote:

> On 09/06/2025 20:49, Zi Yan wrote:
>> On 9 Jun 2025, at 15:40, Lorenzo Stoakes wrote:
>>
>>> On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
>>>> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>>>>
>>>>> On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
>>>>>> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>>>>>>
>>>>>>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>>>> So I guess the question is what should be the next step? The following has been discussed:
>>>>>>>>
>>>>>>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>>>>>>>>   and might have unintended consequences if done at runtime, so a no-go?
>>>>>>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>>>>>>>>   The decoupling can be done separately. Watermark calculation can be decoupled using the
>>>>>>>>   approach taken in this RFC, although the max order used by the pagecache still needs to be addressed.
>>>>>>>>
>>>>>>>
>>>>>>> I need to catch up with the thread (workload crazy atm), but why isn't it
>>>>>>> feasible to simply statically adjust the pageblock size?
>>>>>>>
>>>>>>> The whole point of 'defragmentation' is to _heuristically_ make it less
>>>>>>> likely there'll be fragmentation when requesting page blocks.
>>>>>>>
>>>>>>> And the watermark code is explicitly about providing reserves at a
>>>>>>> _pageblock granularity_.
>>>>>>>
>>>>>>> Why would we want to 'defragment' to 512MB physically contiguous chunks
>>>>>>> that we rarely use?
>>>>>>>
>>>>>>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
>>>>>>> level no?
>>>>>>
>>>>>> What is a sensible level? 2MB is a good starting point. If we cap pageblock
>>>>>> at 2MB, everyone should be happy at the moment. But if one user wants to
>>>>>> allocate 4MB mTHP, they will most likely fail miserably: because pageblock
>>>>>> is 2MB, the kernel is OK with having a 2MB MIGRATE_MOVABLE pageblock next to
>>>>>> a 2MB MIGRATE_UNMOVABLE one, making defragmenting to 4MB an impossible job.
>>>>>>
>>>>>> Defragmentation has two components: 1) pageblock, which has migratetypes
>>>>>> to prevent mixing movable and unmovable pages, as a single unmovable page
>>>>>> blocks large free pages from being created; 2) memory compaction granularity,
>>>>>> which is the actual work of moving pages around to form large free pages.
>>>>>> Currently, the kernel assumes pageblock size = defragmentation granularity,
>>>>>> but in reality, as long as pageblock size >= defragmentation granularity,
>>>>>> memory compaction would still work, but not the other way around. So we
>>>>>> need to choose pageblock size carefully to not break memory compaction.
>>>>>
>>>>> OK I get it - the issue is that compaction itself operates at a pageblock
>>>>> granularity, and once you get so fragmented that compaction is critical to
>>>>> defragmentation, you are stuck if the pageblock is not big enough.
>>>>
>>>> Right.
>>>>
>>>>>
>>>>> Thing is, 512MB pageblock size for compaction seems insanely inefficient in
>>>>> itself, and if we're complaining about issues with unavailable reserved
>>>>> memory due to crazy PMD size, surely one will encounter the compaction
>>>>> process simply failing to succeed/taking forever/causing issues with
>>>>> reclaim/higher order folio allocation.
>>>>
>>>> Yep. Initially, we probably never thought PMD THP would be as large as
>>>> 512MB.
>>>
>>> Of course, such is the 'organic' nature of kernel development :)
>>>
>>>>
>>>>>
>>>>> I mean, I don't really know the compaction code _at all_ (ran out of time
>>>>> to cover it in the book ;), but is it all-or-nothing? Does it grab a pageblock or
>>>>> give up?
>>>>
>>>> compaction works on one pageblock at a time, trying to migrate in-use pages
>>>> within the pageblock away to create a free page for THP allocation.
>>>> It assumes PMD THP size is equal to pageblock size. It will keep working
>>>> until a PMD-THP-sized free page is created. This is a very high level
>>>> description, omitting a lot of details like how to avoid excessive compaction
>>>> work and how to reduce compaction latency.
>>>
>>> Yeah this matches my assumptions.
>>>
>>>>
>>>>>
>>>>> Because it strikes me that a crazy pageblock size would cause really
>>>>> serious system issues on that basis alone if that's the case.
>>>>>
>>>>> And again this leads me back to thinking it should just be the page block
>>>>> size _as a whole_ that should be adjusted.
>>>>>
>>>>> Keep in mind a user can literally reduce the page block size already via
>>>>> CONFIG_PAGE_BLOCK_MAX_ORDER.
>>>>>
>>>>> To me it seems that we should cap it at the highest _reasonable_ mTHP size
>>>>> you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
>>>>> system.
>>>>>
>>>>> That way, people _can still get_ super huge PMD-sized folios up to the
>>>>> point of fragmentation.
>>>>>
>>>>> If we do reduce things this way we should give a config option to allow
>>>>> users who truly want colossal PMD sizes with associated
>>>>> watermarks/compaction to be able to still have it.
>>>>>
>>>>> CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>>>>
>>>> I agree with capping pageblock size at the highest reasonable mTHP size.
>>>> In case there is some user relying on such huge PMD THPs, making
>>>> pageblock order a boot-time variable might be a little better, since
>>>> they would not need to recompile the kernel for their needs, assuming
>>>> distros will pick something like 2MB as the default pageblock size.
>>>
>>> Right, this seems sensible, as long as we set a _default_ that limits to
>>> whatever it would be, 2MB or such.
>>>
>>> I don't think it's unreasonable to make that change since this 512 MB thing
>>> is so entirely unexpected and unusual.
>>>
>>> I think Usama said it would be a pain to work this way if it had to be
>>> explicitly set as a boot-time variable without defaulting like this.
>>>
>>>>
>>>>>
>>>>> I also question this de-coupling in general (I may be missing something
>>>>> however!) - the watermark code _very explicitly_ refers to providing
>>>>> _pageblocks_ in order to ensure _defragmentation_, right?
>>>>
>>>> Yes. Since without enough free memory (bigger than a PMD THP),
>>>> memory compaction will just do useless work.
>>>
>>> Yeah right, so this is a key thing and why we need to rework the current
>>> state of the patch.
>>>
>>>>
>>>>>
>>>>> We would need to absolutely justify why it's suddenly ok to not provide
>>>>> page blocks here.
>>>>>
>>>>> This is very very delicate code we have to be SO careful about.
>>>>>
>>>>> This is why I am being cautious here :)
>>>>
>>>> Understood. In theory, we can associate watermarks with THP allowed orders
>>>> the other way around too, meaning if the user lowers vm.min_free_kbytes,
>>>> all THP/mTHP sizes bigger than the watermark threshold are disabled
>>>> automatically. This could fix the memory compaction issues, but
>>>> that might also drive users crazy, as they cannot use the THP sizes
>>>> they want.
>>>
>>> Yeah that's interesting but I think that's just far too subtle and people will
>>> have no idea what's going on.
>>>
>>> I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
>>> go (but overrideable for people crazy enough to truly want 512 MB pages - and
>>> who cannot then complain about watermarks).
>>
>> I agree. Basically, I am thinking:
>> 1) use something like 2MB as the default pageblock size for all arches (the
>> value can be set differently if some arch wants a different pageblock size
>> due to other reasons); this can be done by modifying PAGE_BLOCK_MAX_ORDER's
>> default value;
>>
>> 2) make pageblock_order a boot-time parameter, so that users who want
>> 512MB pages can still get them by changing the pageblock order at boot time.
>>
>> WDYT?
>>
>
> I was really hoping we would come up with a dynamic way of doing this,
> especially one that doesn't require any more input from the user apart
> from just setting the mTHP size via sysfs.

Then we will need to get rid of pageblock size from both the watermark calculation
and memory compaction, and think about a new anti-fragmentation mechanism
to handle unmovable pages, as the current pageblock-based mechanism no longer
fits the need.

What you are expecting is:
1) watermarks should change as the largest enabled THP/mTHP size changes (see
the toy calculation below);
2) memory compaction targets the largest enabled THP/mTHP size (a next step
would be to improve memory compaction to optimize for all enabled sizes);
3) partitions of movable and unmovable pages can change dynamically
based on the largest enabled THP/mTHP size;
4) pageblock size becomes irrelevant.
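
To make 1) concrete, here is a toy userspace calculation (the numbers are just
an example for a 64KB base page arm64 system; none of this is the actual RFC
code) comparing the granularity the watermark/defrag code reserves around
today with what it would reserve around if it followed the largest enabled
mTHP size:

#include <stdio.h>

int main(void)
{
	unsigned long base_kb = 64;		/* 64KB base pages */
	unsigned int pageblock_order = 13;	/* 512MB pageblocks (PMD order) */
	unsigned int largest_enabled_order = 5; /* say only 2MB mTHP is enabled */

	/* granularity the watermark/defrag code reserves around today */
	printf("pageblock-based granularity: %lu MB\n",
	       (base_kb << pageblock_order) >> 10);

	/* granularity if it followed the largest enabled THP/mTHP size */
	printf("largest-enabled-mTHP granularity: %lu MB\n",
	       (base_kb << largest_enabled_order) >> 10);

	return 0;
}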

>
> 1) is in a way already done. We can set it to 2M by setting
> ARCH_FORCE_MAX_ORDER to 5:
>
> In arch/arm64/Kconfig we already have:
>
> config ARCH_FORCE_MAX_ORDER
> 	int
> 	default "13" if ARM64_64K_PAGES
> 	default "11" if ARM64_16K_PAGES
> 	default "10"

Nah, that means users can no longer allocate pages larger than 2MB,
because the cap is in the buddy allocator.
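
Roughly (a toy calculation, treating ARCH_FORCE_MAX_ORDER as the largest
buddy order and assuming 64KB base pages):

#include <stdio.h>

int main(void)
{
	unsigned long page_size_kb = 64;	/* ARM64_64K_PAGES */
	unsigned int max_order = 5;		/* ARCH_FORCE_MAX_ORDER = 5 */

	/* the buddy allocator cannot hand out anything larger than this */
	printf("largest buddy allocation: %lu KB (%lu MB)\n",
	       page_size_kb << max_order,
	       (page_size_kb << max_order) >> 10);

	return 0;
}

So the 512MB PMD THP (and any mTHP above 2MB) would simply become
unallocatable, not just unlikely.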

>
> Doing 2) would require a reboot, and doing this just to change the mTHP size
> will probably be a nightmare for workload orchestration.

No. That is not what I mean. pageblock_order set at boot time only limits
the largest mTHP size. By default, users can get up to 2MB THP/mTHP,
but if they want 512MB THP, they can reboot with a larger pageblock
order and still use 2MB mTHP. The downside is that with a
larger pageblock order, users cannot get the optimal THP/mTHP performance the
kernel is designed to achieve.
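
For reference, a purely illustrative sketch of what such a boot-time knob
could look like (no such parameter exists today; the name is made up, and
this assumes pageblock_order is a runtime variable in the given config):

/* sketch only: hypothetical early_param for a boot-time pageblock order */
static int __init early_pageblock_order(char *buf)
{
	unsigned int order;

	if (kstrtouint(buf, 0, &order))
		return -EINVAL;

	/* never exceed what the buddy allocator can serve */
	pageblock_order = min_t(unsigned int, order, MAX_PAGE_ORDER);
	return 0;
}
early_param("pageblock_order", early_pageblock_order);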

Best Regards,
Yan, Zi
