Message-ID: <61da7d25-f115-4be3-a09f-7696efe7f0ec@lucifer.local>
Date: Mon, 9 Jun 2025 15:50:34 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Zi Yan <ziy@...dia.com>
Cc: Usama Arif <usamaarif642@...il.com>, david@...hat.com,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com,
npache@...hat.com, ryan.roberts@....com, dev.jain@....com,
hughd@...gle.com, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, kernel-team@...a.com,
Juan Yescas <jyescas@...gle.com>, Breno Leitao <leitao@...ian.org>
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for
min_free_kbytes
On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>
> > On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
[snip]
> >> So I guess the question is what should be the next step? The following has been discussed:
> >>
> >> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
> >> and might have unintended consequences if done at runtime, so a no go?
> >> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
> >> The decoupling can be done separately. Watermark calculation can be decoupled using the
> >> approach taken in this RFC. Although max order used by pagecache needs to be addressed.
> >>
> >
> > I need to catch up with the thread (workload crazy atm), but why isn't it
> > feasible to simply statically adjust the pageblock size?
> >
> > The whole point of 'defragmentation' is to _heuristically_ make it less
> > likely there'll be fragmentation when requesting page blocks.
> >
> > And the watermark code is explicitly about providing reserves at a
> > _pageblock granularity_.
> >
> > Why would we want to 'defragment' to 512MB physically contiguous chunks
> > that we rarely use?
> >
> > Since it's all heuristic, it seems reasonable to me to cap it at a sensible
> > level no?
>
> What is a sensible level? 2MB is a good starting point. If we cap pageblock
> at 2MB, everyone should be happy at the moment. But if one user wants to
> allocate 4MB mTHP, they will most likely fail miserably, because the
> pageblock is 2MB and the kernel is OK with having a 2MB MIGRATE_MOVABLE
> pageblock next to a 2MB MIGRATE_UNMOVABLE one, making defragmenting 4MB
> an impossible job.
>
> Defragmentation has two components: 1) pageblock, which has migratetypes
> to prevent mixing movable and unmovable pages, as a single unmovable page
> blocks large free pages from being created; 2) memory compaction granularity,
> which is the actual work of moving pages around to form large free pages.
> Currently, the kernel assumes pageblock size = defragmentation granularity,
> but in reality, as long as pageblock size >= defragmentation granularity,
> memory compaction would still work, but not the other way around. So we
> need to choose pageblock size carefully to not break memory compaction.
OK I get it - the issue is that compaction itself operates at a pageblock
granularity, and once you get so fragmented that compaction is critical to
defragmentation, you are stuck if the pageblock is not big enough.
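Rough sketch of what I mean by "operates at a pageblock granularity" - not
the real compaction code, just the shape of it, names made up - the scanner
walks the zone one pageblock at a time, so the unit of work scales directly
with pageblock_order:

/*
 * Illustrative only: a migration scanner advancing one pageblock at a
 * time.  At order 13 on 64K base pages every step covers 512MB worth
 * of base pages.
 */
static void compact_zone_sketch(unsigned long start_pfn,
				unsigned long end_pfn,
				unsigned int pageblock_order)
{
	unsigned long nr = 1UL << pageblock_order;
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn += nr) {
		/* isolate + migrate movable pages in [pfn, pfn + nr) */
	}
}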
Thing is, a 512MB pageblock size for compaction seems insanely inefficient
in itself, and if we're complaining about unavailable reserved memory due to
a crazy PMD size, surely one will also encounter the compaction process
simply failing to succeed/taking forever/causing issues with reclaim and
higher-order folio allocation.
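Just to put rough numbers on the "unavailable reserved memory" part - this
is from memory of what khugepaged's set_recommended_min_free_kbytes() asks
for (roughly a dozen pageblocks per populated zone), so treat the constants
as a sketch rather than gospel:

/* Standalone back-of-the-envelope, not kernel code. */
#include <stdio.h>

int main(void)
{
	unsigned long page_kb = 64;             /* 64K base pages */
	unsigned long blocks_per_zone = 11;     /* ~2 + 3*3 in the sketch */
	unsigned long nr_zones = 3;

	unsigned long pb_2m   = (2UL   << 10) / page_kb; /* pages per 2MB block */
	unsigned long pb_512m = (512UL << 10) / page_kb; /* pages per 512MB block */

	printf("2MB pageblocks   -> ~%lu MB reserved\n",
	       (pb_2m   * blocks_per_zone * nr_zones * page_kb) >> 10);
	printf("512MB pageblocks -> ~%lu MB reserved\n",
	       (pb_512m * blocks_per_zone * nr_zones * page_kb) >> 10);
	return 0;
}

That works out to tens of MB with 2MB pageblocks vs. double-digit GB with
512MB ones in this toy configuration.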
I mean, I don't really know the compaction code _at all_ (ran out of time
to cover it in the book ;), but is it all-or-nothing? Does it grab a
pageblock or give up?
Because it strikes me that a crazy pageblock size would cause really
serious system issues on that basis alone if that's the case.
And again this leads me back to thinking it should just be the page block
size _as a whole_ that should be adjusted.
Keep in mind a user can literally reduce the page block size already via
CONFIG_PAGE_BLOCK_MAX_ORDER.
To me it seems that we should cap it at the highest _reasonable_ mTHP size
you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
system.
That way, people _can still get_ super huge PMD sized huge folios up to the
point of fragmentation.
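For concreteness, the arithmetic I have in mind (pure arithmetic, nothing
kernel-specific): 64K pages give you 8192 8-byte PTEs per table, hence the
512MB PMD, while the 2MB "good starting point" Zi mentions is only order 5
there:

#include <stdio.h>

int main(void)
{
	unsigned int page_shift = 16;   /* 64K base pages */
	unsigned int pmd_order  = 13;   /* 64K / 8-byte PTEs = 8192 entries */

	printf("PMD-sized pageblock: %lu MB (order %u)\n",
	       (1UL << (page_shift + pmd_order)) >> 20, pmd_order);
	printf("2MB-sized pageblock: order %u\n", 21 - page_shift);
	return 0;
}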
If we do reduce things this way we should provide a config option to allow
users who truly want colossal PMD sizes, with the associated
watermarks/compaction, to still be able to have them.
CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
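Something like the below is all I mean - entirely hypothetical, nothing
like it exists, and the names are made up on the spot - just a build-time
clamp on the pageblock order:

/* Hypothetical: clamp pageblock order to a build-time MB limit. */
#ifdef CONFIG_PAGE_BLOCK_HARD_LIMIT_MB
#define PAGE_BLOCK_HARD_LIMIT_ORDER \
	ilog2((CONFIG_PAGE_BLOCK_HARD_LIMIT_MB << 20) >> PAGE_SHIFT)
#define capped_pageblock_order \
	min(HPAGE_PMD_ORDER, PAGE_BLOCK_HARD_LIMIT_ORDER)
#else
#define capped_pageblock_order HPAGE_PMD_ORDER
#endif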
I also question this de-coupling in general (I may be missing something
however!) - the watermark code _very explicitly_ refers to providing
_pageblocks_ in order to ensure _defragmentation_, right?
We would need to absolutely justify why it's suddenly ok to not provide
page blocks here.
This is very very delicate code we have to be SO careful about.
This is why I am being cautious here :)
>
> Best Regards,
> Yan, Zi
Thanks!