Message-ID: <f980e652-8e2a-41da-af9b-39fdd439fefc@lucifer.local>
Date: Mon, 9 Jun 2025 20:40:36 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Zi Yan <ziy@...dia.com>
Cc: Usama Arif <usamaarif642@...il.com>, david@...hat.com,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com,
npache@...hat.com, ryan.roberts@....com, dev.jain@....com,
hughd@...gle.com, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, kernel-team@...a.com,
Juan Yescas <jyescas@...gle.com>, Breno Leitao <leitao@...ian.org>
Subject: Re: [RFC] mm: khugepaged: use largest enabled hugepage order for
min_free_kbytes
On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>
> > On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
> >> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
> >>
> >>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
> >
> > [snip]
> >
> >>>> So I guess the question is what should be the next step? The following has been discussed:
> >>>>
> >>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
> >>>> and might have unintended consequences if done at runtime, so a no-go?
> >>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
> >>>> The decoupling can be done separately. Watermark calculation can be decoupled using the
> >>>> approach taken in this RFC. Although the max order used by the pagecache needs to be addressed.
> >>>>
> >>>
> >>> I need to catch up with the thread (workload crazy atm), but why isn't it
> >>> feasible to simply statically adjust the pageblock size?
> >>>
> >>> The whole point of 'defragmentation' is to _heuristically_ make it less
> >>> likely there'll be fragmentation when requesting page blocks.
> >>>
> >>> And the watermark code is explicitly about providing reserves at a
> >>> _pageblock granularity_.
> >>>
> >>> Why would we want to 'defragment' into 512MB physically contiguous chunks
> >>> that we rarely use?
> >>>
> >>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
> >>> level no?
> >>
> >> What is a sensible level? 2MB is a good starting point. If we cap pageblock
> >> size at 2MB, everyone should be happy at the moment. But if a user wants to
> >> allocate 4MB mTHP, they will most likely fail miserably: because the pageblock
> >> is 2MB, the kernel is OK with having a 2MB MIGRATE_MOVABLE pageblock next to a
> >> 2MB MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
> >>
> >> Defragmentation has two components: 1) the pageblock, which has migratetypes
> >> to prevent mixing movable and unmovable pages, as a single unmovable page
> >> blocks large free pages from being created; 2) memory compaction granularity,
> >> which is the actual work of moving pages around to form a large free page.
> >> Currently, the kernel assumes pageblock size = defragmentation granularity,
> >> but in reality, as long as pageblock size >= defragmentation granularity,
> >> memory compaction still works, but not the other way around. So we need to
> >> choose the pageblock size carefully so as not to break memory compaction.
> >
> > OK I get it - the issue is that compaction itself operates at a pageblock
> > granularity, and once you get so fragmented that compaction is critical to
> > defragmentation, you are stuck if the pageblock is not big enough.
>
> Right.
>
> >
> > Thing is, a 512MB pageblock size for compaction seems insanely inefficient in
> > itself, and if we're complaining about unavailable reserved memory due to a
> > crazy PMD size, surely one will also encounter the compaction process simply
> > failing to succeed/taking forever/causing issues with reclaim/higher-order
> > folio allocation.
>
> Yep. Initially, we probably never thought PMD THP would be as large as
> 512MB.
Of course, such is the 'organic' nature of kernel development :)
>
> >
> > I mean, I don't really know the compaction code _at all_ (ran out of time
> > to cover in book ;), but is it all-or-nothing? Does it grab a pageblock or
> > give up?
>
> Compaction works on one pageblock at a time, migrating in-use pages out of
> the pageblock to create a free page for THP allocation.
> It assumes the PMD THP size is equal to the pageblock size and keeps working
> until a PMD-THP-sized free page is created. This is a very high level
> description, omitting a lot of details such as how to avoid excessive
> compaction work and how to reduce compaction latency.
Yeah this matches my assumptions.
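Just to make sure I have the shape of it right, something like the below - a
heavily simplified sketch, emphatically not the real mm/compaction.c logic,
with migrate_block_contents() and zone_has_free_page_of_order() being made-up
helper names purely for illustration:

/*
 * Heavily simplified sketch of per-pageblock compaction, not the real
 * mm/compaction.c code; the two helpers below are hypothetical.
 */
static bool compact_zone_sketch(struct zone *zone, unsigned int order)
{
	unsigned long pfn;

	/* Walk the zone one pageblock at a time. */
	for (pfn = zone->zone_start_pfn; pfn < zone_end_pfn(zone);
	     pfn += pageblock_nr_pages) {
		/* Migrate in-use movable pages out of this pageblock. */
		migrate_block_contents(zone, pfn);

		/* Stop once a free page of the requested order exists. */
		if (zone_has_free_page_of_order(zone, order))
			return true;
	}

	return false;
}

So with pageblock size == PMD size, 'order' is effectively HPAGE_PMD_ORDER,
and at 512MB each step of that walk is trying to empty half a gigabyte.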
>
> >
> > Because it strikes me that a crazy pageblock size would cause really
> > serious system issues on that basis alone if that's the case.
> >
> > And again this leads me back to thinking it should just be the page block
> > size _as a whole_ that should be adjusted.
> >
> > Keep in mind a user can literally reduce the page block size already via
> > CONFIG_PAGE_BLOCK_MAX_ORDER.
> >
> > To me it seems that we should cap it at the highest _reasonable_ mTHP size
> > you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
> > system.
> >
> > That way, people _can still get_ super huge PMD sized huge folios up to the
> > point of fragmentation.
> >
> > If we do reduce things this way we should give a config option so that
> > users who truly want colossal PMD sizes, with the associated
> > watermarks/compaction, can still have them.
> >
> > CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>
> I agree with capping pageblock size at the highest reasonable mTHP size.
> In case there is some user relying on this huge PMD THP, making the
> pageblock size a boot time variable might be a little better, since
> they would not need to recompile the kernel for their needs, assuming
> distros pick something like 2MB as the default pageblock size.
Right, this seems sensible, as long as we set a _default_ that limits it to
whatever it would be, 2MB or such.
I don't think it's unreasonable to make that change since this 512 MB thing
is so entirely unexpected and unusual.
I think Usama said it would be a pain for it to work this way if it had to be
explicitly set as a boot time variable without defaulting like this.
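Something like a hypothetical boot parameter with a compiled-in default would
presumably keep both camps happy - a rough sketch only, where the parameter
name, PAGE_BLOCK_DEFAULT_ORDER and how the override actually feeds into
pageblock_order are all made up for illustration:

/*
 * Sketch only: 'pageblock_order=' is a hypothetical parameter name and
 * PAGE_BLOCK_DEFAULT_ORDER a placeholder for the compiled-in default.
 */
static unsigned int pageblock_order_override __initdata = PAGE_BLOCK_DEFAULT_ORDER;

static int __init set_pageblock_order_param(char *str)
{
	unsigned int order;

	if (kstrtouint(str, 0, &order))
		return -EINVAL;

	/* Never allow more than the allocator can actually manage. */
	pageblock_order_override = min_t(unsigned int, order, MAX_PAGE_ORDER);
	return 0;
}
early_param("pageblock_order", set_pageblock_order_param);

That way distros ship a sane 2MB-ish default and the few people who genuinely
want 512MB pageblocks can opt back in on the command line.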
>
> >
> > I also question this de-coupling in general (I may be missing something
> > however!) - the watermark code _very explicitly_ refers to providing
> > _pageblocks_ in order to ensure _defragmentation_ right?
>
> Yes. Without enough free memory (more than a PMD THP's worth),
> memory compaction will just do useless work.
Yeah right, so this is a key thing and why we need to rework the current
state of the patch.
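Back-of-envelope, assuming the reserve continues to scale roughly linearly
with pageblock_nr_pages as it does today:

	pageblock = 2MB:   a handful of pageblocks reserved per zone
	                   -> tens of MB held back
	pageblock = 512MB: the same handful of pageblocks, each 256x bigger
	                   -> gigabytes held back per zone, until whatever
	                      percentage-of-memory cap kicks in

Which is precisely the 'unusable free memory' complaint that kicked this
whole thread off.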
>
> >
> > We would need to absolutely justify why it's suddenly ok to not provide
> > page blocks here.
> >
> > This is very very delicate code we have to be SO careful about.
> >
> > This is why I am being cautious here :)
>
> Understood. In theory, we can associate watermarks with the allowed THP orders
> the other way around too, meaning that if the user lowers vm.min_free_kbytes,
> all THP/mTHP sizes bigger than the watermark threshold are disabled
> automatically. This could fix the memory compaction issues, but
> it might also drive users crazy, as they cannot use the THP sizes
> they want.
Yeah that's interesting but I think that's just far too subtle and people will
have no idea what's going on.
I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
go (but overrideable for people crazy enough to truly want 512 MB pages - and
who cannot then complain about watermarks).
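In kernel terms, very roughly the below - purely illustrative, with
CONFIG_PAGE_BLOCK_HARD_LIMIT_MB being the hypothetical option I floated above
and the helper name made up:

/*
 * Illustrative sketch only: clamp the pageblock order so that a single
 * pageblock never exceeds CONFIG_PAGE_BLOCK_HARD_LIMIT_MB (hypothetical).
 */
static inline unsigned int capped_pageblock_order(void)
{
	unsigned long cap_pages =
		CONFIG_PAGE_BLOCK_HARD_LIMIT_MB << (20 - PAGE_SHIFT);

	return min_t(unsigned int, HPAGE_PMD_ORDER, ilog2(cap_pages));
}

With 4K base pages and a 2MB cap that still yields the usual order-9
pageblocks; with 64K base pages the cap is what stops pageblock == PMD size
ballooning to 512MB.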
>
> Often, users just ask for an impossible combination: they
> want to use all free memory, because they paid for it, and they
> want THPs, because they want max performance. When the PMD THP is
> small, like 2MB, the “unusable” free memory is not that noticeable,
> but when the PMD THP is as large as 512MB, users just cannot unsee it. :)
Well, users asking for crazy things then being surprised when they get them
is nothing new :P
>
>
> Best Regards,
> Yan, Zi
Thanks for your input!
Cheers, Lorenzo