Message-ID: <9c1809f0-8c8b-4eb1-9ec8-6c00fe3097f1@lucifer.local>
Date: Wed, 4 Feb 2026 10:56:14 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Zi Yan <ziy@...dia.com>
Cc: David Hildenbrand <david@...nel.org>, Rik van Riel <riel@...riel.com>,
Usama Arif <usamaarif642@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
hannes@...xchg.org, shakeel.butt@...ux.dev, kas@...nel.org,
baohua@...nel.org, dev.jain@....com, baolin.wang@...ux.alibaba.com,
npache@...hat.com, Liam.Howlett@...cle.com, ryan.roberts@....com,
vbabka@...e.cz, lance.yang@...ux.dev, linux-kernel@...r.kernel.org,
kernel-team@...a.com, Frank van der Linden <fvdl@...gle.com>
Subject: Re: [RFC 00/12] mm: PUD (1GB) THP implementation
On Mon, Feb 02, 2026 at 10:50:35AM -0500, Zi Yan wrote:
> On 2 Feb 2026, at 6:30, Lorenzo Stoakes wrote:
>
> > On Sun, Feb 01, 2026 at 09:44:12PM -0500, Rik van Riel wrote:
> >> On Sun, 2026-02-01 at 16:50 -0800, Usama Arif wrote:
> >>>
> >>> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at
> >>> boot or runtime, taking memory away. This requires capacity planning,
> >>> administrative overhead, and makes workload orchestration much, much more
> >>> complex, especially when colocating with workloads that don't use
> >>> hugetlbfs.
> >>>
> >> To address the obvious objection "but how could we
> >> possibly allocate 1GB huge pages while the workload
> >> is running?", I am planning to pick up the CMA balancing
> >> patch series (thank you, Frank) and get that in an
> >> upstream-ready shape soon.
> >>
> >> https://lkml.org/2025/9/15/1735
> >
> > That link doesn't work?
> >
> > Did a quick search for CMA balancing on lore, couldn't find anything, could you
> > provide a lore link?
>
> https://lwn.net/Articles/1038263/
>
> >
> >>
> >> That patch set looks like another case where no
> >> amount of internal testing will find every single
> >> corner case, and we'll probably just want to
> >> merge it upstream, deploy it experimentally, and
> >> aggressively deal with anything that might pop up.
> >
> > I'm not really in favour of this kind of approach. There are plenty of things
> > that were considered 'temporary' upstream that became rather permanent :)
> >
> > Maybe we can't cover all corner-cases, but we need to make sure whatever we do
> > send upstream is maintainable, conceptually sensible and doesn't paint us into
> > any corners, etc.
> >
> >>
> >> With CMA balancing, it would be possible to just
> >> have half (or even more) of system memory for
> >> movable allocations only, which would make it possible
> >> to allocate 1GB huge pages dynamically.
> >
> > Could you expand on that?
>
> I also would like to hear David’s opinion on using CMA for 1GB THP.
> He did not like it[1] when I posted my patch back in 2020, but it has
> been more than 5 years. :)
Yes please David :)
I find the idea of using CMA for this a bit gross. And I fear we're
essentially extending the hacks we have for DAX to everyone.
Again I really feel that we should be tackling technical debt here, rather
than adding features on shaky foundations and just making things worse.
We are inundated with series after series adding features to THP, but not
very many tackling this debt, and I think it's time to get firmer about
that.
>
> The other direction I explored is to get 1GB THP from the buddy allocator.
> That means we need to:
> 1. bump MAX_PAGE_ORDER to 18 or make it a runtime variable so that only 1GB
> THP users need to bump it,
Would we need to bump the pageblock size too, to stand more of a chance of
avoiding fragmentation?
Doing that, though, would result in the reserves being way higher and thus
more memory used, and we'd be in the territory of the unresolved issues with
64KB page size kernels :)
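For anyone who wants the numbers behind "order 18": a trivial userspace
sketch of the arithmetic, assuming 4KiB base pages and the usual x86-64
defaults of MAX_PAGE_ORDER == 10 and pageblock_order == 9 (hard-coded here,
not read from a live kernel):

#include <stdio.h>

/*
 * Assumptions: 4KiB base pages; MAX_PAGE_ORDER of 10 (4MiB max buddy
 * block) and pageblock_order of 9 (2MiB, i.e. the PMD THP size), as on
 * x86-64 today.
 */
#define PAGE_SHIFT      12
#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define PUD_SIZE        (1UL << 30)             /* 1GiB */
#define MAX_PAGE_ORDER  10
#define PAGEBLOCK_ORDER 9

int main(void)
{
        unsigned int order = 0;

        /* Smallest order whose block covers a PUD-sized (1GiB) mapping. */
        while ((PAGE_SIZE << order) < PUD_SIZE)
                order++;

        printf("order needed for a 1GiB folio: %u\n", order);   /* 18 */
        printf("current largest buddy block:   %lu MiB\n",
               (PAGE_SIZE << MAX_PAGE_ORDER) >> 20);             /* 4 */
        printf("current pageblock:             %lu MiB\n",
               (PAGE_SIZE << PAGEBLOCK_ORDER) >> 20);            /* 2 */
        return 0;
}

So the buddy allocator would be managing blocks 256x larger than today's
maximum, and if pageblock followed suit we'd be grouping pages for
anti-fragmentation at 1GiB granularity.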
> 2. handle cross-memory-section PFN merging in the buddy allocator,
Ugh god...
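To spell out why this one scares me: with the usual x86-64 SECTION_SIZE_BITS
of 27 a sparsemem section is 128MiB, so an order-18 block spans eight of
them, and IIRC mmzone.h even has a build-time check that a max-order block
fits within a single section. Quick arithmetic, with those values as
hard-coded assumptions:

#include <stdio.h>

/* Assumed values: 4KiB base pages, x86-64 SECTION_SIZE_BITS of 27. */
#define PAGE_SHIFT              12
#define SECTION_SIZE_BITS       27      /* 128MiB sections */
#define PUD_ORDER               18      /* 1GiB in 4KiB pages */

int main(void)
{
        unsigned long pages_per_section = 1UL << (SECTION_SIZE_BITS - PAGE_SHIFT);
        unsigned long pages_per_pud = 1UL << PUD_ORDER;

        printf("sections spanned by one 1GiB block: %lu\n",
               pages_per_pud / pages_per_section);      /* 8 */
        return 0;
}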
> 3. improve the anti-fragmentation mechanism for 1GB-range compaction.
I think we'd really need something like this. Obviously there's the series
Rik refers to.
I mean CMA itself feels like a hack, though efforts are being made to at
least make it more robust (the series mentioned above, and also the
guaranteed CMA work from Suren).
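For concreteness, 'using CMA for this' would mechanically look something
like the sketch below, which is roughly what the hugetlb_cma= gigantic page
path does today. cma_alloc()/cma_release() are the real interface; the
wrapper names and the (absent) error handling are just for illustration:

#include <linux/cma.h>
#include <linux/mm.h>

/*
 * Illustrative sketch only: carve a naturally aligned 1GiB folio out of
 * a pre-reserved CMA area, and hand it back.  The wrapper names are
 * invented; cma_alloc()/cma_release() are the actual CMA interface.
 */
static struct folio *pud_folio_alloc_cma(struct cma *cma, unsigned int order)
{
        struct page *page;

        /* 2^order pages, aligned to the same order, i.e. a 1GiB boundary. */
        page = cma_alloc(cma, 1UL << order, order, true);
        if (!page)
                return NULL;

        return page_folio(page);
}

static void pud_folio_free_cma(struct cma *cma, struct folio *folio,
                               unsigned int order)
{
        cma_release(cma, &folio->page, 1UL << order);
}

And of course that only works if somebody reserved a large enough CMA area
up front, which is rather the point of contention here.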
>
> 1 is easier-ish[2]. I have not looked into 2 and 3 much yet.
>
> [1] https://lore.kernel.org/all/52bc2d5d-eb8a-83de-1c93-abd329132d58@redhat.com/
> [2] https://lore.kernel.org/all/20210805190253.2795604-1-zi.yan@sent.com/
>
>
> Best Regards,
> Yan, Zi
Cheers, Lorenzo