Message-ID: <aUQgfgWPq4ppMw9r@gourry-fedora-PF4VCD3F>
Date: Thu, 18 Dec 2025 10:40:46 -0500
From: Gregory Price <gourry@...rry.net>
To: "David Hildenbrand (Red Hat)" <david@...nel.org>
Cc: Frank van der Linden <fvdl@...gle.com>,
Johannes Weiner <hannes@...xchg.org>, linux-mm@...ck.org,
kernel-team@...a.com, linux-kernel@...r.kernel.org,
akpm@...ux-foundation.org, vbabka@...e.cz, surenb@...gle.com,
mhocko@...e.com, jackmanb@...gle.com, ziy@...dia.com,
kas@...nel.org, dave.hansen@...ux.intel.com,
rick.p.edgecombe@...el.com, muchun.song@...ux.dev,
osalvador@...e.de, x86@...nel.org, linux-coco@...ts.linux.dev,
kvm@...r.kernel.org, Wei Yang <richard.weiyang@...il.com>,
David Rientjes <rientjes@...gle.com>,
Joshua Hahn <joshua.hahnjy@...il.com>
Subject: Re: [PATCH v4] page_alloc: allow migration of smaller hugepages
during contig_alloc

On Wed, Dec 03, 2025 at 08:43:29PM +0100, David Hildenbrand (Red Hat) wrote:
> > Yeah, the function itself makes sense: "check if this is actually a
> > contiguous range available within this zone, so no holes and/or
> > reserved pages".
> >
> > The PageHuge() check seems a bit out of place there, if you just
> > removed it altogether you'd get the same results, right? The isolation
> > code will deal with it. But sure, it does potentially avoid doing some
> > unnecessary work.
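
(For reference, the check in question is the PageHuge() test in
pfn_range_valid_contig(). Roughly the following -- paraphrased from
mm/page_alloc.c rather than copied verbatim, so treat it as a sketch:)

static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
				   unsigned long nr_pages)
{
	unsigned long i, end_pfn = start_pfn + nr_pages;
	struct page *page;

	for (i = start_pfn; i < end_pfn; i++) {
		page = pfn_to_online_page(i);
		if (!page)
			return false;		/* hole in the range */

		if (page_zone(page) != z)
			return false;		/* range crosses zones */

		if (PageReserved(page))
			return false;		/* reserved page */

		if (PageHuge(page))
			return false;		/* the check being discussed */
	}
	return true;
}
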
In a separate discussion, Johannes also noted that this allocation code
is the right place for this check, since you might want to move a 1GB
page if you're trying to reserve a specific region of memory.

So this much I'm confident in now. But going back to Mel's comment:
>
> commit 4d73ba5fa710fe7d432e0b271e6fecd252aef66e
> Author: Mel Gorman <mgorman@...hsingularity.net>
> Date: Fri Apr 14 15:14:29 2023 +0100
>
> mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
>
> A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
> taking an excessive amount of time for large amounts of memory. Further
> testing allocating huge pages that the cost is linear i.e. if allocating
> 1G pages in batches of 10 then the time to allocate nr_hugepages from
> 10->20->30->etc increases linearly even though 10 pages are allocated at
> each step. Profiles indicated that much of the time is spent checking the
> validity within already existing huge pages and then attempting a
> migration that fails after isolating the range, draining pages and a whole
> lot of other useless work.
>
> Commit eb14d4eefdc4 ("mm,page_alloc: drop unnecessary checks from
> pfn_range_valid_contig") removed two checks, one which ignored huge pages
> for contiguous allocations as huge pages can sometimes migrate. While
> there may be value on migrating a 2M page to satisfy a 1G allocation, it's
> potentially expensive if the 1G allocation fails and it's pointless to try
> moving a 1G page for a new 1G allocation or scan the tail pages for valid
> PFNs.
>
> Reintroduce the PageHuge check and assume any contiguous region with
> hugetlbfs pages is unsuitable for a new 1G allocation.
>

Mel is pointing out that scanning candidate regions that contain 2MB
hugetlb pages can cause 1GB page allocation to take a very long time,
specifically when no free 2MB pages are available as migration targets.
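
To spell out where the time goes: every candidate range that passes
pfn_range_valid_contig() is handed to the full isolate/drain/migrate
pipeline, and a failed migration only surfaces after all of that work has
been done. Roughly the search loop in alloc_contig_pages() -- simplified
from memory (locking omitted), not exact upstream code:

	for_each_zone_zonelist_nodemask(zone, z, zonelist,
					gfp_zone(gfp_mask), nodemask) {
		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
		while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
			if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
				/*
				 * alloc_contig_range() isolates the range,
				 * drains per-cpu lists and tries to migrate
				 * everything in it; if a hugetlb page in the
				 * range can't be moved, all of that work is
				 * wasted and we fall through to the next
				 * candidate range.
				 */
				ret = __alloc_contig_pages(pfn, nr_pages, gfp_mask);
				if (!ret)
					return pfn_to_page(pfn);
			}
			pfn += nr_pages;
		}
	}

So with the PageHuge() filter removed, a zone full of in-use 2MB hugetlb
pages means we pay that cost for every aligned candidate range before the
1GB request finally fails.
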
Joshua's test at least demonstrates that if the pages are reserved, the
migration code will move those reservations around accordingly. Now that
I look at it, though, it's unclear whether his test covers the case where
those pages are reserved AND actually allocated.

I would presume we would end up in the position Mel describes (where
migrations fail and allocation takes a long time). That does seem
problematic unless we can reserve a new 2MB page outside the current
region and destroy the old one.

This at least would not cause a recursive call into this path, since only
the gigantic page reservation interface reaches it.

So I'm at a bit of an impasse. I understand the performance issue here,
but being able to reliably allocate gigantic pages when a ton of 2MB
pages are already being used is also really nice.

Maybe we could do a first-pass / second-pass attempt, where on the first
go we filter out any range containing a PageHuge() page, and on the
second go we only filter out ranges containing hugepages at least as
large as the requested allocation?
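
Something like this -- the extra parameter name below is made up, purely
to illustrate the two-pass idea:

static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
				   unsigned long nr_pages,
				   bool allow_smaller_hugepages)
{
	unsigned long i, end_pfn = start_pfn + nr_pages;
	struct page *page;

	for (i = start_pfn; i < end_pfn; i++) {
		page = pfn_to_online_page(i);
		if (!page || page_zone(page) != z || PageReserved(page))
			return false;

		if (PageHuge(page)) {
			/* first pass: any hugetlb page disqualifies the range */
			if (!allow_smaller_hugepages)
				return false;
			/*
			 * second pass: only hugetlb pages at least as large as
			 * the requested allocation disqualify it
			 */
			if (folio_nr_pages(page_folio(page)) >= nr_pages)
				return false;
		}
	}
	return true;
}

The caller would try every candidate range with the strict check first
(today's behaviour), and only rescan with allow_smaller_hugepages=true if
that comes up empty, so the expensive migration attempts are confined to
the fallback pass.
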
~Gregory