Message-ID: <48E7570E-0C7A-4082-902E-52984A7781EB@nvidia.com>
Date: Mon, 12 Jan 2026 13:58:50 -0500
From: Zi Yan <ziy@...dia.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...nel.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Brendan Jackman <jackmanb@...gle.com>, Johannes Weiner <hannes@...xchg.org>,
Uladzislau Rezki <urezki@...il.com>,
"Vishal Moola (Oracle)" <vishal.moola@...il.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Jiaqi Yan <jiaqiyan@...gle.com>
Subject: Re: [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range()
On 12 Jan 2026, at 13:21, Ryan Roberts wrote:
> On 12/01/2026 15:57, Zi Yan wrote:
>> On 12 Jan 2026, at 8:24, Ryan Roberts wrote:
>>
>>> Hi Zi,
>>>
>>> Sorry for slow response - I had some other high priority stuff come in...
>>>
>>>
>>> On 07/01/2026 03:32, Zi Yan wrote:
>>>> On 5 Jan 2026, at 12:31, Ryan Roberts wrote:
>>>>
>>>>> On 05/01/2026 17:15, Zi Yan wrote:
>>>>>> On 5 Jan 2026, at 11:17, Ryan Roberts wrote:
>>>>>>
>>>>>>> Decompose the range of order-0 pages to be freed into the set of largest
>>>>>>> possible power-of-2 sized and aligned chunks and free them to the pcp or
>>>>>>> buddy. This improves on the previous approach which freed each order-0
>>>>>>> page individually in a loop. Testing shows performance to be improved by
>>>>>>> more than 10x in some cases.
>>>>>>>
>>>>>>> Since each page is order-0, we must decrement each page's reference
>>>>>>> count individually and only consider the page for freeing as part of a
>>>>>>> high order chunk if the reference count goes to zero. Additionally
>>>>>>> free_pages_prepare() must be called for each individual order-0 page
>>>>>>> too, so that the struct page state and global accounting state can be
>>>>>>> appropriately managed. But once this is done, the resulting high order
>>>>>>> chunks can be freed as a unit to the pcp or buddy.
>>>>>>>
>>>>>>> This significantly speeds up the free operation but also has the side
>>>>>>> benefit that high order blocks are added to the pcp instead of each page
>>>>>>> ending up on the pcp order-0 list; memory remains more readily available
>>>>>>> in high orders.
>>>>>>>
>>>>>>> vmalloc will shortly become a user of this new optimized
>>>>>>> free_contig_range() since it aggressively allocates high order
>>>>>>> non-compound pages, but then calls split_page() to end up with
>>>>>>> contiguous order-0 pages. These can now be freed much more efficiently.
>>>>>>>
>>>>>>> The execution time of the following function was measured in a VM on an
>>>>>>> Apple M2 system:
>>>>>>>
>>>>>>> static int page_alloc_high_ordr_test(void)
>>>>>>> {
>>>>>>> 	unsigned int order = HPAGE_PMD_ORDER;
>>>>>>> 	struct page *page;
>>>>>>> 	int i;
>>>>>>>
>>>>>>> 	for (i = 0; i < 100000; i++) {
>>>>>>> 		page = alloc_pages(GFP_KERNEL, order);
>>>>>>> 		if (!page)
>>>>>>> 			return -1;
>>>>>>> 		split_page(page, order);
>>>>>>> 		free_contig_range(page_to_pfn(page), 1UL << order);
>>>>>>> 	}
>>>>>>>
>>>>>>> 	return 0;
>>>>>>> }
>>>>>>>
>>>>>>> Execution time before: 1684366 usec
>>>>>>> Execution time after: 136216 usec
>>>>>>>
>>>>>>> Perf trace before:
>>>>>>>
>>>>>>> 60.93% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
>>>>>>> |
>>>>>>> ---ret_from_fork
>>>>>>> kthread
>>>>>>> 0xffffbba283e63980
>>>>>>> |
>>>>>>> |--60.01%--0xffffbba283e636dc
>>>>>>> | |
>>>>>>> | |--58.57%--free_contig_range
>>>>>>> | | |
>>>>>>> | | |--57.19%--___free_pages
>>>>>>> | | | |
>>>>>>> | | | |--46.65%--__free_frozen_pages
>>>>>>> | | | | |
>>>>>>> | | | | |--28.08%--free_pcppages_bulk
>>>>>>> | | | | |
>>>>>>> | | | | --12.05%--free_frozen_page_commit.constprop.0
>>>>>>> | | | |
>>>>>>> | | | |--5.10%--__get_pfnblock_flags_mask.isra.0
>>>>>>> | | | |
>>>>>>> | | | |--1.13%--_raw_spin_unlock
>>>>>>> | | | |
>>>>>>> | | | |--0.78%--free_frozen_page_commit.constprop.0
>>>>>>> | | | |
>>>>>>> | | | --0.75%--_raw_spin_trylock
>>>>>>> | | |
>>>>>>> | | --0.95%--__free_frozen_pages
>>>>>>> | |
>>>>>>> | --1.44%--___free_pages
>>>>>>> |
>>>>>>> --0.78%--0xffffbba283e636c0
>>>>>>> split_page
>>>>>>>
>>>>>>> Perf trace after:
>>>>>>>
>>>>>>> 10.62% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
>>>>>>> |
>>>>>>> ---ret_from_fork
>>>>>>> kthread
>>>>>>> 0xffffbbd55ef74980
>>>>>>> |
>>>>>>> |--8.74%--0xffffbbd55ef746dc
>>>>>>> | free_contig_range
>>>>>>> | |
>>>>>>> | --8.72%--__free_contig_range
>>>>>>> |
>>>>>>> --1.56%--0xffffbbd55ef746c0
>>>>>>> |
>>>>>>> --1.54%--split_page
>>>>>>>
>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
>>>>>>> ---
>>>>>>> include/linux/gfp.h | 1 +
>>>>>>> mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++-----
>>>>>>> 2 files changed, 106 insertions(+), 11 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>>>>>> index b155929af5b1..3ed0bef34d0c 100644
>>>>>>> --- a/include/linux/gfp.h
>>>>>>> +++ b/include/linux/gfp.h
>>>>>>> @@ -439,6 +439,7 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
>>>>>>> #define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
>>>>>>>
>>>>>>> #endif
>>>>>>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>>>>>> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>>>>>>
>>>>>>> #ifdef CONFIG_CONTIG_ALLOC
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index a045d728ae0f..1015c8edf8a4 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -91,6 +91,9 @@ typedef int __bitwise fpi_t;
>>>>>>> /* Free the page without taking locks. Rely on trylock only. */
>>>>>>> #define FPI_TRYLOCK ((__force fpi_t)BIT(2))
>>>>>>>
>>>>>>> +/* free_pages_prepare() has already been called for page(s) being freed. */
>>>>>>> +#define FPI_PREPARED ((__force fpi_t)BIT(3))
>>>>>>> +
>>>>>>> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>>>>>>> static DEFINE_MUTEX(pcp_batch_high_lock);
>>>>>>> #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
>>>>>>> @@ -1582,8 +1585,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>>>>>>>  	unsigned long pfn = page_to_pfn(page);
>>>>>>>  	struct zone *zone = page_zone(page);
>>>>>>>
>>>>>>> -	if (free_pages_prepare(page, order))
>>>>>>> -		free_one_page(zone, page, pfn, order, fpi_flags);
>>>>>>> +	if (!(fpi_flags & FPI_PREPARED)) {
>>>>>>> +		if (!free_pages_prepare(page, order))
>>>>>>> +			return;
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	free_one_page(zone, page, pfn, order, fpi_flags);
>>>>>>>  }
>>>>>>>
>>>>>>> void __meminit __free_pages_core(struct page *page, unsigned int order,
>>>>>>> @@ -2943,8 +2950,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
>>>>>>>  		return;
>>>>>>>  	}
>>>>>>>
>>>>>>> -	if (!free_pages_prepare(page, order))
>>>>>>> -		return;
>>>>>>> +	if (!(fpi_flags & FPI_PREPARED)) {
>>>>>>> +		if (!free_pages_prepare(page, order))
>>>>>>> +			return;
>>>>>>> +	}
>>>>>>>
>>>>>>>  	/*
>>>>>>>  	 * We only track unmovable, reclaimable and movable on pcp lists.
>>>>>>> @@ -7250,9 +7259,99 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
>>>>>>> }
>>>>>>> #endif /* CONFIG_CONTIG_ALLOC */
>>>>>>>
>>>>>>> +static void free_prepared_contig_range(struct page *page,
>>>>>>> +				       unsigned long nr_pages)
>>>>>>> +{
>>>>>>> +	while (nr_pages) {
>>>>>>> +		unsigned int fit_order, align_order, order;
>>>>>>> +		unsigned long pfn;
>>>>>>> +
>>>>>>> +		/*
>>>>>>> +		 * Find the largest aligned power-of-2 number of pages that
>>>>>>> +		 * starts at the current page, does not exceed nr_pages and is
>>>>>>> +		 * less than or equal to pageblock_order.
>>>>>>> +		 */
>>>>>>> +		pfn = page_to_pfn(page);
>>>>>>> +		fit_order = ilog2(nr_pages);
>>>>>>> +		align_order = pfn ? __ffs(pfn) : fit_order;
>>>>>>> +		order = min3(fit_order, align_order, pageblock_order);
>>>>>>> +
>>>>>>> +		/*
>>>>>>> +		 * Free the chunk as a single block. Our caller has already
>>>>>>> +		 * called free_pages_prepare() for each order-0 page.
>>>>>>> +		 */
>>>>>>> +		__free_frozen_pages(page, order, FPI_PREPARED);
>>>>>>> +
>>>>>>> +		page += 1UL << order;
>>>>>>> +		nr_pages -= 1UL << order;
>>>>>>> +	}
>>>>>>> +}
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * __free_contig_range - Free contiguous range of order-0 pages.
>>>>>>> + * @pfn: Page frame number of the first page in the range.
>>>>>>> + * @nr_pages: Number of pages to free.
>>>>>>> + *
>>>>>>> + * For each order-0 struct page in the physically contiguous range, put a
>>>>>>> + * reference. Free any page whose reference count falls to zero. The
>>>>>>> + * implementation is functionally equivalent to, but significantly faster than
>>>>>>> + * calling __free_page() for each struct page in a loop.
>>>>>>> + *
>>>>>>> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
>>>>>>> + * order-0 with split_page() is an example of appropriate contiguous pages that
>>>>>>> + * can be freed with this API.
>>>>>>> + *
>>>>>>> + * Returns the number of pages which were not freed, because their reference
>>>>>>> + * count did not fall to zero.
>>>>>>> + *
>>>>>>> + * Context: May be called in interrupt context or while holding a normal
>>>>>>> + * spinlock, but not in NMI context or while holding a raw spinlock.
>>>>>>> + */
>>>>>>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
>>>>>>> +{
>>>>>>> +	struct page *page = pfn_to_page(pfn);
>>>>>>> +	unsigned long not_freed = 0;
>>>>>>> +	struct page *start = NULL;
>>>>>>> +	unsigned long i;
>>>>>>> +	bool can_free;
>>>>>>> +
>>>>>>> +	/*
>>>>>>> +	 * Chunk the range into contiguous runs of pages for which the refcount
>>>>>>> +	 * went to zero and for which free_pages_prepare() succeeded. If
>>>>>>> +	 * free_pages_prepare() fails we consider the page to have been freed
>>>>>>> +	 * and deliberately leak it.
>>>>>>> +	 *
>>>>>>> +	 * Code assumes contiguous PFNs have contiguous struct pages, but not
>>>>>>> +	 * vice versa.
>>>>>>> +	 */
>>>>>>> +	for (i = 0; i < nr_pages; i++, page++) {
>>>>>>> +		VM_BUG_ON_PAGE(PageHead(page), page);
>>>>>>> +		VM_BUG_ON_PAGE(PageTail(page), page);
>>>>>>> +
>>>>>>> +		can_free = put_page_testzero(page);
>>>>>>> +		if (!can_free)
>>>>>>> +			not_freed++;
>>>>>>> +		else if (!free_pages_prepare(page, 0))
>>>>>>> +			can_free = false;
>>>>>>
>>>>>> I understand you use free_pages_prepare() here to catch early failures.
>>>>>> I wonder if we could let __free_frozen_pages() handle the failure of
>>>>>> non-compound >0 order pages instead of a new FPI flag.
>>>>>
>>>>> I'm not sure I follow. You would still need to provide a flag to
>>>>> __free_frozen_pages() to tell it "this is a set of order-0 pages". Otherwise it
>>>>> will treat it as a non-compound high order page, which would be wrong;
>>>>> free_pages_prepare() would only be called for the head page (with the order
>>>>> passed in) and that won't do the right thing.
>>>>>
>>>>> I guess you could pass the flag all the way to free_pages_prepare(), then it
>>>>> could be modified to do the right thing for contiguous order-0 pages; that would
>>>>> probably ultimately be more efficient than calling free_pages_prepare() for
>>>>> every order-0 page. Is that what you are suggesting?
>>>>
>>>> Yes. I mistakenly mixed up non-compound high order page and a set of order-0
>>>> pages. There is alloc_pages_bulk() to get a list of order-0 pages, but
>>>> free_pages_bulk() does not exist. Maybe that is what we need here?
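
For the record, the interface I was imagining would simply mirror
alloc_pages_bulk()'s page-array style. A minimal, unoptimized sketch, with the
name and return convention made up:

unsigned long free_pages_bulk(struct page **page_array, unsigned long nr_pages)
{
	unsigned long i, freed = 0;

	for (i = 0; i < nr_pages; i++) {
		if (!page_array[i])
			continue;
		/* order-0 pages only; drops one reference per page */
		__free_page(page_array[i]);
		page_array[i] = NULL;
		freed++;
	}

	return freed;
}

This of course frees page by page and cannot exploit contiguity, which is
exactly the problem you describe below.
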
>>>
>>> This is what I initially started with; vmalloc maintains a list of struct pages
>>> so why not just pass that list to something like free_pages_bulk()? The problem
>>> is that the optimization relies on a *contiguous* set of order-0 pages. A list
>>> of pointers to pages does not imply they must be contiguous. So I don't think
>>> it's the right API.
>>>
>>> We already have free_contig_range() which takes a range of PFNs. That's the
>>> exact semantic we can optimize, so surely that's the better API style? Hence I
>>> added __free_contig_range() and reimplemented free_contig_range() (which does
>>> more than vfree() wants) on top.
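
So free_contig_range() presumably ends up as a thin wrapper on top; roughly
the below, going by the kernel-doc in the patch (just my sketch, not
necessarily your exact code):

void free_contig_range(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long not_freed = __free_contig_range(pfn, nr_pages);

	/* Existing callers expect to own every page in the range. */
	WARN(not_freed != 0, "%lu pages are still in use!\n", not_freed);
}

That keeps the WARN for existing callers, with vfree presumably using
__free_contig_range() directly.
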
>>>
>>>> Using __free_frozen_pages() for a set of order-0 pages looks like a
>>>> shoehorning.
>>>
>>> Perhaps; that's an internal function so could rename to __free_frozen_order(),
>>> passing in a PFN and an order and teach it to handle all 3 cases internally:
>>>
>>> - high-order non-compound page (already supports)
>>> - high-order compound page (already supports)
>>> - power-of-2 sized and aligned number of contiguous order-0 pages (new)
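
Just to check I am reading that right, roughly a single entry point along
these lines? The flag name and the structure below are made up for
illustration, and the failure handling is over-simplified:

static void __free_frozen_order(struct page *page, unsigned int order,
				fpi_t fpi_flags)
{
	unsigned long i;

	if (!(fpi_flags & FPI_ORDER0_RANGE)) {
		/* cases 1 and 2: high-order compound or non-compound page */
		__free_frozen_pages(page, order, fpi_flags);
		return;
	}

	/* case 3: 1 << order contiguous order-0 pages, refs already at zero */
	for (i = 0; i < (1UL << order); i++) {
		/* per-page struct page state and accounting */
		if (!free_pages_prepare(page + i, 0))
			return;	/* over-simplified: leaks the remainder */
	}

	/* hand the whole aligned chunk to the pcp/buddy in one go */
	__free_frozen_pages(page, order, fpi_flags | FPI_PREPARED);
}
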
>>>
>>>> I admit that adding free_pages_bulk() with maximal code
>>>> reuse and a good interface will take some effort, so it probably is a long
>>>> term goal. free_pages_bulk() is also slightly different from what you
>>>> want to do, since, if it uses same interface as alloc_pages_bulk(),
>>>> it will need to accept a page array instead of page + order.
>>>
>>> Yeah; I don't think that's a good API for this case. We would spend more effort
>>> looking for contiguous pages when there are none (for the vfree case it's
>>> highly likely that they are contiguous because that's how they were allocated).
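
Agreed. A page-array API would first have to rediscover the contiguous runs
itself, i.e. something like the sketch below (function name made up), which is
exactly the extra work you mention:

static void free_pages_bulk_contig(struct page **pages, unsigned long nr_pages)
{
	unsigned long i = 0, j;

	while (i < nr_pages) {
		/* grow the run while the PFNs stay consecutive */
		for (j = i + 1; j < nr_pages; j++) {
			if (page_to_pfn(pages[j]) != page_to_pfn(pages[j - 1]) + 1)
				break;
		}

		/* pages[i..j) are physically contiguous; free them as one range */
		__free_contig_range(page_to_pfn(pages[i]), j - i);
		i = j;
	}
}
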
>>>
>>
>> All above makes sense to me.
>>
>>>>
>>>> I am not suggesting you should do this, but am just thinking out loud.
>>>>
>>>>>
>>>>>>
>>>>>> Looking at free_pages_prepare(), three cases would cause failures:
>>>>>> 1. PageHWPoison(page): the code excludes >0 order pages, so it needs
>>>>>> to be fixed. BTW, Jiaqi Yan has a series trying to tackle it[1].
>>>>>>
>>>>>> 2. uncleared PageNetpp(page): probably need to check every individual
>>>>>> page of this >0 order page and call bad_page() for any violator.
>>>>>>
>>>>>> 3. bad free page: probably need to do it for individual page as well.
>>>>>
>>>>> It's not just handling the failures, it's accounting; e.g.
>>>>> __memcg_kmem_uncharge_page().
>>>>
>>>> Got it. Another idea comes to mind.
>>>>
>>>> Is it doable to
>>>> 1) use put_page_testzero() to bring all pages’ refs to 0,
>>>> 2) unsplit/merge these contiguous order-0 pages back to non-compound
>>>> high order pages,
>>>> 3) free unsplit/merged pages with __free_frozen_pages()?
>>>
>>> Yes, I thought about this approach. I think it would work, but I assumed it
>>> would be extra effort to merge the pages only to then free them. It might be in
>>> the noise though. Do you think this approach is better than what I suggest
>>> above? If so, I'll have a go.
>>
>> Yes, the symmetry also makes it easier to follow. Thanks.
>
> Having taken another look at this, I'm not sure it's the correct approach. We
> would need a bunch of extra checks to ensure the order-0 pages can be safely
> merged together; they need the same page owner, tag and memcg meta data. And we
> need to check stuff like poisoning, etc. Effectively this would turn into all
> the same stuff that free_pages_prepare() already does.

OK, it seems like a lot of work. Thank you for looking into it.
>
> So I don't think merging is the way to go; I'd prefer to either do it as I've
> done it in this version of the series (call free_pages_prepare() per order-0,
> then __free_frozen_pages() for the whole block) or call free_pages_prepare()
> once for the whole block but tell it that it has order-0 pages and let it
> ~iterate over them internally.
>
> Can you be convinced?

Either works. I prefer the latter, but I have no strong opinion on it.
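
To be concrete about what I mean by the latter: something along the lines of
the sketch below, where the helper names are made up and only illustrate that
per-page work stays inside the loop while anything chunk-wide is done once.

/*
 * Hypothetical sketch: a free_pages_prepare() variant that knows it is given
 * (1 << order) independent order-0 pages rather than one high-order page.
 */
static bool free_pages_prepare_order0_range(struct page *page, unsigned int order)
{
	unsigned long nr = 1UL << order;
	unsigned long i;

	for (i = 0; i < nr; i++) {
		/* per-page: bad-page/poison checks, clearing page state, ... */
		if (!prep_free_order0_page(page + i))	/* hypothetical helper */
			return false;
	}

	/* once per chunk: accounting, debug hooks, etc. */
	account_freed_order0_range(page, nr);		/* hypothetical helper */

	return true;
}

__free_contig_range() would then call this once per aligned chunk instead of
calling free_pages_prepare() for every page.
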
Best Regards,
Yan, Zi