Message-ID: <48E7570E-0C7A-4082-902E-52984A7781EB@nvidia.com>
Date: Mon, 12 Jan 2026 13:58:50 -0500
From: Zi Yan <ziy@...dia.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...nel.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Brendan Jackman <jackmanb@...gle.com>, Johannes Weiner <hannes@...xchg.org>,
Uladzislau Rezki <urezki@...il.com>,
"Vishal Moola (Oracle)" <vishal.moola@...il.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Jiaqi Yan <jiaqiyan@...gle.com>
Subject: Re: [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range()
On 12 Jan 2026, at 13:21, Ryan Roberts wrote:
> On 12/01/2026 15:57, Zi Yan wrote:
>> On 12 Jan 2026, at 8:24, Ryan Roberts wrote:
>>
>>> Hi Zi,
>>>
>>> Sorry for slow response - I had some other high priority stuff come in...
>>>
>>>
>>> On 07/01/2026 03:32, Zi Yan wrote:
>>>> On 5 Jan 2026, at 12:31, Ryan Roberts wrote:
>>>>
>>>>> On 05/01/2026 17:15, Zi Yan wrote:
>>>>>> On 5 Jan 2026, at 11:17, Ryan Roberts wrote:
>>>>>>
>>>>>>> Decompose the range of order-0 pages to be freed into the set of largest
>>>>>>> possible power-of-2 sized and aligned chunks and free them to the pcp or
>>>>>>> buddy. This improves on the previous approach which freed each order-0
>>>>>>> page individually in a loop. Testing shows performance to be improved by
>>>>>>> more than 10x in some cases.
>>>>>>>
>>>>>>> Since each page is order-0, we must decrement each page's reference
>>>>>>> count individually and only consider the page for freeing as part of a
>>>>>>> high order chunk if the reference count goes to zero. Additionally
>>>>>>> free_pages_prepare() must be called for each individual order-0 page
>>>>>>> too, so that the struct page state and global accounting state can be
>>>>>>> appropriately managed. But once this is done, the resulting high order
>>>>>>> chunks can be freed as a unit to the pcp or buddy.
>>>>>>>
>>>>>>> This significantly speeds up the free operation but also has the side
>>>>>>> benefit that high order blocks are added to the pcp instead of each page
>>>>>>> ending up on the pcp order-0 list; memory remains more readily available
>>>>>>> in high orders.
>>>>>>>
>>>>>>> vmalloc will shortly become a user of this new optimized
>>>>>>> free_contig_range() since it aggressively allocates high order
>>>>>>> non-compound pages, but then calls split_page() to end up with
>>>>>>> contiguous order-0 pages. These can now be freed much more efficiently.
>>>>>>>
>>>>>>> The execution time of the following function was measured in a VM on an
>>>>>>> Apple M2 system:
>>>>>>>
>>>>>>> static int page_alloc_high_ordr_test(void)
>>>>>>> {
>>>>>>> 	unsigned int order = HPAGE_PMD_ORDER;
>>>>>>> 	struct page *page;
>>>>>>> 	int i;
>>>>>>>
>>>>>>> 	for (i = 0; i < 100000; i++) {
>>>>>>> 		page = alloc_pages(GFP_KERNEL, order);
>>>>>>> 		if (!page)
>>>>>>> 			return -1;
>>>>>>> 		split_page(page, order);
>>>>>>> 		free_contig_range(page_to_pfn(page), 1UL << order);
>>>>>>> 	}
>>>>>>>
>>>>>>> 	return 0;
>>>>>>> }
>>>>>>>
>>>>>>> Execution time before: 1684366 usec
>>>>>>> Execution time after: 136216 usec
>>>>>>>
>>>>>>> Perf trace before:
>>>>>>>
>>>>>>> 60.93% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
>>>>>>> |
>>>>>>> ---ret_from_fork
>>>>>>> kthread
>>>>>>> 0xffffbba283e63980
>>>>>>> |
>>>>>>> |--60.01%--0xffffbba283e636dc
>>>>>>> | |
>>>>>>> | |--58.57%--free_contig_range
>>>>>>> | | |
>>>>>>> | | |--57.19%--___free_pages
>>>>>>> | | | |
>>>>>>> | | | |--46.65%--__free_frozen_pages
>>>>>>> | | | | |
>>>>>>> | | | | |--28.08%--free_pcppages_bulk
>>>>>>> | | | | |
>>>>>>> | | | | --12.05%--free_frozen_page_commit.constprop.0
>>>>>>> | | | |
>>>>>>> | | | |--5.10%--__get_pfnblock_flags_mask.isra.0
>>>>>>> | | | |
>>>>>>> | | | |--1.13%--_raw_spin_unlock
>>>>>>> | | | |
>>>>>>> | | | |--0.78%--free_frozen_page_commit.constprop.0
>>>>>>> | | | |
>>>>>>> | | | --0.75%--_raw_spin_trylock
>>>>>>> | | |
>>>>>>> | | --0.95%--__free_frozen_pages
>>>>>>> | |
>>>>>>> | --1.44%--___free_pages
>>>>>>> |
>>>>>>> --0.78%--0xffffbba283e636c0
>>>>>>> split_page
>>>>>>>
>>>>>>> Perf trace after:
>>>>>>>
>>>>>>> 10.62% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
>>>>>>> |
>>>>>>> ---ret_from_fork
>>>>>>> kthread
>>>>>>> 0xffffbbd55ef74980
>>>>>>> |
>>>>>>> |--8.74%--0xffffbbd55ef746dc
>>>>>>> | free_contig_range
>>>>>>> | |
>>>>>>> | --8.72%--__free_contig_range
>>>>>>> |
>>>>>>> --1.56%--0xffffbbd55ef746c0
>>>>>>> |
>>>>>>> --1.54%--split_page
>>>>>>>
>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
>>>>>>> ---
>>>>>>> include/linux/gfp.h | 1 +
>>>>>>> mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++-----
>>>>>>> 2 files changed, 106 insertions(+), 11 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>>>>>> index b155929af5b1..3ed0bef34d0c 100644
>>>>>>> --- a/include/linux/gfp.h
>>>>>>> +++ b/include/linux/gfp.h
>>>>>>> @@ -439,6 +439,7 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
>>>>>>> #define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
>>>>>>>
>>>>>>> #endif
>>>>>>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>>>>>> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>>>>>>
>>>>>>> #ifdef CONFIG_CONTIG_ALLOC
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index a045d728ae0f..1015c8edf8a4 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -91,6 +91,9 @@ typedef int __bitwise fpi_t;
>>>>>>> /* Free the page without taking locks. Rely on trylock only. */
>>>>>>> #define FPI_TRYLOCK ((__force fpi_t)BIT(2))
>>>>>>>
>>>>>>> +/* free_pages_prepare() has already been called for page(s) being freed. */
>>>>>>> +#define FPI_PREPARED ((__force fpi_t)BIT(3))
>>>>>>> +
>>>>>>> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>>>>>>> static DEFINE_MUTEX(pcp_batch_high_lock);
>>>>>>> #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
>>>>>>> @@ -1582,8 +1585,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>>>>>>>  	unsigned long pfn = page_to_pfn(page);
>>>>>>>  	struct zone *zone = page_zone(page);
>>>>>>>
>>>>>>> -	if (free_pages_prepare(page, order))
>>>>>>> -		free_one_page(zone, page, pfn, order, fpi_flags);
>>>>>>> +	if (!(fpi_flags & FPI_PREPARED)) {
>>>>>>> +		if (!free_pages_prepare(page, order))
>>>>>>> +			return;
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	free_one_page(zone, page, pfn, order, fpi_flags);
>>>>>>>  }
>>>>>>>
>>>>>>> void __meminit __free_pages_core(struct page *page, unsigned int order,
>>>>>>> @@ -2943,8 +2950,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
>>>>>>>  		return;
>>>>>>>  	}
>>>>>>>
>>>>>>> -	if (!free_pages_prepare(page, order))
>>>>>>> -		return;
>>>>>>> +	if (!(fpi_flags & FPI_PREPARED)) {
>>>>>>> +		if (!free_pages_prepare(page, order))
>>>>>>> +			return;
>>>>>>> +	}
>>>>>>>
>>>>>>>  	/*
>>>>>>>  	 * We only track unmovable, reclaimable and movable on pcp lists.
>>>>>>> @@ -7250,9 +7259,99 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
>>>>>>> }
>>>>>>> #endif /* CONFIG_CONTIG_ALLOC */
>>>>>>>
>>>>>>> +static void free_prepared_contig_range(struct page *page,
>>>>>>> +				       unsigned long nr_pages)
>>>>>>> +{
>>>>>>> +	while (nr_pages) {
>>>>>>> +		unsigned int fit_order, align_order, order;
>>>>>>> +		unsigned long pfn;
>>>>>>> +
>>>>>>> +		/*
>>>>>>> +		 * Find the largest aligned power-of-2 number of pages that
>>>>>>> +		 * starts at the current page, does not exceed nr_pages and is
>>>>>>> +		 * less than or equal to pageblock_order.
>>>>>>> +		 */
>>>>>>> +		pfn = page_to_pfn(page);
>>>>>>> +		fit_order = ilog2(nr_pages);
>>>>>>> +		align_order = pfn ? __ffs(pfn) : fit_order;
>>>>>>> +		order = min3(fit_order, align_order, pageblock_order);
>>>>>>> +
>>>>>>> +		/*
>>>>>>> +		 * Free the chunk as a single block. Our caller has already
>>>>>>> +		 * called free_pages_prepare() for each order-0 page.
>>>>>>> +		 */
>>>>>>> +		__free_frozen_pages(page, order, FPI_PREPARED);
>>>>>>> +
>>>>>>> +		page += 1UL << order;
>>>>>>> +		nr_pages -= 1UL << order;
>>>>>>> +	}
>>>>>>> +}
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * __free_contig_range - Free contiguous range of order-0 pages.
>>>>>>> + * @pfn: Page frame number of the first page in the range.
>>>>>>> + * @nr_pages: Number of pages to free.
>>>>>>> + *
>>>>>>> + * For each order-0 struct page in the physically contiguous range, put a
>>>>>>> + * reference. Free any page whose reference count falls to zero. The
>>>>>>> + * implementation is functionally equivalent to, but significantly faster than
>>>>>>> + * calling __free_page() for each struct page in a loop.
>>>>>>> + *
>>>>>>> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
>>>>>>> + * order-0 with split_page() is an example of appropriate contiguous pages that
>>>>>>> + * can be freed with this API.
>>>>>>> + *
>>>>>>> + * Returns the number of pages which were not freed, because their reference
>>>>>>> + * count did not fall to zero.
>>>>>>> + *
>>>>>>> + * Context: May be called in interrupt context or while holding a normal
>>>>>>> + * spinlock, but not in NMI context or while holding a raw spinlock.
>>>>>>> + */
>>>>>>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
>>>>>>> +{
>>>>>>> +	struct page *page = pfn_to_page(pfn);
>>>>>>> +	unsigned long not_freed = 0;
>>>>>>> +	struct page *start = NULL;
>>>>>>> +	unsigned long i;
>>>>>>> +	bool can_free;
>>>>>>> +
>>>>>>> +	/*
>>>>>>> +	 * Chunk the range into contiguous runs of pages for which the refcount
>>>>>>> +	 * went to zero and for which free_pages_prepare() succeeded. If
>>>>>>> +	 * free_pages_prepare() fails we consider the page to have been freed
>>>>>>> +	 * and deliberately leak it.
>>>>>>> +	 *
>>>>>>> +	 * Code assumes contiguous PFNs have contiguous struct pages, but not
>>>>>>> +	 * vice versa.
>>>>>>> +	 */
>>>>>>> +	for (i = 0; i < nr_pages; i++, page++) {
>>>>>>> +		VM_BUG_ON_PAGE(PageHead(page), page);
>>>>>>> +		VM_BUG_ON_PAGE(PageTail(page), page);
>>>>>>> +
>>>>>>> +		can_free = put_page_testzero(page);
>>>>>>> +		if (!can_free)
>>>>>>> +			not_freed++;
>>>>>>> +		else if (!free_pages_prepare(page, 0))
>>>>>>> +			can_free = false;
>>>>>>
>>>>>> I understand you use free_pages_prepare() here to catch early failures.
>>>>>> I wonder if we could let __free_frozen_pages() handle the failure of
>>>>>> non-compound >0 order pages instead of a new FPI flag.
>>>>>
>>>>> I'm not sure I follow. You would still need to provide a flag to
>>>>> __free_frozen_pages() to tell it "this is a set of order-0 pages". Otherwise it
>>>>> will treat it as a non-compound high order page, which would be wrong;
>>>>> free_pages_prepare() would only be called for the head page (with the order
>>>>> passed in) and that won't do the right thing.
>>>>>
>>>>> I guess you could pass the flag all the way to free_pages_prepare(), then it
>>>>> could be modified to do the right thing for contiguous order-0 pages; that would
>>>>> probably ultimately be more efficient than calling free_pages_prepare() for
>>>>> every order-0 page. Is that what you are suggesting?
>>>>
>>>> Yes. I mistakenly mixed up non-compound high order page and a set of order-0
>>>> pages. There is alloc_pages_bulk() to get a list of order-0 pages, but
>>>> free_pages_bulk() does not exist. Maybe that is what we need here?
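
For the record, the interface I was imagining would simply mirror
alloc_pages_bulk()'s page-array style. A minimal, unoptimized sketch, with the
name and return convention made up:

unsigned long free_pages_bulk(struct page **page_array, unsigned long nr_pages)
{
	unsigned long i, freed = 0;

	for (i = 0; i < nr_pages; i++) {
		if (!page_array[i])
			continue;
		/* order-0 pages only; drops one reference per page */
		__free_page(page_array[i]);
		page_array[i] = NULL;
		freed++;
	}

	return freed;
}

This of course frees page by page and cannot exploit contiguity, which is
exactly the problem you describe below.
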
>>>
>>> This is what I initially started with; vmalloc maintains a list of struct pages
>>> so why not just pass that list to something like free_pages_bulk()? The problem
>>> is that the optimization relies on a *contiguous* set of order-0 pages. A list
>>> of pointers to pages does not imply they must be contiguous. So I don't think
>>> it's the right API.
>>>
>>> We already have free_contig_range() which takes a range of PFNs. That's the
>>> exact semantic we can optimize, so surely that's the better API style? Hence I
>>> added __free_contig_range() and reimplemented free_contig_range() (which does
>>> more than vfree() wants) on top.
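
So free_contig_range() presumably ends up as a thin wrapper on top; roughly
the below, going by the kernel-doc in the patch (just my sketch, not
necessarily your exact code):

void free_contig_range(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long not_freed = __free_contig_range(pfn, nr_pages);

	/* Existing callers expect to own every page in the range. */
	WARN(not_freed != 0, "%lu pages are still in use!\n", not_freed);
}

That keeps the WARN for existing callers, with vfree presumably using
__free_contig_range() directly.
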
>>>
>>>> Using __free_frozen_pages() for a set of order-0 pages looks like a
>>>> shoehorning.
>>>
>>> Perhaps; that's an internal function so could rename to __free_frozen_order(),
>>> passing in a PFN and an order and teach it to handle all 3 cases internally:
>>>
>>> - high-order non-compound page (already supports)
>>> - high-order compound page (already supports)
>>> - power-of-2 sized and aligned number of contiguous order-0 pages (new)
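
Just to check I am reading that right, roughly a single entry point along
these lines? The flag name and the structure below are made up for
illustration, and the failure handling is over-simplified:

static void __free_frozen_order(struct page *page, unsigned int order,
				fpi_t fpi_flags)
{
	unsigned long i;

	if (!(fpi_flags & FPI_ORDER0_RANGE)) {
		/* cases 1 and 2: high-order compound or non-compound page */
		__free_frozen_pages(page, order, fpi_flags);
		return;
	}

	/* case 3: 1 << order contiguous order-0 pages, refs already at zero */
	for (i = 0; i < (1UL << order); i++) {
		/* per-page struct page state and accounting */
		if (!free_pages_prepare(page + i, 0))
			return;	/* over-simplified: leaks the remainder */
	}

	/* hand the whole aligned chunk to the pcp/buddy in one go */
	__free_frozen_pages(page, order, fpi_flags | FPI_PREPARED);
}
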
>>>
>>>> I admit that adding free_pages_bulk() with maximal code
>>>> reuse and a good interface will take some effort, so it probably is a long
>>>> term goal. free_pages_bulk() is also slightly different from what you
>>>> want to do, since, if it uses same interface as alloc_pages_bulk(),
>>>> it will need to accept a page array instead of page + order.
>>>
>>> Yeah; I don't think that's a good API for this case. We would spend more effort
>>> looking for contiguous pages when there are none (for the vfree case it's
>>> highly likely that they are contiguous because that's how they were allocated).
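
Agreed. A page-array API would first have to rediscover the contiguous runs
itself, i.e. something like the sketch below (function name made up), which is
exactly the extra work you mention:

static void free_pages_bulk_contig(struct page **pages, unsigned long nr_pages)
{
	unsigned long i = 0, j;

	while (i < nr_pages) {
		/* grow the run while the PFNs stay consecutive */
		for (j = i + 1; j < nr_pages; j++) {
			if (page_to_pfn(pages[j]) != page_to_pfn(pages[j - 1]) + 1)
				break;
		}

		/* pages[i..j) are physically contiguous; free them as one range */
		__free_contig_range(page_to_pfn(pages[i]), j - i);
		i = j;
	}
}
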
>>>
>>
>> All above makes sense to me.
>>
>>>>
>>>> I am not suggesting you should do this, but am just thinking out loud.
>>>>
>>>>>
>>>>>>
>>>>>> Looking at free_pages_prepare(), three cases would cause failures:
>>>>>> 1. PageHWPoison(page): the code excludes >0 order pages, so it needs
>>>>>> to be fixed. BTW, Jiaqi Yan has a series trying to tackle it[1].
>>>>>>
>>>>>> 2. uncleared PageNetpp(page): probably need to check every individual
>>>>>> page of this >0 order page and call bad_page() for any violator.
>>>>>>
>>>>>> 3. bad free page: probably need to do it for individual page as well.
>>>>>
>>>>> It's not just handling the failures, it's accounting; e.g.
>>>>> __memcg_kmem_uncharge_page().
>>>>
>>>> Got it. Another idea comes to mind.
>>>>
>>>> Is it doable to
>>>> 1) use put_page_testzero() to bring all pages’ refs to 0,
>>>> 2) unsplit/merge these contiguous order-0 pages back to non-compound
>>>> high order pages,
>>>> 3) free unsplit/merged pages with __free_frozen_pages()?
>>>
>>> Yes, I thought about this approach. I think it would work, but I assumed it
>>> would be extra effort to merge the pages only to then free them. It might be in
>>> the noise though. Do you think this approach is better than what I suggest
>>> above? If so, I'll have a go.
>>
>> Yes, the symmetry also makes it easier to follow. Thanks.
>
> Having taken another look at this, I'm not sure it's the correct approach. We
> would need a bunch of extra checks to ensure the order-0 pages can be safely
> merged together; they need the same page owner, tag and memcg meta data. And we
> need to check stuff like poisoning, etc. Effectively this would turn into all
> the same stuff that free_pages_prepare() already does.

OK, it seems like a lot of work. Thank you for looking into it.
>
> So I don't think merging is the way to go; I'd prefer to either do it as I've
> done it in this version of the series (call free_pages_prepare() per order-0,
> then __free_frozen_pages() for the whole block) or call free_pages_prepare()
> once for the whole block but tell it that it has order-0 pages and let it
> ~iterate over them internally.
>
> Can you be convinced?

Either works. I prefer the latter, but I have no strong opinion on it.
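
To be concrete about what I mean by the latter: something along the lines of
the sketch below, where the helper names are made up and only illustrate that
per-page work stays inside the loop while anything chunk-wide is done once.

/*
 * Hypothetical sketch: a free_pages_prepare() variant that knows it is given
 * (1 << order) independent order-0 pages rather than one high-order page.
 */
static bool free_pages_prepare_order0_range(struct page *page, unsigned int order)
{
	unsigned long nr = 1UL << order;
	unsigned long i;

	for (i = 0; i < nr; i++) {
		/* per-page: bad-page/poison checks, clearing page state, ... */
		if (!prep_free_order0_page(page + i))	/* hypothetical helper */
			return false;
	}

	/* once per chunk: accounting, debug hooks, etc. */
	account_freed_order0_range(page, nr);		/* hypothetical helper */

	return true;
}

__free_contig_range() would then call this once per aligned chunk instead of
calling free_pages_prepare() for every page.
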
Best Regards,
Yan, Zi