Message-ID: <89dfd2d0-cf28-41c7-be9f-b49963e77aa5@arm.com>
Date: Mon, 12 Jan 2026 13:24:55 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Zi Yan <ziy@...dia.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 David Hildenbrand <david@...nel.org>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka
 <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
 Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
 Brendan Jackman <jackmanb@...gle.com>, Johannes Weiner <hannes@...xchg.org>,
 Uladzislau Rezki <urezki@...il.com>,
 "Vishal Moola (Oracle)" <vishal.moola@...il.com>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, Jiaqi Yan <jiaqiyan@...gle.com>
Subject: Re: [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range()

Hi Zi,

Sorry for the slow response - I had some other high-priority stuff come in...


On 07/01/2026 03:32, Zi Yan wrote:
> On 5 Jan 2026, at 12:31, Ryan Roberts wrote:
> 
>> On 05/01/2026 17:15, Zi Yan wrote:
>>> On 5 Jan 2026, at 11:17, Ryan Roberts wrote:
>>>
>>>> Decompose the range of order-0 pages to be freed into the largest
>>>> possible power-of-2 sized and aligned chunks and free them to the pcp or
>>>> buddy. This improves on the previous approach which freed each order-0
>>>> page individually in a loop. Testing shows performance to be improved by
>>>> more than 10x in some cases.
>>>>
>>>> Since each page is order-0, we must decrement each page's reference
>>>> count individually and only consider the page for freeing as part of a
>>>> high order chunk if the reference count goes to zero. Additionally
>>>> free_pages_prepare() must be called for each individual order-0 page
>>>> too, so that the struct page state and global accounting state can be
>>>> appropriately managed. But once this is done, the resulting high order
>>>> chunks can be freed as a unit to the pcp or buddy.
>>>>
>>>> This significantly speeds up the free operation but also has the side
>>>> benefit that high order blocks are added to the pcp instead of each page
>>>> ending up on the pcp order-0 list; memory remains more readily available
>>>> in high orders.
>>>>
>>>> vmalloc will shortly become a user of this new optimized
>>>> free_contig_range() since it aggressively allocates high order
>>>> non-compound pages, but then calls split_page() to end up with
>>>> contiguous order-0 pages. These can now be freed much more efficiently.
>>>>
>>>> The execution time of the following function was measured in a VM on an
>>>> Apple M2 system:
>>>>
>>>> static int page_alloc_high_ordr_test(void)
>>>> {
>>>> 	unsigned int order = HPAGE_PMD_ORDER;
>>>> 	struct page *page;
>>>> 	int i;
>>>>
>>>> 	for (i = 0; i < 100000; i++) {
>>>> 		page = alloc_pages(GFP_KERNEL, order);
>>>> 		if (!page)
>>>> 			return -1;
>>>> 		split_page(page, order);
>>>> 		free_contig_range(page_to_pfn(page), 1UL << order);
>>>> 	}
>>>>
>>>> 	return 0;
>>>> }
>>>>
>>>> Execution time before: 1684366 usec
>>>> Execution time after:   136216 usec
>>>>
>>>> Perf trace before:
>>>>
>>>>     60.93%     0.00%  kthreadd     [kernel.kallsyms]      [k] ret_from_fork
>>>>             |
>>>>             ---ret_from_fork
>>>>                kthread
>>>>                0xffffbba283e63980
>>>>                |
>>>>                |--60.01%--0xffffbba283e636dc
>>>>                |          |
>>>>                |          |--58.57%--free_contig_range
>>>>                |          |          |
>>>>                |          |          |--57.19%--___free_pages
>>>>                |          |          |          |
>>>>                |          |          |          |--46.65%--__free_frozen_pages
>>>>                |          |          |          |          |
>>>>                |          |          |          |          |--28.08%--free_pcppages_bulk
>>>>                |          |          |          |          |
>>>>                |          |          |          |           --12.05%--free_frozen_page_commit.constprop.0
>>>>                |          |          |          |
>>>>                |          |          |          |--5.10%--__get_pfnblock_flags_mask.isra.0
>>>>                |          |          |          |
>>>>                |          |          |          |--1.13%--_raw_spin_unlock
>>>>                |          |          |          |
>>>>                |          |          |          |--0.78%--free_frozen_page_commit.constprop.0
>>>>                |          |          |          |
>>>>                |          |          |           --0.75%--_raw_spin_trylock
>>>>                |          |          |
>>>>                |          |           --0.95%--__free_frozen_pages
>>>>                |          |
>>>>                |           --1.44%--___free_pages
>>>>                |
>>>>                 --0.78%--0xffffbba283e636c0
>>>>                           split_page
>>>>
>>>> Perf trace after:
>>>>
>>>>     10.62%     0.00%  kthreadd     [kernel.kallsyms]  [k] ret_from_fork
>>>>             |
>>>>             ---ret_from_fork
>>>>                kthread
>>>>                0xffffbbd55ef74980
>>>>                |
>>>>                |--8.74%--0xffffbbd55ef746dc
>>>>                |          free_contig_range
>>>>                |          |
>>>>                |           --8.72%--__free_contig_range
>>>>                |
>>>>                 --1.56%--0xffffbbd55ef746c0
>>>>                           |
>>>>                            --1.54%--split_page
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
>>>> ---
>>>>  include/linux/gfp.h |   1 +
>>>>  mm/page_alloc.c     | 116 +++++++++++++++++++++++++++++++++++++++-----
>>>>  2 files changed, 106 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>>> index b155929af5b1..3ed0bef34d0c 100644
>>>> --- a/include/linux/gfp.h
>>>> +++ b/include/linux/gfp.h
>>>> @@ -439,6 +439,7 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
>>>>  #define alloc_contig_pages(...)			alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
>>>>
>>>>  #endif
>>>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>>>  void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>>>
>>>>  #ifdef CONFIG_CONTIG_ALLOC
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index a045d728ae0f..1015c8edf8a4 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -91,6 +91,9 @@ typedef int __bitwise fpi_t;
>>>>  /* Free the page without taking locks. Rely on trylock only. */
>>>>  #define FPI_TRYLOCK		((__force fpi_t)BIT(2))
>>>>
>>>> +/* free_pages_prepare() has already been called for page(s) being freed. */
>>>> +#define FPI_PREPARED		((__force fpi_t)BIT(3))
>>>> +
>>>>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>>>>  static DEFINE_MUTEX(pcp_batch_high_lock);
>>>>  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
>>>> @@ -1582,8 +1585,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>>>>  	unsigned long pfn = page_to_pfn(page);
>>>>  	struct zone *zone = page_zone(page);
>>>>
>>>> -	if (free_pages_prepare(page, order))
>>>> -		free_one_page(zone, page, pfn, order, fpi_flags);
>>>> +	if (!(fpi_flags & FPI_PREPARED)) {
>>>> +		if (!free_pages_prepare(page, order))
>>>> +			return;
>>>> +	}
>>>> +
>>>> +	free_one_page(zone, page, pfn, order, fpi_flags);
>>>>  }
>>>>
>>>>  void __meminit __free_pages_core(struct page *page, unsigned int order,
>>>> @@ -2943,8 +2950,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
>>>>  		return;
>>>>  	}
>>>>
>>>> -	if (!free_pages_prepare(page, order))
>>>> -		return;
>>>> +	if (!(fpi_flags & FPI_PREPARED)) {
>>>> +		if (!free_pages_prepare(page, order))
>>>> +			return;
>>>> +	}
>>>>
>>>>  	/*
>>>>  	 * We only track unmovable, reclaimable and movable on pcp lists.
>>>> @@ -7250,9 +7259,99 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
>>>>  }
>>>>  #endif /* CONFIG_CONTIG_ALLOC */
>>>>
>>>> +static void free_prepared_contig_range(struct page *page,
>>>> +				       unsigned long nr_pages)
>>>> +{
>>>> +	while (nr_pages) {
>>>> +		unsigned int fit_order, align_order, order;
>>>> +		unsigned long pfn;
>>>> +
>>>> +		/*
>>>> +		 * Find the largest aligned power-of-2 number of pages that
>>>> +		 * starts at the current page, does not exceed nr_pages and whose
>>>> +		 * order is less than or equal to pageblock_order.
>>>> +		 */
>>>> +		pfn = page_to_pfn(page);
>>>> +		fit_order = ilog2(nr_pages);
>>>> +		align_order = pfn ? __ffs(pfn) : fit_order;
>>>> +		order = min3(fit_order, align_order, pageblock_order);
>>>> +
>>>> +		/*
>>>> +		 * Free the chunk as a single block. Our caller has already
>>>> +		 * called free_pages_prepare() for each order-0 page.
>>>> +		 */
>>>> +		__free_frozen_pages(page, order, FPI_PREPARED);
>>>> +
>>>> +		page += 1UL << order;
>>>> +		nr_pages -= 1UL << order;
>>>> +	}
>>>> +}
>>>> +
>>>> +/**
>>>> + * __free_contig_range - Free contiguous range of order-0 pages.
>>>> + * @pfn: Page frame number of the first page in the range.
>>>> + * @nr_pages: Number of pages to free.
>>>> + *
>>>> + * For each order-0 struct page in the physically contiguous range, put a
>>>> + * reference. Free any page whose reference count falls to zero. The
>>>> + * implementation is functionally equivalent to, but significantly faster than,
>>>> + * calling __free_page() for each struct page in a loop.
>>>> + *
>>>> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
>>>> + * order-0 with split_page() is an example of appropriate contiguous pages that
>>>> + * can be freed with this API.
>>>> + *
>>>> + * Returns the number of pages that were not freed because their reference
>>>> + * count did not fall to zero.
>>>> + *
>>>> + * Context: May be called in interrupt context or while holding a normal
>>>> + * spinlock, but not in NMI context or while holding a raw spinlock.
>>>> + */
>>>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
>>>> +{
>>>> +	struct page *page = pfn_to_page(pfn);
>>>> +	unsigned long not_freed = 0;
>>>> +	struct page *start = NULL;
>>>> +	unsigned long i;
>>>> +	bool can_free;
>>>> +
>>>> +	/*
>>>> +	 * Chunk the range into contiguous runs of pages for which the refcount
>>>> +	 * went to zero and for which free_pages_prepare() succeeded. If
>>>> +	 * free_pages_prepare() fails we consider the page to have been freed
>>>> +	 * and deliberately leak it.
>>>> +	 *
>>>> +	 * Code assumes contiguous PFNs have contiguous struct pages, but not
>>>> +	 * vice versa.
>>>> +	 */
>>>> +	for (i = 0; i < nr_pages; i++, page++) {
>>>> +		VM_BUG_ON_PAGE(PageHead(page), page);
>>>> +		VM_BUG_ON_PAGE(PageTail(page), page);
>>>> +
>>>> +		can_free = put_page_testzero(page);
>>>> +		if (!can_free)
>>>> +			not_freed++;
>>>> +		else if (!free_pages_prepare(page, 0))
>>>> +			can_free = false;
>>>
>>> I understand you use free_pages_prepare() here to catch early failures.
>>> I wonder if we could let __free_frozen_pages() handle the failure of
>>> non-compound >0 order pages instead of a new FPI flag.
>>
>> I'm not sure I follow. You would still need to provide a flag to
>> __free_frozen_pages() to tell it "this is a set of order-0 pages". Otherwise it
>> will treat it as a non-compound high order page, which would be wrong;
>> free_pages_prepare() would only be called for the head page (with the order
>> passed in) and that won't do the right thing.
>>
>> I guess you could pass the flag all the way to free_pages_prepare() then it
>> could be modified to do the right thing for contiguous order-0 pages; that would
>> probably ultimately be more efficient than calling free_pages_prepare() for
>> every order-0 page. Is that what you are suggesting?
> 
> Yes. I mistakenly mixed up a non-compound high order page and a set of order-0
> pages. There is alloc_pages_bulk() to get a list of order-0 pages, but
> free_pages_bulk() does not exist. Maybe that is what we need here?

This is what I initially started with; vmalloc maintains a list of struct pages
so why not just pass that list to something like free_pages_bulk()? The problem
is that the optimization relies on a *contiguous* set of order-0 pages. A list
of pointers to pages does not imply they must be contiguous. So I don't think
it's the right API.

We already have free_contig_range(), which takes a range of PFNs. That's the
exact semantic we can optimize, so surely that's the better API style? Hence I
added __free_contig_range() and reimplemented free_contig_range() (which does
more than vfree() wants) on top.
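
For completeness, roughly how the layering might look (illustrative sketch
only, not necessarily the exact patch contents):

void free_contig_range(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long not_freed = __free_contig_range(pfn, nr_pages);

	/* Callers expect every reference to have been dropped. */
	WARN(not_freed != 0, "%lu pages are still in use!\n", not_freed);
}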

> Using __free_frozen_pages() for a set of order-0 pages looks like
> shoehorning.

Perhaps; that's an internal function, so we could rename it to
__free_frozen_order(), passing in a PFN and an order, and teach it to handle
all 3 cases internally (rough sketch below the list):

 - high-order non-compound page (already supported)
 - high-order compound page (already supported)
 - power-of-2 sized and aligned run of contiguous order-0 pages (new)
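
As a rough sketch of what I have in mind (__free_frozen_order() and
FPI_CONTIG_ORDER_0 are made-up names, not existing kernel API):

static void __free_frozen_order(unsigned long pfn, unsigned int order,
				fpi_t fpi_flags)
{
	struct page *page = pfn_to_page(pfn);

	if (fpi_flags & FPI_CONTIG_ORDER_0) {
		/*
		 * Case 3: 2^order contiguous order-0 pages; the caller has
		 * already prepared each page, so skip the per-page prep and
		 * hand the whole block to the pcp/buddy in one go.
		 */
		__free_frozen_pages(page, order, fpi_flags | FPI_PREPARED);
		return;
	}

	/*
	 * Cases 1 and 2: high-order non-compound and compound pages keep
	 * the existing path; free_pages_prepare() runs once at 'order'.
	 */
	__free_frozen_pages(page, order, fpi_flags);
}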

> I admit that adding free_pages_bulk() with maximal code
> reuse and a good interface will take some effort, so it probably is a long
> term goal. free_pages_bulk() is also slightly different from what you
> want to do, since, if it uses same interface as alloc_pages_bulk(),
> it will need to accept a page array instead of page + order.

Yeah; I don't think that's a good API for this case. We would spend more effort
looking for contiguous pages when there are none. (For the vfree case it's
highly likely that they are contiguous, because that's how they were
allocated.)
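
To illustrate the extra effort: a hypothetical free_pages_bulk() (name and
shape invented here to mirror alloc_pages_bulk()) would first have to scan the
array just to rediscover physical contiguity:

static void free_pages_bulk(struct page **pages, unsigned long n)
{
	unsigned long i = 0;

	while (i < n) {
		unsigned long base = page_to_pfn(pages[i]);
		unsigned long run = 1;

		/* Scan forward while the PFNs stay contiguous. */
		while (i + run < n &&
		       page_to_pfn(pages[i + run]) == base + run)
			run++;

		/* Degenerates to one call per page if nothing is contiguous. */
		__free_contig_range(base, run);
		i += run;
	}
}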

> 
> I am not suggesting you should do this; just thinking out loud.
> 
>>
>>>
>>> Looking at free_pages_prepare(), three cases would cause failures:
>>> 1. PageHWPoison(page): the code excludes >0 order pages, so it needs
>>>    to be fixed. BTW, Jiaqi Yan has a series trying to tackle it[1].
>>>
>>> 2. uncleared PageNetpp(page): probably need to check every individual
>>>    page of this >0 order page and call bad_page() for any violator.
>>>
>>> 3. bad free page: probably need to do it for individual page as well.
>>
>> It's not just handling the failures, it's accounting; e.g.
>> __memcg_kmem_uncharge_page().
> 
> Got it. Another idea comes to mind.
> 
> Is it doable to
> 1) use put_page_testzero() to bring all pages’ refs to 0,
> 2) unsplit/merge these contiguous order-0 pages back to non-compound
>    high order pages,
> 3) free unsplit/merged pages with __free_frozen_pages()?

Yes, I thought about this approach. I think it would work, but I assumed it
would be extra effort to merge the pages only to then free them. It might be in
the noise though. Do you think this approach is better than what I suggest
above? If so, I'll have a go.
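
For the archive, this is roughly what I understand you to be suggesting
(untested sketch; it also glosses over the per-page accounting issue I
mentioned above):

static void merge_and_free(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long i;

	/* Step 1: drop every page's reference count to zero. */
	for (i = 0; i < nr_pages; i++)
		WARN_ON(!put_page_testzero(pfn_to_page(pfn + i)));

	/*
	 * Steps 2+3: treat each maximal aligned power-of-2 run as a single
	 * non-compound high-order page and free it as such.
	 */
	while (nr_pages) {
		unsigned int fit_order = ilog2(nr_pages);
		unsigned int align_order = pfn ? __ffs(pfn) : fit_order;
		unsigned int order = min3(fit_order, align_order,
					  pageblock_order);

		__free_frozen_pages(pfn_to_page(pfn), order, FPI_NONE);
		pfn += 1UL << order;
		nr_pages -= 1UL << order;
	}
}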

> 
> Since your example is 1) allocate a non-compound high order page,
> 2) split_page(). The above approach is doing the reverse steps.
> Does your example represent the actual use cases?

Yes; vmalloc() will allocate the highest orders it can and then call
split_page() because there are vmalloc memory users that expect to be able to
manipulate the
memory as order-0 pages.

Thanks,
Ryan

> 
> 
> Best Regards,
> Yan, Zi

