Message-ID: <21E14DB6-DA70-4F7F-8482-12DFED81153D@nvidia.com>
Date: Mon, 12 Jan 2026 10:57:46 -0500
From: Zi Yan <ziy@...dia.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...nel.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Brendan Jackman <jackmanb@...gle.com>, Johannes Weiner <hannes@...xchg.org>,
Uladzislau Rezki <urezki@...il.com>,
"Vishal Moola (Oracle)" <vishal.moola@...il.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Jiaqi Yan <jiaqiyan@...gle.com>
Subject: Re: [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range()
On 12 Jan 2026, at 8:24, Ryan Roberts wrote:
> Hi Zi,
>
> Sorry for the slow response - I had some other high priority stuff come in...
>
>
> On 07/01/2026 03:32, Zi Yan wrote:
>> On 5 Jan 2026, at 12:31, Ryan Roberts wrote:
>>
>>> On 05/01/2026 17:15, Zi Yan wrote:
>>>> On 5 Jan 2026, at 11:17, Ryan Roberts wrote:
>>>>
>>>>> Decompose the range of order-0 pages to be freed into the set of largest
>>>>> possible power-of-2 size and aligned chunks and free them to the pcp or
>>>>> buddy. This improves on the previous approach which freed each order-0
>>>>> page individually in a loop. Testing shows performance to be improved by
>>>>> more than 10x in some cases.
>>>>>
>>>>> Since each page is order-0, we must decrement each page's reference
>>>>> count individually and only consider the page for freeing as part of a
>>>>> high order chunk if the reference count goes to zero. Additionally
>>>>> free_pages_prepare() must be called for each individual order-0 page
>>>>> too, so that the struct page state and global accounting state can be
>>>>> appropriately managed. But once this is done, the resulting high order
>>>>> chunks can be freed as a unit to the pcp or buddy.
>>>>>
>>>>> This significantly speeds up the free operation but also has the side
>>>>> benefit that high order blocks are added to the pcp instead of each page
>>>>> ending up on the pcp order-0 list; memory remains more readily available
>>>>> in high orders.
>>>>>
>>>>> vmalloc will shortly become a user of this new optimized
>>>>> free_contig_range() since it aggressively allocates high order
>>>>> non-compound pages, but then calls split_page() to end up with
>>>>> contiguous order-0 pages. These can now be freed much more efficiently.
>>>>>
>>>>> The execution time of the following function was measured in a VM on an
>>>>> Apple M2 system:
>>>>>
>>>>> static int page_alloc_high_ordr_test(void)
>>>>> {
>>>>> unsigned int order = HPAGE_PMD_ORDER;
>>>>> struct page *page;
>>>>> int i;
>>>>>
>>>>> for (i = 0; i < 100000; i++) {
>>>>> page = alloc_pages(GFP_KERNEL, order);
>>>>> if (!page)
>>>>> return -1;
>>>>> split_page(page, order);
>>>>> free_contig_range(page_to_pfn(page), 1UL << order);
>>>>> }
>>>>>
>>>>> return 0;
>>>>> }
>>>>>
>>>>> Execution time before: 1684366 usec
>>>>> Execution time after: 136216 usec
>>>>>
>>>>> Perf trace before:
>>>>>
>>>>> 60.93% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
>>>>> |
>>>>> ---ret_from_fork
>>>>> kthread
>>>>> 0xffffbba283e63980
>>>>> |
>>>>> |--60.01%--0xffffbba283e636dc
>>>>> | |
>>>>> | |--58.57%--free_contig_range
>>>>> | | |
>>>>> | | |--57.19%--___free_pages
>>>>> | | | |
>>>>> | | | |--46.65%--__free_frozen_pages
>>>>> | | | | |
>>>>> | | | | |--28.08%--free_pcppages_bulk
>>>>> | | | | |
>>>>> | | | | --12.05%--free_frozen_page_commit.constprop.0
>>>>> | | | |
>>>>> | | | |--5.10%--__get_pfnblock_flags_mask.isra.0
>>>>> | | | |
>>>>> | | | |--1.13%--_raw_spin_unlock
>>>>> | | | |
>>>>> | | | |--0.78%--free_frozen_page_commit.constprop.0
>>>>> | | | |
>>>>> | | | --0.75%--_raw_spin_trylock
>>>>> | | |
>>>>> | | --0.95%--__free_frozen_pages
>>>>> | |
>>>>> | --1.44%--___free_pages
>>>>> |
>>>>> --0.78%--0xffffbba283e636c0
>>>>> split_page
>>>>>
>>>>> Perf trace after:
>>>>>
>>>>> 10.62% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
>>>>> |
>>>>> ---ret_from_fork
>>>>> kthread
>>>>> 0xffffbbd55ef74980
>>>>> |
>>>>> |--8.74%--0xffffbbd55ef746dc
>>>>> | free_contig_range
>>>>> | |
>>>>> | --8.72%--__free_contig_range
>>>>> |
>>>>> --1.56%--0xffffbbd55ef746c0
>>>>> |
>>>>> --1.54%--split_page
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
>>>>> ---
>>>>> include/linux/gfp.h | 1 +
>>>>> mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++-----
>>>>> 2 files changed, 106 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>>>> index b155929af5b1..3ed0bef34d0c 100644
>>>>> --- a/include/linux/gfp.h
>>>>> +++ b/include/linux/gfp.h
>>>>> @@ -439,6 +439,7 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
>>>>> #define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
>>>>>
>>>>> #endif
>>>>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>>>> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>>>>
>>>>> #ifdef CONFIG_CONTIG_ALLOC
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index a045d728ae0f..1015c8edf8a4 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -91,6 +91,9 @@ typedef int __bitwise fpi_t;
>>>>> /* Free the page without taking locks. Rely on trylock only. */
>>>>> #define FPI_TRYLOCK ((__force fpi_t)BIT(2))
>>>>>
>>>>> +/* free_pages_prepare() has already been called for page(s) being freed. */
>>>>> +#define FPI_PREPARED ((__force fpi_t)BIT(3))
>>>>> +
>>>>> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>>>>> static DEFINE_MUTEX(pcp_batch_high_lock);
>>>>> #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
>>>>> @@ -1582,8 +1585,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>>>>> unsigned long pfn = page_to_pfn(page);
>>>>> struct zone *zone = page_zone(page);
>>>>>
>>>>> - if (free_pages_prepare(page, order))
>>>>> - free_one_page(zone, page, pfn, order, fpi_flags);
>>>>> + if (!(fpi_flags & FPI_PREPARED)) {
>>>>> + if (!free_pages_prepare(page, order))
>>>>> + return;
>>>>> + }
>>>>> +
>>>>> + free_one_page(zone, page, pfn, order, fpi_flags);
>>>>> }
>>>>>
>>>>> void __meminit __free_pages_core(struct page *page, unsigned int order,
>>>>> @@ -2943,8 +2950,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
>>>>> return;
>>>>> }
>>>>>
>>>>> - if (!free_pages_prepare(page, order))
>>>>> - return;
>>>>> + if (!(fpi_flags & FPI_PREPARED)) {
>>>>> + if (!free_pages_prepare(page, order))
>>>>> + return;
>>>>> + }
>>>>>
>>>>> /*
>>>>> * We only track unmovable, reclaimable and movable on pcp lists.
>>>>> @@ -7250,9 +7259,99 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
>>>>> }
>>>>> #endif /* CONFIG_CONTIG_ALLOC */
>>>>>
>>>>> +static void free_prepared_contig_range(struct page *page,
>>>>> + unsigned long nr_pages)
>>>>> +{
>>>>> + while (nr_pages) {
>>>>> + unsigned int fit_order, align_order, order;
>>>>> + unsigned long pfn;
>>>>> +
>>>>> + /*
>>>>> + * Find the largest aligned power-of-2 number of pages that
>>>>> + * starts at the current page, does not exceed nr_pages and is
>>>>> + * less than or equal to pageblock_order.
>>>>> + */
>>>>> + pfn = page_to_pfn(page);
>>>>> + fit_order = ilog2(nr_pages);
>>>>> + align_order = pfn ? __ffs(pfn) : fit_order;
>>>>> + order = min3(fit_order, align_order, pageblock_order);
>>>>> +
>>>>> + /*
>>>>> + * Free the chunk as a single block. Our caller has already
>>>>> + * called free_pages_prepare() for each order-0 page.
>>>>> + */
>>>>> + __free_frozen_pages(page, order, FPI_PREPARED);
>>>>> +
>>>>> + page += 1UL << order;
>>>>> + nr_pages -= 1UL << order;
>>>>> + }
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * __free_contig_range - Free contiguous range of order-0 pages.
>>>>> + * @pfn: Page frame number of the first page in the range.
>>>>> + * @nr_pages: Number of pages to free.
>>>>> + *
>>>>> + * For each order-0 struct page in the physically contiguous range, put a
>>>>> + * reference. Free any page whose reference count falls to zero. The
>>>>> + * implementation is functionally equivalent to, but significantly faster than
>>>>> + * calling __free_page() for each struct page in a loop.
>>>>> + *
>>>>> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
>>>>> + * order-0 with split_page() is an example of appropriate contiguous pages that
>>>>> + * can be freed with this API.
>>>>> + *
>>>>> + * Returns the number of pages which were not freed, because their reference
>>>>> + * count did not fall to zero.
>>>>> + *
>>>>> + * Context: May be called in interrupt context or while holding a normal
>>>>> + * spinlock, but not in NMI context or while holding a raw spinlock.
>>>>> + */
>>>>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
>>>>> +{
>>>>> + struct page *page = pfn_to_page(pfn);
>>>>> + unsigned long not_freed = 0;
>>>>> + struct page *start = NULL;
>>>>> + unsigned long i;
>>>>> + bool can_free;
>>>>> +
>>>>> + /*
>>>>> + * Chunk the range into contiguous runs of pages for which the refcount
>>>>> + * went to zero and for which free_pages_prepare() succeeded. If
>>>>> + * free_pages_prepare() fails we consider the page to have been freed and
>>>>> + * deliberately leak it.
>>>>> + *
>>>>> + * Code assumes contiguous PFNs have contiguous struct pages, but not
>>>>> + * vice versa.
>>>>> + */
>>>>> + for (i = 0; i < nr_pages; i++, page++) {
>>>>> + VM_BUG_ON_PAGE(PageHead(page), page);
>>>>> + VM_BUG_ON_PAGE(PageTail(page), page);
>>>>> +
>>>>> + can_free = put_page_testzero(page);
>>>>> + if (!can_free)
>>>>> + not_freed++;
>>>>> + else if (!free_pages_prepare(page, 0))
>>>>> + can_free = false;
>>>>
>>>> I understand you use free_pages_prepare() here to catch early failures.
>>>> I wonder if we could let __free_frozen_pages() handle the failure of
>>>> non-compound >0 order pages instead of a new FPI flag.
>>>
>>> I'm not sure I follow. You would still need to provide a flag to
>>> __free_frozen_pages() to tell it "this is a set of order-0 pages". Otherwise it
>>> will treat it as a non-compound high order page, which would be wrong;
>>> free_pages_prepare() would only be called for the head page (with the order
>>> passed in) and that won't do the right thing.
>>>
>>> I guess you could pass the flag all the way to free_pages_prepare() then it
>>> could be modified to do the right thing for contiguous order-0 pages; that would
>>> probably ultimately be more efficient than calling free_pages_prepare() for
>>> every order-0 page. Is that what you are suggesting?
>>
>> Yes. I mistakenly mixed up a non-compound high order page with a set of order-0
>> pages. There is alloc_pages_bulk() to get a list of order-0 pages, but
>> free_pages_bulk() does not exist. Maybe that is what we need here?
>
> This is what I initially started with; vmalloc maintains a list of struct pages
> so why not just pass that list to something like free_pages_bulk()? The problem
> is that the optimization relies on a *contiguous* set of order-0 pages. A list
> of pointers to pages does not imply they must be contiguous. So I don't think
> it's the right API.
>
> We already have free_contig_range() which takes a range of PFNs. That's the
> exact semantic we can optimize so surely that's the better API style? Hence I
> added __free_contig_range() and reimplemented free_contig_range() (which does
> more than vfree() wants) on top.
>
>> Using __free_frozen_pages() for a set of order-0 pages looks like a
>> shoehorning.
>
> Perhaps; that's an internal function, so we could rename it to __free_frozen_order(),
> pass in a PFN and an order, and teach it to handle all 3 cases internally:
>
> - high-order non-compound page (already supports)
> - high-order compound page (already supports)
> - power-of-2 sized and aligned number of contiguous order-0 pages (new)
>
>> I admit that adding free_pages_bulk() with maximal code
>> reuse and a good interface will take some effort, so it probably is a long
>> term goal. free_pages_bulk() is also slightly different from what you
>> want to do, since, if it uses the same interface as alloc_pages_bulk(),
>> it will need to accept a page array instead of page + order.
>
> Yeah; I don't think that's a good API for this case. We would spend more effort
> looking for contiguous pages when there are none. (for the vfree case it's
> highly likely that they are contiguous because that's how they were allocated).
>
All of the above makes sense to me.
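
For my own understanding, here is a rough sketch of how I picture the new
contiguous order-0 case fitting into such a __free_frozen_order(). The
contig_order0 parameter, the bail-out-and-leak error handling, and the direct
call to free_one_page() (i.e. skipping the pcp path) are all hypothetical
simplifications, not a proposal for the real implementation:

	static void __free_frozen_order(unsigned long pfn, unsigned int order,
					fpi_t fpi_flags, bool contig_order0)
	{
		struct page *page = pfn_to_page(pfn);

		if (contig_order0) {
			unsigned long i;

			/*
			 * Power-of-2 sized and aligned run of order-0 pages:
			 * prepare each constituent page individually so that
			 * per-page state and accounting are handled ...
			 */
			for (i = 0; i < (1UL << order); i++) {
				if (!free_pages_prepare(page + i, 0))
					return;	/* simplification: leak the chunk */
			}
		} else {
			/* compound or non-compound high order page */
			if (!free_pages_prepare(page, order))
				return;
		}

		/* ... then hand the whole chunk to the buddy as one unit. */
		free_one_page(page_zone(page), page, pfn, order, fpi_flags);
	}

With something like that, the caller would only need your chunking loop from
free_prepared_contig_range() and no FPI_PREPARED flag, if I am reading your
suggestion right.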
>>
>> I am not suggesting you should do this, just thinking out loud.
>>
>>>
>>>>
>>>> Looking at free_pages_prepare(), three cases would cause failures:
>>>> 1. PageHWPoison(page): the code excludes >0 order pages, so it needs
>>>> to be fixed. BTW, Jiaqi Yan has a series trying to tackle it[1].
>>>>
>>>> 2. uncleared PageNetpp(page): probably need to check every individual
>>>> page of this >0 order page and call bad_page() for any violator.
>>>>
>>>> 3. bad free page: probably need to do it for individual page as well.
>>>
>>> It's not just handling the failures, it's accounting; e.g.
>>> __memcg_kmem_uncharge_page().
>>
>> Got it. Another idea comes to mind.
>>
>> Is it doable to
>> 1) use put_page_testzero() to bring all pages’ refs to 0,
>> 2) unsplit/merge these contiguous order-0 pages back to non-compound
>> high order pages,
>> 3) free unsplit/merged pages with __free_frozen_pages()?
>
> Yes, I thought about this approach. I think it would work, but I assumed it
> would be extra effort to merge the pages only to then free them. It might be in
> the noise though. Do you think this approach is better than what I suggest
> above? If so, I'll have a go.
Yes, the symmetry also makes it easier to follow. Thanks.
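
To make sure we mean the same thing, here is a minimal sketch of the
unsplit/merge idea for one power-of-2 sized and aligned chunk. The helper name
is made up, and what to do with pages whose refcount does not drop to zero is
glossed over:

	static void free_unsplit_chunk(struct page *page, unsigned int order)
	{
		unsigned long i;

		/* Undo split_page(): bring every page's refcount to zero ... */
		for (i = 0; i < (1UL << order); i++) {
			if (!put_page_testzero(page + i))
				return;	/* someone still holds a reference */
		}

		/*
		 * ... then the chunk looks like a frozen non-compound high
		 * order page again and can be freed in one go.
		 */
		__free_frozen_pages(page, order, FPI_NONE);
	}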
>
>>
>> Since your example is 1) allocate a non-compound high order page,
>> 2) split_page(). The above approach is doing the reverse steps.
>> Does your example represent the actual use cases?
>
> Yes; vmalloc() will allocate the highest orders it can, then call split_page()
> because there are vmalloc memory users that expect to be able to manipulate the
> memory as order-0 pages.
Got it. Thank you for the explanation.
Best Regards,
Yan, Zi