linux-kernel - Re: [PATCH v3] mm: alloc_pages_bulk: support both simple and full-featured API

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250420112110.GA32613@unreal>
Date: Sun, 20 Apr 2025 14:21:10 +0300
From: Leon Romanovsky <leon@...nel.org>
To: Yunsheng Lin <linyunsheng@...wei.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	Yishai Hadas <yishaih@...dia.com>, Jason Gunthorpe <jgg@...pe.ca>,
	Shameer Kolothum <shameerali.kolothum.thodi@...wei.com>,
	Kevin Tian <kevin.tian@...el.com>,
	Alex Williamson <alex.williamson@...hat.com>,
	Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>,
	David Sterba <dsterba@...e.com>, Gao Xiang <xiang@...nel.org>,
	Chao Yu <chao@...nel.org>, Yue Hu <zbestahu@...il.com>,
	Jeffle Xu <jefflexu@...ux.alibaba.com>,
	Sandeep Dhavale <dhavale@...gle.com>,
	Chuck Lever <chuck.lever@...cle.com>,
	Jeff Layton <jlayton@...nel.org>, Neil Brown <neilb@...e.de>,
	Olga Kornievskaia <okorniev@...hat.com>,
	Dai Ngo <Dai.Ngo@...cle.com>, Tom Talpey <tom@...pey.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Masami Hiramatsu <mhiramat@...nel.org>,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	Alexei Starovoitov <ast@...nel.org>,
	Daniel Borkmann <daniel@...earbox.net>,
	Andrii Nakryiko <andrii@...nel.org>,
	Martin KaFai Lau <martin.lau@...ux.dev>,
	Eduard Zingerman <eddyz87@...il.com>, Song Liu <song@...nel.org>,
	Yonghong Song <yonghong.song@...ux.dev>,
	John Fastabend <john.fastabend@...il.com>,
	KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>,
	Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
	Jesper Dangaard Brouer <hawk@...nel.org>,
	Ilias Apalodimas <ilias.apalodimas@...aro.org>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	Simon Horman <horms@...nel.org>,
	Trond Myklebust <trondmy@...nel.org>,
	Anna Schumaker <anna@...nel.org>,
	Luiz Capitulino <luizcap@...hat.com>,
	Mel Gorman <mgorman@...hsingularity.net>, kvm@...r.kernel.org,
	linux-kernel@...r.kernel.org, virtualization@...ts.linux.dev,
	linux-btrfs@...r.kernel.org, linux-erofs@...ts.ozlabs.org,
	linux-mm@...ck.org, linux-nfs@...r.kernel.org,
	linux-trace-kernel@...r.kernel.org, bpf@...r.kernel.org,
	netdev@...r.kernel.org
Subject: Re: [PATCH v3] mm: alloc_pages_bulk: support both simple and
 full-featured API

On Mon, Apr 14, 2025 at 08:08:11PM +0800, Yunsheng Lin wrote:
> As mentioned in [1], it seems odd to check NULL elements in
> the middle of page bulk allocating, and it seems caller can
> do a better job of bulk allocating pages into a whole array
> sequentially without checking NULL elements first before
> doing the page bulk allocation for most of existing users
> by passing 'page_array + allocated' and 'nr_pages - allocated'
> when calling subsequent page bulk alloc API so that NULL
> checking can be avoided, see the pattern in mm/mempolicy.c.
> 
> Through analyzing of existing bulk allocation API users, it
> seems only the fs users are depending on the assumption of
> populating only NULL elements, see:
> commit 91d6ac1d62c3 ("btrfs: allocate page arrays using bulk page allocator")
> commit d6db47e571dc ("erofs: do not use pagepool in z_erofs_gbuf_growsize()")
> commit f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
> commit 88e4d41a264d ("SUNRPC: Use __alloc_bulk_pages() in svc_init_buffer()")
> 
> The current API adds a mental burden for most users. For most
> users, their code would be much cleaner if the interface accepts
> an uninitialised array with length, and were told how many pages
> had been stored in that array, so support one simple and one
> full-featured to meet the above different use cases as below:
> - alloc_pages_bulk() would be given an uninitialised array of page
>   pointers and a required count and would return the number of
>   pages that were allocated.
> - alloc_pages_bulk_refill() would be given an initialised array
>   of page pointers some of which might be NULL. It would attempt
>   to allocate pages for the non-NULL pointers, return 0 if all
>   pages are allocated, -EAGAIN if at least one page allocated,
>   ok to try again immediately or -ENOMEM if don't bother trying
>   again soon, which provides a more consistent semantics than the
>   current API as mentioned in [2], at the cost of the pages might
>   be getting re-ordered to make the implementation simpler.
> 
> Change the existing fs users to use the full-featured API, except
> for the one for svc_init_buffer() in net/sunrpc/svc.c. Other
> existing callers can use the simple API as they seems to be passing
> all NULL elements via memset, kzalloc, etc, only remove unnecessary
> memset for existing users calling the simple API in this patch.
> 
> The test result for xfstests full test:
> Before this patch:
> btrfs/default: 1061 tests, 3 failures, 290 skipped, 13152 seconds
>   Failures: btrfs/012 btrfs/226
>   Flaky: generic/301: 60% (3/5)
> Totals: 1073 tests, 290 skipped, 13 failures, 0 errors, 12540s
> 
> nfs/loopback: 530 tests, 3 failures, 392 skipped, 3942 seconds
>   Failures: generic/464 generic/551
>   Flaky: generic/650: 40% (2/5)
> Totals: 542 tests, 392 skipped, 12 failures, 0 errors, 3799s
> 
> After this patch:
> btrfs/default: 1061 tests, 2 failures, 290 skipped, 13446 seconds
>   Failures: btrfs/012 btrfs/226
> Totals: 1069 tests, 290 skipped, 10 failures, 0 errors, 12853s
> 
> nfs/loopback: 530 tests, 3 failures, 392 skipped, 4103 seconds
>   Failures: generic/464 generic/551
>   Flaky: generic/650: 60% (3/5)
> Totals: 542 tests, 392 skipped, 13 failures, 0 errors, 3933s
> 
> The stress test also suggest there is no regression for the erofs
> too.
> 
> Using the simple API also enable the caller to not zero the array
> before calling the page bulk allocating API, which has about 1~2 ns
> performance improvement for time_bench_page_pool03_slow() test case
> of page_pool in a x86 vm system, this reduces some performance impact
> of fixing the DMA API misuse problem in [3], performance improves
> from 87.886 ns to 86.429 ns.
> 
> Also a temporary patch to enable the using of full-featured API in
> page_pool suggests that the new full-featured API doesn't seem to have
> noticeable performance impact for the existing users, like SUNRPC, btrfs
> and erofs.
> 
> 1. https://lore.kernel.org/all/bd8c2f5c-464d-44ab-b607-390a87ea4cd5@huawei.com/
> 2. https://lore.kernel.org/all/180818a1-b906-4a0b-89d3-34cb71cc26c9@huawei.com/
> 3. https://lore.kernel.org/all/20250212092552.1779679-1-linyunsheng@huawei.com/
> CC: Jesper Dangaard Brouer <hawk@...nel.org>
> CC: Luiz Capitulino <luizcap@...hat.com>
> CC: Mel Gorman <mgorman@...hsingularity.net>
> Suggested-by: Neil Brown <neilb@...e.de>
> Acked-by: Jeff Layton <jlayton@...nel.org>
> Signed-off-by: Yunsheng Lin <linyunsheng@...wei.com>
> ---
> V3:
> 1. Provide both simple and full-featured API as suggested by NeilBrown.
> 2. Do the fs testing as suggested in V2.
> 
> V2:
> 1. Drop RFC tag.
> 2. Fix a compile error for xfs.
> 3. Defragmemt the page_array for SUNRPC and btrfs.
> ---
>  drivers/vfio/pci/mlx5/cmd.c       |  2 --
>  drivers/vfio/pci/virtio/migrate.c |  2 --
>  fs/btrfs/extent_io.c              | 21 +++++++++---------
>  fs/erofs/zutil.c                  | 11 +++++----
>  include/linux/gfp.h               | 37 +++++++++++++++++++++++++++++++
>  include/trace/events/sunrpc.h     | 12 +++++-----
>  kernel/bpf/arena.c                |  1 -
>  mm/page_alloc.c                   | 32 +++++---------------------
>  net/core/page_pool.c              |  3 ---
>  net/sunrpc/svc_xprt.c             | 12 ++++++----
>  10 files changed, 72 insertions(+), 61 deletions(-)
> 
> diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
> index 11eda6b207f1..fb094527715f 100644
> --- a/drivers/vfio/pci/mlx5/cmd.c
> +++ b/drivers/vfio/pci/mlx5/cmd.c
> @@ -446,8 +446,6 @@ static int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
>  		if (ret)
>  			goto err_append;
>  		buf->allocated_length += filled * PAGE_SIZE;
> -		/* clean input for another bulk allocation */
> -		memset(page_list, 0, filled * sizeof(*page_list));
>  		to_fill = min_t(unsigned int, to_alloc,
>  				PAGE_SIZE / sizeof(*page_list));

If it is possible, let's drop this hunk to reduce merge conflicts.
The whole mlx5vf_add_migration_pages() is planned to be rewritten.
https://lore.kernel.org/linux-rdma/076a3991e663fe07c1a5395f5805c514b63e4d94.1744825142.git.leon@kernel.org/

Thanks


>  	} while (to_alloc > 0);
> diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
> index ba92bb4e9af9..9f003a237dec 100644
> --- a/drivers/vfio/pci/virtio/migrate.c
> +++ b/drivers/vfio/pci/virtio/migrate.c
> @@ -91,8 +91,6 @@ static int virtiovf_add_migration_pages(struct virtiovf_data_buffer *buf,
>  		if (ret)
>  			goto err_append;
>  		buf->allocated_length += filled * PAGE_SIZE;
> -		/* clean input for another bulk allocation */
> -		memset(page_list, 0, filled * sizeof(*page_list));
>  		to_fill = min_t(unsigned int, to_alloc,
>  				PAGE_SIZE / sizeof(*page_list));
>  	} while (to_alloc > 0);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 197f5e51c474..51ef15703900 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -623,21 +623,22 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
>  			   bool nofail)
>  {
>  	const gfp_t gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
> -	unsigned int allocated;
> -
> -	for (allocated = 0; allocated < nr_pages;) {
> -		unsigned int last = allocated;
> +	int ret;
>  
> -		allocated = alloc_pages_bulk(gfp, nr_pages, page_array);
> -		if (unlikely(allocated == last)) {
> +	do {
> +		ret = alloc_pages_bulk_refill(gfp, nr_pages, page_array);
> +		if (unlikely(ret == -ENOMEM)) {
>  			/* No progress, fail and do cleanup. */
> -			for (int i = 0; i < allocated; i++) {
> -				__free_page(page_array[i]);
> -				page_array[i] = NULL;
> +			for (int i = 0; i < nr_pages; i++) {
> +				if (page_array[i]) {
> +					__free_page(page_array[i]);
> +					page_array[i] = NULL;
> +				}
>  			}
>  			return -ENOMEM;
>  		}
> -	}
> +	} while (ret == -EAGAIN);
> +
>  	return 0;
>  }
>  
> diff --git a/fs/erofs/zutil.c b/fs/erofs/zutil.c
> index 55ff2ab5128e..6ce11a8a261c 100644
> --- a/fs/erofs/zutil.c
> +++ b/fs/erofs/zutil.c
> @@ -68,7 +68,7 @@ int z_erofs_gbuf_growsize(unsigned int nrpages)
>  	struct page **tmp_pages = NULL;
>  	struct z_erofs_gbuf *gbuf;
>  	void *ptr, *old_ptr;
> -	int last, i, j;
> +	int ret, i, j;
>  
>  	mutex_lock(&gbuf_resize_mutex);
>  	/* avoid shrinking gbufs, since no idea how many fses rely on */
> @@ -86,12 +86,11 @@ int z_erofs_gbuf_growsize(unsigned int nrpages)
>  		for (j = 0; j < gbuf->nrpages; ++j)
>  			tmp_pages[j] = gbuf->pages[j];
>  		do {
> -			last = j;
> -			j = alloc_pages_bulk(GFP_KERNEL, nrpages,
> -					     tmp_pages);
> -			if (last == j)
> +			ret = alloc_pages_bulk_refill(GFP_KERNEL, nrpages,
> +						      tmp_pages);
> +			if (ret == -ENOMEM)
>  				goto out;
> -		} while (j != nrpages);
> +		} while (ret == -EAGAIN);
>  
>  		ptr = vmap(tmp_pages, nrpages, VM_MAP, PAGE_KERNEL);
>  		if (!ptr)
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index c9fa6309c903..cf6100981fd6 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -244,6 +244,43 @@ unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
>  #define alloc_pages_bulk(_gfp, _nr_pages, _page_array)		\
>  	__alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _page_array)
>  
> +/*
> + * alloc_pages_bulk_refill_noprof - Refill order-0 pages to an array
> + * @gfp: GFP flags for the allocation when refilling
> + * @nr_pages: The size of refilling array
> + * @page_array: The array to refill order-0 pages
> + *
> + * Note that only NULL elements are populated with pages and the pages might
> + * get re-ordered.
> + *
> + * Return 0 if all pages are refilled, -EAGAIN if at least one page is refilled,
> + * ok to try again immediately or -ENOMEM if no page is refilled and don't
> + * bother trying again soon.
> + */
> +static inline int alloc_pages_bulk_refill_noprof(gfp_t gfp, int nr_pages,
> +						 struct page **page_array)
> +{
> +	int allocated = 0, i;
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		if (page_array[i]) {
> +			swap(page_array[allocated], page_array[i]);
> +			allocated++;
> +		}
> +	}
> +
> +	i = alloc_pages_bulk_noprof(gfp, numa_mem_id(), NULL,
> +				    nr_pages - allocated,
> +				    page_array + allocated);
> +	if (likely(allocated + i == nr_pages))
> +		return 0;
> +
> +	return i ? -EAGAIN : -ENOMEM;
> +}
> +
> +#define alloc_pages_bulk_refill(...)				\
> +	alloc_hooks(alloc_pages_bulk_refill_noprof(__VA_ARGS__))
> +
>  static inline unsigned long
>  alloc_pages_bulk_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages,
>  				   struct page **page_array)
> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> index 5d331383047b..cb8899f1cbdc 100644
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@ -2143,23 +2143,23 @@ TRACE_EVENT(svc_wake_up,
>  TRACE_EVENT(svc_alloc_arg_err,
>  	TP_PROTO(
>  		unsigned int requested,
> -		unsigned int allocated
> +		int ret
>  	),
>  
> -	TP_ARGS(requested, allocated),
> +	TP_ARGS(requested, ret),
>  
>  	TP_STRUCT__entry(
>  		__field(unsigned int, requested)
> -		__field(unsigned int, allocated)
> +		__field(int, ret)
>  	),
>  
>  	TP_fast_assign(
>  		__entry->requested = requested;
> -		__entry->allocated = allocated;
> +		__entry->ret = ret;
>  	),
>  
> -	TP_printk("requested=%u allocated=%u",
> -		__entry->requested, __entry->allocated)
> +	TP_printk("requested=%u ret=%d",
> +		__entry->requested, __entry->ret)
>  );
>  
>  DECLARE_EVENT_CLASS(svc_deferred_event,
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 0d56cea71602..9022c4440814 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -445,7 +445,6 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
>  			return 0;
>  	}
>  
> -	/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
>  	pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
>  	if (!pages)
>  		return 0;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d7cfcfa2b077..59a4fe23e62a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4784,9 +4784,6 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>   * This is a batched version of the page allocator that attempts to
>   * allocate nr_pages quickly. Pages are added to the page_array.
>   *
> - * Note that only NULL elements are populated with pages and nr_pages
> - * is the maximum number of pages that will be stored in the array.
> - *
>   * Returns the number of pages in the array.
>   */
>  unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> @@ -4802,29 +4799,18 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  	struct alloc_context ac;
>  	gfp_t alloc_gfp;
>  	unsigned int alloc_flags = ALLOC_WMARK_LOW;
> -	int nr_populated = 0, nr_account = 0;
> -
> -	/*
> -	 * Skip populated array elements to determine if any pages need
> -	 * to be allocated before disabling IRQs.
> -	 */
> -	while (nr_populated < nr_pages && page_array[nr_populated])
> -		nr_populated++;
> +	int nr_populated = 0;
>  
>  	/* No pages requested? */
>  	if (unlikely(nr_pages <= 0))
>  		goto out;
>  
> -	/* Already populated array? */
> -	if (unlikely(nr_pages - nr_populated == 0))
> -		goto out;
> -
>  	/* Bulk allocator does not support memcg accounting. */
>  	if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT))
>  		goto failed;
>  
>  	/* Use the single page allocator for one page. */
> -	if (nr_pages - nr_populated == 1)
> +	if (nr_pages == 1)
>  		goto failed;
>  
>  #ifdef CONFIG_PAGE_OWNER
> @@ -4896,24 +4882,16 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  	/* Attempt the batch allocation */
>  	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
>  	while (nr_populated < nr_pages) {
> -
> -		/* Skip existing pages */
> -		if (page_array[nr_populated]) {
> -			nr_populated++;
> -			continue;
> -		}
> -
>  		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
>  								pcp, pcp_list);
>  		if (unlikely(!page)) {
>  			/* Try and allocate at least one page */
> -			if (!nr_account) {
> +			if (!nr_populated) {
>  				pcp_spin_unlock(pcp);
>  				goto failed_irq;
>  			}
>  			break;
>  		}
> -		nr_account++;
>  
>  		prep_new_page(page, 0, gfp, 0);
>  		set_page_refcounted(page);
> @@ -4923,8 +4901,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  	pcp_spin_unlock(pcp);
>  	pcp_trylock_finish(UP_flags);
>  
> -	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
> -	zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account);
> +	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_populated);
> +	zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_populated);
>  
>  out:
>  	return nr_populated;
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 7745ad924ae2..2431d2f6d610 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -541,9 +541,6 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
>  	if (unlikely(pool->alloc.count > 0))
>  		return pool->alloc.cache[--pool->alloc.count];
>  
> -	/* Mark empty alloc.cache slots "empty" for alloc_pages_bulk */
> -	memset(&pool->alloc.cache, 0, sizeof(void *) * bulk);
> -
>  	nr_pages = alloc_pages_bulk_node(gfp, pool->p.nid, bulk,
>  					 (struct page **)pool->alloc.cache);
>  	if (unlikely(!nr_pages))
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index ae25405d8bd2..1191686fc0af 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -653,7 +653,8 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
>  {
>  	struct svc_serv *serv = rqstp->rq_server;
>  	struct xdr_buf *arg = &rqstp->rq_arg;
> -	unsigned long pages, filled, ret;
> +	unsigned long pages;
> +	int ret;
>  
>  	pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
>  	if (pages > RPCSVC_MAXPAGES) {
> @@ -663,9 +664,12 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
>  		pages = RPCSVC_MAXPAGES;
>  	}
>  
> -	for (filled = 0; filled < pages; filled = ret) {
> -		ret = alloc_pages_bulk(GFP_KERNEL, pages, rqstp->rq_pages);
> -		if (ret > filled)
> +	while (true) {
> +		ret = alloc_pages_bulk_refill(GFP_KERNEL, pages, rqstp->rq_pages);
> +		if (!ret)
> +			break;
> +
> +		if (ret == -EAGAIN)
>  			/* Made progress, don't sleep yet */
>  			continue;
>  
> -- 
> 2.33.0
>