Message-ID: <c9950a79-7bcb-41c2-a59e-af315dc6d7ff@huawei.com>
Date: Wed, 19 Feb 2025 19:20:04 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Dave Chinner <david@...morbit.com>
CC: Yishai Hadas <yishaih@...dia.com>, Jason Gunthorpe <jgg@...pe.ca>, Shameer
 Kolothum <shameerali.kolothum.thodi@...wei.com>, Kevin Tian
	<kevin.tian@...el.com>, Alex Williamson <alex.williamson@...hat.com>, Chris
 Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba
	<dsterba@...e.com>, Gao Xiang <xiang@...nel.org>, Chao Yu <chao@...nel.org>,
	Yue Hu <zbestahu@...il.com>, Jeffle Xu <jefflexu@...ux.alibaba.com>, Sandeep
 Dhavale <dhavale@...gle.com>, Carlos Maiolino <cem@...nel.org>, "Darrick J.
 Wong" <djwong@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>, Jesper
 Dangaard Brouer <hawk@...nel.org>, Ilias Apalodimas
	<ilias.apalodimas@...aro.org>, "David S. Miller" <davem@...emloft.net>, Eric
 Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni
	<pabeni@...hat.com>, Simon Horman <horms@...nel.org>, Trond Myklebust
	<trondmy@...nel.org>, Anna Schumaker <anna@...nel.org>, Chuck Lever
	<chuck.lever@...cle.com>, Jeff Layton <jlayton@...nel.org>, Neil Brown
	<neilb@...e.de>, Olga Kornievskaia <okorniev@...hat.com>, Dai Ngo
	<Dai.Ngo@...cle.com>, Tom Talpey <tom@...pey.com>, Luiz Capitulino
	<luizcap@...hat.com>, Mel Gorman <mgorman@...hsingularity.net>,
	<kvm@...r.kernel.org>, <virtualization@...ts.linux.dev>,
	<linux-kernel@...r.kernel.org>, <linux-btrfs@...r.kernel.org>,
	<linux-erofs@...ts.ozlabs.org>, <linux-xfs@...r.kernel.org>,
	<linux-mm@...ck.org>, <netdev@...r.kernel.org>, <linux-nfs@...r.kernel.org>
Subject: Re: [RFC] mm: alloc_pages_bulk: remove assumption of populating only
 NULL elements

On 2025/2/19 5:14, Dave Chinner wrote:
> On Tue, Feb 18, 2025 at 05:21:27PM +0800, Yunsheng Lin wrote:
>> On 2025/2/18 5:31, Dave Chinner wrote:
>>
>> ...
>>
>>> .....
>>>
>>>> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
>>>> index 15bb790359f8..9e1ce0ab9c35 100644
>>>> --- a/fs/xfs/xfs_buf.c
>>>> +++ b/fs/xfs/xfs_buf.c
>>>> @@ -377,16 +377,17 @@ xfs_buf_alloc_pages(
>>>>  	 * least one extra page.
>>>>  	 */
>>>>  	for (;;) {
>>>> -		long	last = filled;
>>>> +		long	alloc;
>>>>  
>>>> -		filled = alloc_pages_bulk(gfp_mask, bp->b_page_count,
>>>> -					  bp->b_pages);
>>>> +		alloc = alloc_pages_bulk(gfp_mask, bp->b_page_count - refill,
>>>> +					 bp->b_pages + refill);
>>>> +		refill += alloc;
>>>>  		if (filled == bp->b_page_count) {
>>>>  			XFS_STATS_INC(bp->b_mount, xb_page_found);
>>>>  			break;
>>>>  		}
>>>>  
>>>> -		if (filled != last)
>>>> +		if (alloc)
>>>>  			continue;
>>>
>>> You didn't even compile this code - refill is not defined
>>> anywhere.
>>>
>>> Even if it did compile, you clearly didn't test it. The logic is
>>> broken (what updates filled?) and will result in the first
>>> allocation attempt succeeding and then falling into an endless retry
>>> loop.
>>
>> Ah, 'refill' is a typo; it should be 'filled'. The following should
>> fix the compile error:
>> --- a/fs/xfs/xfs_buf.c
>> +++ b/fs/xfs/xfs_buf.c
>> @@ -379,9 +379,9 @@ xfs_buf_alloc_pages(
>>         for (;;) {
>>                 long    alloc;
>>
>> -               alloc = alloc_pages_bulk(gfp_mask, bp->b_page_count - refill,
>> -                                        bp->b_pages + refill);
>> -               refill += alloc;
>> +               alloc = alloc_pages_bulk(gfp_mask, bp->b_page_count - filled,
>> +                                        bp->b_pages + filled);
>> +               filled += alloc;
>>                 if (filled == bp->b_page_count) {
>>                         XFS_STATS_INC(bp->b_mount, xb_page_found);
>>                         break;
>>
>>>
>>> i.e. you stepped on the API landmine of your own creation where
>>> it is impossible to tell the difference between alloc_pages_bulk()
>>> returning "memory allocation failed, you need to retry" and
>>> it returning "array is full, nothing more to allocate". Both these
>>> cases now return 0.
>>
>> As I understand it, alloc_pages_bulk() will not be called when the
>> "array is full", since the 'filled == bp->b_page_count' check above
>> ensures that, provided the array is not passed in with holes in the
>> middle for xfs.
> 
> You miss the point entirely. Previously, alloc_pages_bulk() would
> return a value that would tell us the array is full, even when we
> called it with a full array to begin with.
> 
> Now it fails to tell us that the array is full, and we have to track
> that precisely ourselves - it is impossible to tell the difference
> between "array is full" and "allocation failed". Not being able to
> determine from the allocation return value whether the array is
> ready for use or whether another go-around to fill it is needed is a
> very poor API choice, regardless of anything else.
> 
> You've already demonstrated this: tracking array usage in every
> caller is error-prone and much harder to get right than just having
> alloc_pages_bulk() do everything for us.

While I agree that it may be hard to track array usage in every caller
to see whether removing the assumption of populating only NULL elements
causes problems for them, I still think the page bulk alloc API before
this patch has room for improvement from a performance and ease-of-use
perspective, as most existing callers of the page bulk alloc API try to
bulk-allocate pages for the whole array sequentially.
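
To illustrate, a caller-side loop under the semantics proposed by this
patch would look roughly like the sketch below (illustrative only, not
taken from the patch; 'gfp', 'nr_pages' and 'page_array' are
placeholder names):

	unsigned long filled = 0;

	while (filled < nr_pages) {
		unsigned long alloc;

		/* Ask only for the remainder, starting at the first
		 * unpopulated slot, so the allocator never needs to
		 * scan the array for NULL entries. */
		alloc = alloc_pages_bulk(gfp, nr_pages - filled,
					 page_array + filled);
		filled += alloc;

		/* A return of 0 can only mean allocation failure here,
		 * because the loop condition already excludes the
		 * "array is full" case. */
		if (!alloc)
			break;
	}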

> 
>>> The existing code returns nr_populated in both cases, so it doesn't
>>> matter why alloc_pages_bulk() returns with nr_populated != full, it
>>> is very clear that we still need to allocate more memory to fill it.
>>
>> I am not sure whether the array will be passed in with holes in the
>> middle for xfs, as mentioned above; if not, it seems to be a typical
>> use case like the one in mempolicy.c below:
>>
>> https://elixir.bootlin.com/linux/v6.14-rc1/source/mm/mempolicy.c#L2525
> 
> That's not "typical" usage. That is implementing a "try alloc" fast
> path that avoids memory reclaim, with a slow path fallback to fill
> the rest of the array when the fast path fails.
> 
> No other user of alloc_pages_bulk() is trying to do this.

What I meant by "typical" usage is the 'page_array + nr_allocated'
trick that avoids the NULL checking where the page bulk allocation API
is used in mm/mempolicy.c; most existing callers of the page bulk
allocation API elsewhere could likely be changed to do a similar trick,
as this patch does.
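
The fast-path/slow-path pattern there looks roughly like this
(paraphrased, not verbatim from mm/mempolicy.c; 'noreclaim_gfp' stands
for the gfp mask with __GFP_DIRECT_RECLAIM cleared):

	/* Fast path: try to fill the whole array without entering
	 * direct reclaim. */
	nr_allocated = alloc_pages_bulk(noreclaim_gfp, nr_pages,
					page_array);

	/* Slow path: allocate only the remainder, continuing at the
	 * first unpopulated slot instead of re-scanning the array for
	 * NULL entries. */
	if (nr_allocated < nr_pages)
		nr_allocated += alloc_pages_bulk(gfp,
						 nr_pages - nr_allocated,
						 page_array + nr_allocated);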

> 
> Indeed, it looks somewhat pointless to do this here (i.e. premature
> optimisation!), because the only caller of
> alloc_pages_bulk_mempolicy_noprof() has its own fallback slowpath
> for when alloc_pages_bulk() can't fill the entire request.
> 
>>> IOWs, you just demonstrated why the existing API is more desirable
>>> than a highly constrained, slightly faster API that requires callers
>>> to get every detail right. i.e. it's hard to get it wrong with the
>>> existing API, yet it's so easy to make mistakes with the proposed
>>> API that the patch proposing the change has serious bugs in it.
>>
>> IMHO, if the API is about refilling pages for only the NULL elements,
>> it seems better to add an API like refill_pages_bulk() for that, as
>> the current API seems prone to the error of not initializing the
>> array to zero before calling alloc_pages_bulk().
> 
> How is requiring a well defined initial state for API parameters
> "error prone"?  What code is failing to do the well known, defined
> initialisation before calling alloc_pages_bulk()?
> 
> Allowing uninitialised structures in an API (i.e. unknown initial
> conditions) means we cannot make assumptions about the structure
> contents within the API implementation.  We cannot assume that all
> variables are zero on the first use, nor can we assume that anything
> that is zero has a valid state.

It seems the above is the main difference in how we see the API: I see
the array as a pure output parameter, while you seem to treat it as
both an input and an output parameter.

The related kmem_cache_alloc_bulk() API seems to treat its array as an
output parameter too, as this patch does. The difference is that, when
there is not enough memory, kmem_cache_alloc_bulk() frees whatever it
has already allocated and returns 0 to the caller, while with this
patch the already-allocated pages are returned to the caller even when
there is not enough memory.
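
In sketch form (illustrative only; 'cache' is a hypothetical
kmem_cache), the kmem_cache_alloc_bulk() contract is all-or-nothing:

	void *objs[16];

	/* Either all ARRAY_SIZE(objs) slots are filled and the call
	 * returns that count, or it frees whatever it had managed to
	 * allocate and returns 0, so the caller never has to deal
	 * with a partially populated array. */
	if (!kmem_cache_alloc_bulk(cache, GFP_KERNEL, ARRAY_SIZE(objs), objs))
		return -ENOMEM;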

> 
> Again, this is poor API design - structures passed to interfaces
> -should- have a well defined initial state, either set by a *_init()
> function or by defining the initial state to be all zeros (i.e. via
> memset, kzalloc, etc).
> 
> Performance and speed is not an excuse for writing fragile, easy to
> break code and APIs.
> 
> -Dave.
