lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 5 Jun 2024 11:38:18 +0800
From: Zizhi Wo <wozizhi@...wei.com>
To: "Darrick J. Wong" <djwong@...nel.org>
CC: <chandan.babu@...cle.com>, <dchinner@...hat.com>,
	<linux-xfs@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<yangerkun@...wei.com>
Subject: Re: [PATCH] xfs: Fix file creation failure



在 2024/6/4 23:56, Darrick J. Wong 写道:
> On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote:
>> We have an xfs image that contains only 2 AGs, the first AG is full and
>> the second AG is empty, then a concurrent file creation and little writing
>> could unexpectedly return -ENOSPC error since there is a race window that
>> the allocator could get the wrong agf->agf_longest.
>>
>> Write file process steps:
>> 1) Find the entry that best meets the conditions, then calculate the start
>> address and length of the remaining part of the entry after allocation.
>> 2) Delete this entry. Because the second AG is empty, the btree in its agf
>> has only one record, and agf->agf_longest will be set to 0 after deletion.
>> 3) Insert the remaining unused parts of this entry based on the
>> calculations in 1), and update the agf->agf_longest.
>>
>> Create file process steps:
>> 1) Check whether there are free inodes in the inode chunk.
>> 2) If there is no free inode, check whether there has space for creating
>> inode chunks, perform the no-lock judgment first.
>> 3) If the judgment succeeds, the judgment is performed again with agf lock
>> held. Otherwire, an error is returned directly.
>>
>> If the write process is in step 2) but not go to 3) yet, the create file
>> process goes to 2) at this time, it will be mistaken for no space,
>> resulting in the file system still has space but the file creation fails.
>>
>> 	Direct write				Create file
>> xfs_file_write_iter
>>   ...
>>   xfs_direct_write_iomap_begin
>>    xfs_iomap_write_direct
>>     ...
>>     xfs_alloc_ag_vextent_near
>>      xfs_alloc_cur_finish
>>       xfs_alloc_fixup_trees
>>        xfs_btree_delete
>>         xfs_btree_delrec
>> 	xfs_allocbt_update_lastrec
>> 	// longest = 0 because numrec == 0.
>> 	 agf->agf_longest = len = 0
>> 					   xfs_create
>> 					    ...
>> 					     xfs_dialloc
>> 					      ...
>> 					      xfs_alloc_fix_freelist
>> 					       xfs_alloc_space_available
>> 					-> as longest=0, it will return
>> 					false, no space for inode alloc.
>>
>> Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t
>> structure to store the potential longest count that will be updated. The
>> assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent().
> 
> This is going to be a reverse-order review due to the way that diff
> ordered the chunks, which means that the bulk of my questions are at the
> end.
> 
>> Reported by: Ye Bin <yebin10@...wei.com>
>> Signed-off-by: Zizhi Wo <wozizhi@...wei.com>
>> ---
>>   fs/xfs/libxfs/xfs_alloc.c       | 14 ++++++++++++++
>>   fs/xfs/libxfs/xfs_alloc_btree.c |  9 ++++++++-
>>   fs/xfs/libxfs/xfs_btree.h       |  1 +
>>   3 files changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>> index 6c55a6e88eba..86ba873d57a8 100644
>> --- a/fs/xfs/libxfs/xfs_alloc.c
>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>> @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees(
>>   		nfbno2 = rbno + rlen;
>>   		nflen2 = (fbno + flen) - nfbno2;
>>   	}
>> +
>> +	/*
>> +	 * Record the potential maximum free length in advance.
>> +	 */
>> +	if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK)
>> +		cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2);
> 
> Ok, so if we're allocating space then this sets bc_free_longest to the
> longer of the two remaining sections, if any.  But if we just allocated
> the entirety of the longest extent in the cntbt, then we don't set
> bc_free_longest, which means its zero, right?  I guess that's ok because
> that implies there's zero space left in the AG, so the longest freespace
> is indeed zero.
> 
> If we just allocated the entirety of a non-longest extent in the cntbt
> then we don't call ->lastrec_update so the value of bc_free_longest
> doesn't matter?

Yes, absolutely right! Thank you!φ(゜▽゜*)♪

> 
>> +
>>   	/*
>>   	 * Delete the entry from the by-size btree.
>>   	 */
>> @@ -2044,6 +2051,13 @@ xfs_free_ag_extent(
>>   	 * Now allocate and initialize a cursor for the by-size tree.
>>   	 */
>>   	cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
>> +	/*
>> +	 * Record the potential maximum free length in advance.
>> +	 */
>> +	if (haveleft)
>> +		cnt_cur->bc_ag.bc_free_longest = ltlen;
>> +	if (haveright)
>> +		cnt_cur->bc_ag.bc_free_longest = gtlen;
> 
> What happens in the haveleft && haveright case?  Shouldn't
> bc_free_longest be set to ltlen + len + gtlen?  You could just push the
> setting of bc_free_longest into the haveleft/haveright code below.

Oh, as I wrote to Dave, the logic I considered here is that the record
is less than or equal to the maximum value. And no need to worry about
that because it will soon be updated in xfs_btree_insert in the problem
triggering scenario.

> 
>>   	/*
>>   	 * Have both left and right contiguous neighbors.
>>   	 * Merge all three into a single free block.
>> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
>> index 6ef5ddd89600..8e7d1e0f1a63 100644
>> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
>> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
>> @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec(
>>   			rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
>>   			len = rrp->ar_blockcount;
>>   		} else {
>> -			len = 0;
>> +			/*
>> +			 * Update in advance to prevent file creation failure
>> +			 * for concurrent processes even though there is no
>> +			 * numrec currently.
>> +			 * And there's no need to worry as the value that no
>> +			 * less than bc_free_longest will be inserted later.
>> +			 */
>> +			len = cpu_to_be32(cur->bc_ag.bc_free_longest);
> 
> Humm.  In this case, we've called ->update_lastrec on the cntbt cursor
> having deleted all the records in this record block.  Presumably that
> means that we're going to add rec->alloc.ar_blockcount blocks to the
> rightmost record in the left sibling of @block?  Or already have?
> 

In normal delete operations, cntbt will have a balancing process, moving
data from other nodes or merging to ensure that numrecs >= get_minrecs.
In this scenario, the cntbt is already an -empty- tree, and is in a
temporary state, new values will be inserted later.

> Ahh, right, the pagf_longest checks are done without holding AGF lock.
> The concurrent creat call sees this intermediate state (DELREC sets
> pagf_longest to zero, a moment later INSREC/UPDATE set it to the correct
> nonzero value) and decides to ENOSPC because "nobody" has sufficient
> free space.
> 
> I think this phony zero never gets written to disk because although
> we're logging zero into the ondisk and incore agf_longest here, the next
> btree operation will reset it to the correct value.  Right?

Yes, this phony zero will not be recorded to disk. It is just a
temporary condition.

> 
> Would it be simpler to handle this case by duplicating the cntbt cursor
> and walking one record leftward in the tree to find the longest extent,
> rather than using this "bc_free_longest" variable?
> 

In my opinion, this does not solve the problem. As mentioned above, at
this point the cntbt is an -empty- tree with no records.

>>   		}
>>   
>>   		break;
>> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
>> index f93374278aa1..985b1885a643 100644
>> --- a/fs/xfs/libxfs/xfs_btree.h
>> +++ b/fs/xfs/libxfs/xfs_btree.h
>> @@ -281,6 +281,7 @@ struct xfs_btree_cur
>>   			struct xfs_perag	*pag;
>>   			struct xfs_buf		*agbp;
>>   			struct xbtree_afakeroot	*afake;	/* for staging cursor */
>> +			xfs_extlen_t		bc_free_longest; /* potential longest free space */
> 
> This is only used for bnobt/cntbt trees, put it in the per-format
> private data area, please.
> 

OK, I will modify it. More specifically, it is only used for cntbt,
because currently only cntbt can set the XFS_BTGEO_LASTREC_UPDATE flag,
and can call ->update_lastrec.

> If the answer to the question about duplicating the btree cursor is "no"
> then I think this deserves a much longer comment that captures the fact
> that the variable exists to avoid setting pagf_longest to zero for
> benefit of the code that does unlocked scanning of AGs for free space.
> 
> I also wonder if the unlocked ag scan should just note that it observed
> a zero pagf_longest and if no space can be found across all the AGs, to
> try again with locks?
> 
> --D

Currently xfs checks the space using the "check-lock-check again"
algorithm, which I understand to be more efficient. If there are a large
number of AG's and there is no free space in front of them, the
performance may be affected by checking the lock again. So I think
targeting specific AG might be more effective. Although the current code
process has a retry mechanism (in xfs_dialloc), it still can't
completely solve the problem: for example, there is no space for the
first scan, and the second scan has space but the longest is 0 in the
temporary state and return -ENOSPC, etc...

Thanks,
Zizhi Wo

> 
>>   		} bc_ag;
>>   		struct {
>>   			struct xfbtree		*xfbtree;
>> -- 
>> 2.39.2
>>
>>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ