linux-kernel - Re: [PATCH V3] xfs: Avoid races with cnt

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <3f2edec7-5be9-4bf4-bc34-f64072a61336@huawei.com>
Date: Mon, 1 Jul 2024 10:49:42 +0800
From: Zizhi Wo <wozizhi@...wei.com>
To: Dave Chinner <david@...morbit.com>
CC: <chandan.babu@...cle.com>, <djwong@...nel.org>, <dchinner@...hat.com>,
	<linux-xfs@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<yangerkun@...wei.com>
Subject: Re: [PATCH V3] xfs: Avoid races with cnt_btree lastrec updates



在 2024/7/1 10:30, Dave Chinner 写道:
> On Tue, Jun 25, 2024 at 09:46:51AM +0800, Zizhi Wo wrote:
>> A concurrent file creation and little writing could unexpectedly return
>> -ENOSPC error since there is a race window that the allocator could get
>> the wrong agf->agf_longest.
>>
>> Write file process steps:
>> 1) Find the entry that best meets the conditions, then calculate the start
>>     address and length of the remaining part of the entry after allocation.
>> 2) Delete this entry and update the -current- agf->agf_longest.
>> 3) Insert the remaining unused parts of this entry based on the
>>     calculations in 1), and update the agf->agf_longest again if necessary.
>>
>> Create file process steps:
>> 1) Check whether there are free inodes in the inode chunk.
>> 2) If there is no free inode, check whether there has space for creating
>>     inode chunks, perform the no-lock judgment first.
>> 3) If the judgment succeeds, the judgment is performed again with agf lock
>>     held. Otherwire, an error is returned directly.
>>
>> If the write process is in step 2) but not go to 3) yet, the create file
>> process goes to 2) at this time, it may be mistaken for no space,
>> resulting in the file system still has space but the file creation fails.
>>
>> We have sent two different commits to the community in order to fix this
>> problem[1][2]. Unfortunately, both solutions have flaws. In [2], I
>> discussed with Dave and Darrick, realized that a better solution to this
>> problem requires the "last cnt record tracking" to be ripped out of the
>> generic btree code. And surprisingly, Dave directly provided his fix code.
>> This patch includes appropriate modifications based on his tmp-code to
>> address this issue.
>>
>> The entire fix can be roughly divided into two parts:
>> 1) Delete the code related to lastrec-update in the generic btree code.
>> 2) Place the process of updating longest freespace with cntbt separately
>>     to the end of the cntbt modifications. Move the cursor to the rightmost
>>     firstly, and update the longest free extent based on the record.
>>
>> Note that we can not update the longest with xfs_alloc_get_rec() after
>> find the longest record, as xfs_verify_agbno() may not pass because
>> pag->block_count is updated on the outside. Therefore, use
>> xfs_btree_get_rec() as a replacement.
>>
>> [1] https://lore.kernel.org/all/20240419061848.1032366-2-yebin10@huawei.com
>> [2] https://lore.kernel.org/all/20240604071121.3981686-1-wozizhi@huawei.com
>>
>> Reported by: Ye Bin <yebin10@...wei.com>
>> Signed-off-by: Zizhi Wo <wozizhi@...wei.com>
>> ---
>>   fs/xfs/libxfs/xfs_alloc.c       | 115 ++++++++++++++++++++++++++++++++
>>   fs/xfs/libxfs/xfs_alloc_btree.c |  64 ------------------
>>   fs/xfs/libxfs/xfs_btree.c       |  51 --------------
>>   fs/xfs/libxfs/xfs_btree.h       |  16 +----
>>   4 files changed, 116 insertions(+), 130 deletions(-)
> 
> Mostly looks good. One small thing to fix, though.
> 
>> +/*
>> + * Find the rightmost record of the cntbt, and return the longest free space
>> + * recorded in it. Simply set both the block number and the length to their
>> + * maximum values before searching.
>> + */
>> +static int
>> +xfs_cntbt_longest(
>> +	struct xfs_btree_cur	*cnt_cur,
>> +	xfs_extlen_t		*longest)
>> +{
>> +	struct xfs_alloc_rec_incore irec;
>> +	union xfs_btree_rec	    *rec;
>> +	int			    stat = 0;
>> +	int			    error;
>> +
>> +	memset(&cnt_cur->bc_rec, 0xFF, sizeof(cnt_cur->bc_rec));
>> +	error = xfs_btree_lookup(cnt_cur, XFS_LOOKUP_LE, &stat);
>> +	if (error)
>> +		return error;
>> +	if (!stat) {
>> +		/* totally empty tree */
>> +		*longest = 0;
>> +		return 0;
>> +	}
>> +
>> +	error = xfs_btree_get_rec(cnt_cur, &rec, &stat);
>> +	if (error)
>> +		return error;
>> +	if (!stat) {
>> +		ASSERT(0);
>> +		*longest = 0;
>> +		return 0;
> 
> If we don't find a record, some kind of btree corruption has been
> encountered. Rather than "ASSERT(0)" here, this should fail in
> production systems in a way that admins and online repair will
> notice:
> 
> 	if (XFS_IS_CORRUPT(mp, stat != 0)) {
> 		xfs_btree_mark_sick(cnt_cur);
> 		return -EFSCORRUPTED;
> 	}
> 
> -Dave.

Yes, that seems more reasonable. I will send the V4 patch.

Thanks,
Zizhi Wo