Message-ID: <ZoIUrmB2Jc1KK9Tv@dread.disaster.area>
Date: Mon, 1 Jul 2024 12:30:06 +1000
From: Dave Chinner <david@...morbit.com>
To: Zizhi Wo <wozizhi@...wei.com>
Cc: chandan.babu@...cle.com, djwong@...nel.org, dchinner@...hat.com,
	linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
	yangerkun@...wei.com
Subject: Re: [PATCH V3] xfs: Avoid races with cnt_btree lastrec updates

On Tue, Jun 25, 2024 at 09:46:51AM +0800, Zizhi Wo wrote:
> Concurrent file creation and small writes can unexpectedly fail with
> -ENOSPC because there is a race window in which the allocator can read
> a transiently wrong agf->agf_longest.
> 
> Write file process steps:
> 1) Find the entry that best meets the conditions, then calculate the start
>    address and length of the remaining part of the entry after allocation.
> 2) Delete this entry and update the -current- agf->agf_longest.
> 3) Insert the remaining unused parts of this entry based on the
>    calculations in 1), and update the agf->agf_longest again if necessary.
> 
> Create file process steps:
> 1) Check whether there are free inodes in the inode chunk.
> 2) If there is no free inode, check whether there is space for creating
>    inode chunks, performing the check without the lock first.
> 3) If the unlocked check succeeds, the check is performed again with the
>    agf lock held. Otherwise, an error is returned directly.
> 
> If the write process has completed step 2) but not yet reached step 3)
> when the create file process reaches its step 2), the create process may
> wrongly conclude that there is no space, so file creation fails even
> though the file system still has free space.
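
The window described above can be modelled in userspace. This is only a
sketch under stated assumptions: the model_* names, the fixed-size array
standing in for the cntbt, and the single cached "longest" field are all
hypothetical and are not the XFS code. It shows that the cached value is
briefly too small between the delete and the re-insert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: free extent lengths kept in ascending order. */
#define MODEL_MAX_RECS	8

struct model_ag {
	uint32_t	len[MODEL_MAX_RECS];	/* stands in for the cntbt */
	size_t		nrecs;
	uint32_t	longest;	/* stands in for agf->agf_longest */
};

/* Refresh the cached longest from the rightmost (largest) record. */
static void model_update_longest(struct model_ag *ag)
{
	ag->longest = ag->nrecs ? ag->len[ag->nrecs - 1] : 0;
}

/* Write step 2: delete the rightmost record and update the cache. */
static uint32_t model_delete_last(struct model_ag *ag)
{
	uint32_t	len;

	assert(ag->nrecs > 0);
	len = ag->len[--ag->nrecs];
	model_update_longest(ag);
	return len;
}

/* Write step 3: insert the remainder in sorted position, update again. */
static void model_insert(struct model_ag *ag, uint32_t len)
{
	size_t		i = ag->nrecs;

	assert(ag->nrecs < MODEL_MAX_RECS);
	while (i > 0 && ag->len[i - 1] > len) {
		ag->len[i] = ag->len[i - 1];
		i--;
	}
	ag->len[i] = len;
	ag->nrecs++;
	model_update_longest(ag);
}

/* Create step 2: the unlocked check against the cached longest. */
static int model_has_space(const struct model_ag *ag, uint32_t need)
{
	return ag->longest >= need;
}
```

With records {4, 100}, calling model_delete_last() and then
model_has_space(&ag, 64) before model_insert() reports no space, even
though 92 blocks are about to be put back.
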
> 
> We have sent two different patches to the community to fix this
> problem[1][2]. Unfortunately, both solutions have flaws. While discussing
> [2] with Dave and Darrick, I realized that a better solution to this
> problem requires the "last cnt record tracking" to be ripped out of the
> generic btree code. And surprisingly, Dave directly provided his fix code.
> This patch includes appropriate modifications based on his tmp-code to
> address this issue.
> 
> The entire fix can be roughly divided into two parts:
> 1) Delete the code related to lastrec-update in the generic btree code.
> 2) Move the update of the longest free extent from the cntbt into a
>    separate step at the end of the cntbt modifications. First move the
>    cursor to the rightmost record, then update the longest free extent
>    based on that record.
> 
> Note that we cannot update the longest with xfs_alloc_get_rec() after
> finding the longest record, as xfs_verify_agbno() may not pass because
> pag->block_count is updated on the outside. Therefore, use
> xfs_btree_get_rec() as a replacement.
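
The second part of the fix can be sketched in userspace as follows. All
sketch_* names and the array standing in for the cntbt are hypothetical,
and for brevity the sketch always allocates from the longest record
rather than doing a best-fit search. The point is only that the cached
longest is written once, after both the delete and the re-insert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: free extent lengths kept in ascending order. */
#define SKETCH_MAX_RECS	8

struct sketch_ag {
	uint32_t	len[SKETCH_MAX_RECS];	/* stands in for the cntbt */
	size_t		nrecs;
	uint32_t	longest;	/* stands in for agf->agf_longest */
};

/* Length of the rightmost record, or 0 for an empty tree. */
static uint32_t sketch_rightmost(const struct sketch_ag *ag)
{
	return ag->nrecs ? ag->len[ag->nrecs - 1] : 0;
}

/* Allocate 'want' blocks from the longest free extent. */
static void sketch_alloc(struct sketch_ag *ag, uint32_t want)
{
	uint32_t	whole, rest;
	size_t		i;

	assert(ag->nrecs > 0 && ag->len[ag->nrecs - 1] >= want);

	/* Delete the record; the cached longest is not touched here. */
	whole = ag->len[--ag->nrecs];
	rest = whole - want;

	/* Re-insert the unused remainder in sorted position, if any. */
	if (rest) {
		i = ag->nrecs;
		while (i > 0 && ag->len[i - 1] > rest) {
			ag->len[i] = ag->len[i - 1];
			i--;
		}
		ag->len[i] = rest;
		ag->nrecs++;
	}

	/* Single update at the end, taken from the rightmost record. */
	ag->longest = sketch_rightmost(ag);
}
```

An unlocked reader of the cached value in this model sees either the
value from before the allocation or the final one, never the
intermediate state in which the record has been deleted but the
remainder is not yet back in the tree.
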
> 
> [1] https://lore.kernel.org/all/20240419061848.1032366-2-yebin10@huawei.com
> [2] https://lore.kernel.org/all/20240604071121.3981686-1-wozizhi@huawei.com
> 
> Reported-by: Ye Bin <yebin10@...wei.com>
> Signed-off-by: Zizhi Wo <wozizhi@...wei.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c       | 115 ++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_alloc_btree.c |  64 ------------------
>  fs/xfs/libxfs/xfs_btree.c       |  51 --------------
>  fs/xfs/libxfs/xfs_btree.h       |  16 +----
>  4 files changed, 116 insertions(+), 130 deletions(-)

Mostly looks good. One small thing to fix, though.

> +/*
> + * Find the rightmost record of the cntbt, and return the longest free space
> + * recorded in it. Simply set both the block number and the length to their
> + * maximum values before searching.
> + */
> +static int
> +xfs_cntbt_longest(
> +	struct xfs_btree_cur	*cnt_cur,
> +	xfs_extlen_t		*longest)
> +{
> +	struct xfs_alloc_rec_incore irec;
> +	union xfs_btree_rec	    *rec;
> +	int			    stat = 0;
> +	int			    error;
> +
> +	memset(&cnt_cur->bc_rec, 0xFF, sizeof(cnt_cur->bc_rec));
> +	error = xfs_btree_lookup(cnt_cur, XFS_LOOKUP_LE, &stat);
> +	if (error)
> +		return error;
> +	if (!stat) {
> +		/* totally empty tree */
> +		*longest = 0;
> +		return 0;
> +	}
> +
> +	error = xfs_btree_get_rec(cnt_cur, &rec, &stat);
> +	if (error)
> +		return error;
> +	if (!stat) {
> +		ASSERT(0);
> +		*longest = 0;
> +		return 0;

If we don't find a record, some kind of btree corruption has been
encountered. Rather than "ASSERT(0)" here, this should fail in
production systems in a way that admins and online repair will
notice:

	if (XFS_IS_CORRUPT(cnt_cur->bc_mp, stat != 1)) {
		xfs_btree_mark_sick(cnt_cur);
		return -EFSCORRUPTED;
	}
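
To illustrate the difference in behaviour, here is a rough userspace
stand-in for that pattern. Nothing in it is kernel API: the sketch_*
names, the cursor struct and the errno value are hypothetical
placeholders. It treats a missing record after a successful lookup as
corruption, marks the cursor sick, and returns an error instead of
silently reporting a longest extent of zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in value for illustration only; the kernel defines its own. */
#define SKETCH_EFSCORRUPTED	117

/* Hypothetical cursor: only tracks whether it was marked sick. */
struct sketch_cursor {
	bool	sick;
};

/* Report the failed expectation and pass the condition through. */
static bool sketch_is_corrupt(bool cond, const char *expr)
{
	if (cond)
		fprintf(stderr, "corruption detected: %s\n", expr);
	return cond;
}

/*
 * 'stat' is what the record fetch reported (1 = record found) and
 * 'rec_len' is the length of that record when one was found.
 */
static int sketch_get_longest(struct sketch_cursor *cur, int stat,
			      unsigned int rec_len, unsigned int *longest)
{
	assert(cur && longest);

	if (sketch_is_corrupt(stat != 1, "stat != 1")) {
		cur->sick = true;	/* visible to later health checks */
		return -SKETCH_EFSCORRUPTED;
	}

	*longest = rec_len;
	return 0;
}
```

Unlike ASSERT(0), which only fires on debug builds, the failure path in
this sketch is always taken and leaves *longest untouched for the caller.
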

-Dave.
-- 
Dave Chinner
david@...morbit.com