[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZoIUrmB2Jc1KK9Tv@dread.disaster.area>
Date: Mon, 1 Jul 2024 12:30:06 +1000
From: Dave Chinner <david@...morbit.com>
To: Zizhi Wo <wozizhi@...wei.com>
Cc: chandan.babu@...cle.com, djwong@...nel.org, dchinner@...hat.com,
linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
yangerkun@...wei.com
Subject: Re: [PATCH V3] xfs: Avoid races with cnt_btree lastrec updates
On Tue, Jun 25, 2024 at 09:46:51AM +0800, Zizhi Wo wrote:
> A concurrent file creation and little writing could unexpectedly return
> -ENOSPC error since there is a race window that the allocator could get
> the wrong agf->agf_longest.
>
> Write file process steps:
> 1) Find the entry that best meets the conditions, then calculate the start
> address and length of the remaining part of the entry after allocation.
> 2) Delete this entry and update the -current- agf->agf_longest.
> 3) Insert the remaining unused parts of this entry based on the
> calculations in 1), and update the agf->agf_longest again if necessary.
>
> Create file process steps:
> 1) Check whether there are free inodes in the inode chunk.
> 2) If there is no free inode, check whether there has space for creating
> inode chunks, perform the no-lock judgment first.
> 3) If the judgment succeeds, the judgment is performed again with agf lock
> held. Otherwire, an error is returned directly.
>
> If the write process is in step 2) but not go to 3) yet, the create file
> process goes to 2) at this time, it may be mistaken for no space,
> resulting in the file system still has space but the file creation fails.
>
> We have sent two different commits to the community in order to fix this
> problem[1][2]. Unfortunately, both solutions have flaws. In [2], I
> discussed with Dave and Darrick, realized that a better solution to this
> problem requires the "last cnt record tracking" to be ripped out of the
> generic btree code. And surprisingly, Dave directly provided his fix code.
> This patch includes appropriate modifications based on his tmp-code to
> address this issue.
>
> The entire fix can be roughly divided into two parts:
> 1) Delete the code related to lastrec-update in the generic btree code.
> 2) Place the process of updating longest freespace with cntbt separately
> to the end of the cntbt modifications. Move the cursor to the rightmost
> firstly, and update the longest free extent based on the record.
>
> Note that we can not update the longest with xfs_alloc_get_rec() after
> find the longest record, as xfs_verify_agbno() may not pass because
> pag->block_count is updated on the outside. Therefore, use
> xfs_btree_get_rec() as a replacement.
>
> [1] https://lore.kernel.org/all/20240419061848.1032366-2-yebin10@huawei.com
> [2] https://lore.kernel.org/all/20240604071121.3981686-1-wozizhi@huawei.com
>
> Reported by: Ye Bin <yebin10@...wei.com>
> Signed-off-by: Zizhi Wo <wozizhi@...wei.com>
> ---
> fs/xfs/libxfs/xfs_alloc.c | 115 ++++++++++++++++++++++++++++++++
> fs/xfs/libxfs/xfs_alloc_btree.c | 64 ------------------
> fs/xfs/libxfs/xfs_btree.c | 51 --------------
> fs/xfs/libxfs/xfs_btree.h | 16 +----
> 4 files changed, 116 insertions(+), 130 deletions(-)
Mostly looks good. One small thing to fix, though.
> +/*
> + * Find the rightmost record of the cntbt, and return the longest free space
> + * recorded in it. Simply set both the block number and the length to their
> + * maximum values before searching.
> + */
> +static int
> +xfs_cntbt_longest(
> + struct xfs_btree_cur *cnt_cur,
> + xfs_extlen_t *longest)
> +{
> + struct xfs_alloc_rec_incore irec;
> + union xfs_btree_rec *rec;
> + int stat = 0;
> + int error;
> +
> + memset(&cnt_cur->bc_rec, 0xFF, sizeof(cnt_cur->bc_rec));
> + error = xfs_btree_lookup(cnt_cur, XFS_LOOKUP_LE, &stat);
> + if (error)
> + return error;
> + if (!stat) {
> + /* totally empty tree */
> + *longest = 0;
> + return 0;
> + }
> +
> + error = xfs_btree_get_rec(cnt_cur, &rec, &stat);
> + if (error)
> + return error;
> + if (!stat) {
> + ASSERT(0);
> + *longest = 0;
> + return 0;
If we don't find a record, some kind of btree corruption has been
encountered. Rather than "ASSERT(0)" here, this should fail in
production systems in a way that admins and online repair will
notice:
if (XFS_IS_CORRUPT(mp, stat != 0)) {
xfs_btree_mark_sick(cnt_cur);
return -EFSCORRUPTED;
}
-Dave.
--
Dave Chinner
david@...morbit.com
Powered by blists - more mailing lists