[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <3f2edec7-5be9-4bf4-bc34-f64072a61336@huawei.com>
Date: Mon, 1 Jul 2024 10:49:42 +0800
From: Zizhi Wo <wozizhi@...wei.com>
To: Dave Chinner <david@...morbit.com>
CC: <chandan.babu@...cle.com>, <djwong@...nel.org>, <dchinner@...hat.com>,
<linux-xfs@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<yangerkun@...wei.com>
Subject: Re: [PATCH V3] xfs: Avoid races with cnt_btree lastrec updates
在 2024/7/1 10:30, Dave Chinner 写道:
> On Tue, Jun 25, 2024 at 09:46:51AM +0800, Zizhi Wo wrote:
>> A concurrent file creation and little writing could unexpectedly return
>> -ENOSPC error since there is a race window that the allocator could get
>> the wrong agf->agf_longest.
>>
>> Write file process steps:
>> 1) Find the entry that best meets the conditions, then calculate the start
>> address and length of the remaining part of the entry after allocation.
>> 2) Delete this entry and update the -current- agf->agf_longest.
>> 3) Insert the remaining unused parts of this entry based on the
>> calculations in 1), and update the agf->agf_longest again if necessary.
>>
>> Create file process steps:
>> 1) Check whether there are free inodes in the inode chunk.
>> 2) If there is no free inode, check whether there has space for creating
>> inode chunks, perform the no-lock judgment first.
>> 3) If the judgment succeeds, the judgment is performed again with agf lock
>> held. Otherwire, an error is returned directly.
>>
>> If the write process is in step 2) but not go to 3) yet, the create file
>> process goes to 2) at this time, it may be mistaken for no space,
>> resulting in the file system still has space but the file creation fails.
>>
>> We have sent two different commits to the community in order to fix this
>> problem[1][2]. Unfortunately, both solutions have flaws. In [2], I
>> discussed with Dave and Darrick, realized that a better solution to this
>> problem requires the "last cnt record tracking" to be ripped out of the
>> generic btree code. And surprisingly, Dave directly provided his fix code.
>> This patch includes appropriate modifications based on his tmp-code to
>> address this issue.
>>
>> The entire fix can be roughly divided into two parts:
>> 1) Delete the code related to lastrec-update in the generic btree code.
>> 2) Place the process of updating longest freespace with cntbt separately
>> to the end of the cntbt modifications. Move the cursor to the rightmost
>> firstly, and update the longest free extent based on the record.
>>
>> Note that we can not update the longest with xfs_alloc_get_rec() after
>> find the longest record, as xfs_verify_agbno() may not pass because
>> pag->block_count is updated on the outside. Therefore, use
>> xfs_btree_get_rec() as a replacement.
>>
>> [1] https://lore.kernel.org/all/20240419061848.1032366-2-yebin10@huawei.com
>> [2] https://lore.kernel.org/all/20240604071121.3981686-1-wozizhi@huawei.com
>>
>> Reported by: Ye Bin <yebin10@...wei.com>
>> Signed-off-by: Zizhi Wo <wozizhi@...wei.com>
>> ---
>> fs/xfs/libxfs/xfs_alloc.c | 115 ++++++++++++++++++++++++++++++++
>> fs/xfs/libxfs/xfs_alloc_btree.c | 64 ------------------
>> fs/xfs/libxfs/xfs_btree.c | 51 --------------
>> fs/xfs/libxfs/xfs_btree.h | 16 +----
>> 4 files changed, 116 insertions(+), 130 deletions(-)
>
> Mostly looks good. One small thing to fix, though.
>
>> +/*
>> + * Find the rightmost record of the cntbt, and return the longest free space
>> + * recorded in it. Simply set both the block number and the length to their
>> + * maximum values before searching.
>> + */
>> +static int
>> +xfs_cntbt_longest(
>> + struct xfs_btree_cur *cnt_cur,
>> + xfs_extlen_t *longest)
>> +{
>> + struct xfs_alloc_rec_incore irec;
>> + union xfs_btree_rec *rec;
>> + int stat = 0;
>> + int error;
>> +
>> + memset(&cnt_cur->bc_rec, 0xFF, sizeof(cnt_cur->bc_rec));
>> + error = xfs_btree_lookup(cnt_cur, XFS_LOOKUP_LE, &stat);
>> + if (error)
>> + return error;
>> + if (!stat) {
>> + /* totally empty tree */
>> + *longest = 0;
>> + return 0;
>> + }
>> +
>> + error = xfs_btree_get_rec(cnt_cur, &rec, &stat);
>> + if (error)
>> + return error;
>> + if (!stat) {
>> + ASSERT(0);
>> + *longest = 0;
>> + return 0;
>
> If we don't find a record, some kind of btree corruption has been
> encountered. Rather than "ASSERT(0)" here, this should fail in
> production systems in a way that admins and online repair will
> notice:
>
> if (XFS_IS_CORRUPT(mp, stat != 0)) {
> xfs_btree_mark_sick(cnt_cur);
> return -EFSCORRUPTED;
> }
>
> -Dave.
Yes, that seems more reasonable. I will send the V4 patch.
Thanks,
Zizhi Wo
Powered by blists - more mailing lists