[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f42d35b8-7c04-4d69-b6fe-07b9e149d262@huawei.com>
Date: Wed, 5 Jun 2024 11:38:18 +0800
From: Zizhi Wo <wozizhi@...wei.com>
To: "Darrick J. Wong" <djwong@...nel.org>
CC: <chandan.babu@...cle.com>, <dchinner@...hat.com>,
<linux-xfs@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<yangerkun@...wei.com>
Subject: Re: [PATCH] xfs: Fix file creation failure
在 2024/6/4 23:56, Darrick J. Wong 写道:
> On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote:
>> We have an xfs image that contains only 2 AGs, the first AG is full and
>> the second AG is empty, then a concurrent file creation and little writing
>> could unexpectedly return -ENOSPC error since there is a race window that
>> the allocator could get the wrong agf->agf_longest.
>>
>> Write file process steps:
>> 1) Find the entry that best meets the conditions, then calculate the start
>> address and length of the remaining part of the entry after allocation.
>> 2) Delete this entry. Because the second AG is empty, the btree in its agf
>> has only one record, and agf->agf_longest will be set to 0 after deletion.
>> 3) Insert the remaining unused parts of this entry based on the
>> calculations in 1), and update the agf->agf_longest.
>>
>> Create file process steps:
>> 1) Check whether there are free inodes in the inode chunk.
>> 2) If there is no free inode, check whether there has space for creating
>> inode chunks, perform the no-lock judgment first.
>> 3) If the judgment succeeds, the judgment is performed again with agf lock
>> held. Otherwire, an error is returned directly.
>>
>> If the write process is in step 2) but not go to 3) yet, the create file
>> process goes to 2) at this time, it will be mistaken for no space,
>> resulting in the file system still has space but the file creation fails.
>>
>> Direct write Create file
>> xfs_file_write_iter
>> ...
>> xfs_direct_write_iomap_begin
>> xfs_iomap_write_direct
>> ...
>> xfs_alloc_ag_vextent_near
>> xfs_alloc_cur_finish
>> xfs_alloc_fixup_trees
>> xfs_btree_delete
>> xfs_btree_delrec
>> xfs_allocbt_update_lastrec
>> // longest = 0 because numrec == 0.
>> agf->agf_longest = len = 0
>> xfs_create
>> ...
>> xfs_dialloc
>> ...
>> xfs_alloc_fix_freelist
>> xfs_alloc_space_available
>> -> as longest=0, it will return
>> false, no space for inode alloc.
>>
>> Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t
>> structure to store the potential longest count that will be updated. The
>> assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent().
>
> This is going to be a reverse-order review due to the way that diff
> ordered the chunks, which means that the bulk of my questions are at the
> end.
>
>> Reported by: Ye Bin <yebin10@...wei.com>
>> Signed-off-by: Zizhi Wo <wozizhi@...wei.com>
>> ---
>> fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++
>> fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++-
>> fs/xfs/libxfs/xfs_btree.h | 1 +
>> 3 files changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>> index 6c55a6e88eba..86ba873d57a8 100644
>> --- a/fs/xfs/libxfs/xfs_alloc.c
>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>> @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees(
>> nfbno2 = rbno + rlen;
>> nflen2 = (fbno + flen) - nfbno2;
>> }
>> +
>> + /*
>> + * Record the potential maximum free length in advance.
>> + */
>> + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK)
>> + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2);
>
> Ok, so if we're allocating space then this sets bc_free_longest to the
> longer of the two remaining sections, if any. But if we just allocated
> the entirety of the longest extent in the cntbt, then we don't set
> bc_free_longest, which means its zero, right? I guess that's ok because
> that implies there's zero space left in the AG, so the longest freespace
> is indeed zero.
>
> If we just allocated the entirety of a non-longest extent in the cntbt
> then we don't call ->lastrec_update so the value of bc_free_longest
> doesn't matter?
Yes, absolutely right! Thank you!φ(゜▽゜*)♪
>
>> +
>> /*
>> * Delete the entry from the by-size btree.
>> */
>> @@ -2044,6 +2051,13 @@ xfs_free_ag_extent(
>> * Now allocate and initialize a cursor for the by-size tree.
>> */
>> cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
>> + /*
>> + * Record the potential maximum free length in advance.
>> + */
>> + if (haveleft)
>> + cnt_cur->bc_ag.bc_free_longest = ltlen;
>> + if (haveright)
>> + cnt_cur->bc_ag.bc_free_longest = gtlen;
>
> What happens in the haveleft && haveright case? Shouldn't
> bc_free_longest be set to ltlen + len + gtlen? You could just push the
> setting of bc_free_longest into the haveleft/haveright code below.
Oh, as I wrote to Dave, the logic I considered here is that the record
is less than or equal to the maximum value. And no need to worry about
that because it will soon be updated in xfs_btree_insert in the problem
triggering scenario.
>
>> /*
>> * Have both left and right contiguous neighbors.
>> * Merge all three into a single free block.
>> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
>> index 6ef5ddd89600..8e7d1e0f1a63 100644
>> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
>> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
>> @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec(
>> rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
>> len = rrp->ar_blockcount;
>> } else {
>> - len = 0;
>> + /*
>> + * Update in advance to prevent file creation failure
>> + * for concurrent processes even though there is no
>> + * numrec currently.
>> + * And there's no need to worry as the value that no
>> + * less than bc_free_longest will be inserted later.
>> + */
>> + len = cpu_to_be32(cur->bc_ag.bc_free_longest);
>
> Humm. In this case, we've called ->update_lastrec on the cntbt cursor
> having deleted all the records in this record block. Presumably that
> means that we're going to add rec->alloc.ar_blockcount blocks to the
> rightmost record in the left sibling of @block? Or already have?
>
In normal delete operations, cntbt will have a balancing process, moving
data from other nodes or merging to ensure that numrecs >= get_minrecs.
In this scenario, the cntbt is already an -empty- tree, and is in a
temporary state, new values will be inserted later.
> Ahh, right, the pagf_longest checks are done without holding AGF lock.
> The concurrent creat call sees this intermediate state (DELREC sets
> pagf_longest to zero, a moment later INSREC/UPDATE set it to the correct
> nonzero value) and decides to ENOSPC because "nobody" has sufficient
> free space.
>
> I think this phony zero never gets written to disk because although
> we're logging zero into the ondisk and incore agf_longest here, the next
> btree operation will reset it to the correct value. Right?
Yes, this phony zero will not be recorded to disk. It is just a
temporary condition.
>
> Would it be simpler to handle this case by duplicating the cntbt cursor
> and walking one record leftward in the tree to find the longest extent,
> rather than using this "bc_free_longest" variable?
>
In my opinion, this does not solve the problem. As mentioned above, at
this point the cntbt is an -empty- tree with no records.
>> }
>>
>> break;
>> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
>> index f93374278aa1..985b1885a643 100644
>> --- a/fs/xfs/libxfs/xfs_btree.h
>> +++ b/fs/xfs/libxfs/xfs_btree.h
>> @@ -281,6 +281,7 @@ struct xfs_btree_cur
>> struct xfs_perag *pag;
>> struct xfs_buf *agbp;
>> struct xbtree_afakeroot *afake; /* for staging cursor */
>> + xfs_extlen_t bc_free_longest; /* potential longest free space */
>
> This is only used for bnobt/cntbt trees, put it in the per-format
> private data area, please.
>
OK, I will modify it. More specifically, it is only used for cntbt,
because currently only cntbt can set the XFS_BTGEO_LASTREC_UPDATE flag,
and can call ->update_lastrec.
> If the answer to the question about duplicating the btree cursor is "no"
> then I think this deserves a much longer comment that captures the fact
> that the variable exists to avoid setting pagf_longest to zero for
> benefit of the code that does unlocked scanning of AGs for free space.
>
> I also wonder if the unlocked ag scan should just note that it observed
> a zero pagf_longest and if no space can be found across all the AGs, to
> try again with locks?
>
> --D
Currently xfs checks the space using the "check-lock-check again"
algorithm, which I understand to be more efficient. If there are a large
number of AG's and there is no free space in front of them, the
performance may be affected by checking the lock again. So I think
targeting specific AG might be more effective. Although the current code
process has a retry mechanism (in xfs_dialloc), it still can't
completely solve the problem: for example, there is no space for the
first scan, and the second scan has space but the longest is 0 in the
temporary state and return -ENOSPC, etc...
Thanks,
Zizhi Wo
>
>> } bc_ag;
>> struct {
>> struct xfbtree *xfbtree;
>> --
>> 2.39.2
>>
>>
Powered by blists - more mailing lists