[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20221116025106.GB3600936@dread.disaster.area>
Date: Wed, 16 Nov 2022 13:51:06 +1100
From: Dave Chinner <david@...morbit.com>
To: linux-xfs@...r.kernel.org, "Darrick J. Wong" <djwong@...nel.org>,
Dave Chinner <dchinner@...hat.com>,
Brian Foster <bfoster@...hat.com>,
LKML <linux-kernel@...r.kernel.org>,
Zirong Lang <zlang@...hat.com>
Subject: Re: [PATCH] xfs: account extra freespace btree splits for multiple
allocations
On Tue, Nov 15, 2022 at 03:57:54PM +0800, Gao Xiang wrote:
> On Sun, Nov 13, 2022 at 08:45:45AM +1100, Dave Chinner wrote:
> > On Sat, Nov 12, 2022 at 07:46:33AM +0800, Gao Xiang wrote:
> > > On Sat, Nov 12, 2022 at 07:39:05AM +1100, Dave Chinner wrote:
> > > > On Wed, Nov 09, 2022 at 11:48:02AM +0800, Gao Xiang wrote:
> > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > > > index 6261599bb389..684c67310175 100644
> > > > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > > > @@ -2630,7 +2630,12 @@ xfs_alloc_fix_freelist(
> > > > > goto out_agbp_relse;
> > > > > }
> > > > >
> > > > > - need = xfs_alloc_min_freelist(mp, pag);
> > > > > + /
> > > > > + * Also need to fulfill freespace btree splits by reservaing more
> > > > > + * blocks to perform multiple allocations from a single AG and
> > > > > + * transaction if needed.
> > > > > + */
> > > > > + need = xfs_alloc_min_freelist(mp, pag) * (1 + args->postallocs);
> > > > > if (!xfs_alloc_space_available(args, need, flags |
> > > > > XFS_ALLOC_FLAG_CHECK))
> > > > > goto out_agbp_relse;
> > > > > @@ -2654,7 +2659,7 @@ xfs_alloc_fix_freelist(
> > > > > xfs_agfl_reset(tp, agbp, pag);
> > > > >
> > > > > /* If there isn't enough total space or single-extent, reject it. */
> > > > > - need = xfs_alloc_min_freelist(mp, pag);
> > > > > + need = xfs_alloc_min_freelist(mp, pag) * (1 + args->postallocs);
> > > > > if (!xfs_alloc_space_available(args, need, flags))
> > > > > goto out_agbp_relse;
> > > > >
> > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> > > > > index 2c3f762dfb58..be7f15d6a40d 100644
> > > > > --- a/fs/xfs/libxfs/xfs_alloc.h
> > > > > +++ b/fs/xfs/libxfs/xfs_alloc.h
> > > > > @@ -73,6 +73,7 @@ typedef struct xfs_alloc_arg {
> > > > > int datatype; /* mask defining data type treatment */
> > > > > char wasdel; /* set if allocation was prev delayed */
> > > > > char wasfromfl; /* set if allocation is from freelist */
> > > > > + bool postallocs; /* number of post-allocations */
> > > > > struct xfs_owner_info oinfo; /* owner of blocks being allocated */
> > > > > enum xfs_ag_resv_type resv; /* block reservation to use */
> > > > > #ifdef DEBUG
> > > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > > > index 49d0d4ea63fc..ed92c6a314b6 100644
> > > > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > > > @@ -3497,6 +3497,7 @@ xfs_bmap_exact_minlen_extent_alloc(
> > > > > args.alignment = 1;
> > > > > args.minalignslop = 0;
> > > > >
> > > > > + args.postallocs = 1;
> > > > > args.minleft = ap->minleft;
> > > > > args.wasdel = ap->wasdel;
> > > > > args.resv = XFS_AG_RESV_NONE;
> > > > > @@ -3658,6 +3659,7 @@ xfs_bmap_btalloc(
> > > > > args.alignment = 1;
> > > > > args.minalignslop = 0;
> > > > > }
> > > > > + args.postallocs = 1;
> > > > > args.minleft = ap->minleft;
> > > > > args.wasdel = ap->wasdel;
> > > > > args.resv = XFS_AG_RESV_NONE;
> > > >
> > > > That's not going to work. What happens when we do a full bno
> > > > split? Or we do both a bno and a cnt split in the same allocation?
> > >
> > > I'm not sure if I got your point or not. I think it reserves another
> > > full splits in the first allocation by doing:
> > >
> > > need = xfs_alloc_min_freelist(mp, pag) * (1 + args->postallocs);
> > >
> > > as I wrote above.
> >
> > You're changing the BMBT reservation code. If the first "post-extent
> > BMBT block allocation" does a full split of both the bno/cnt trees,
> > then this uses all the AGFL reservations made.
>
> Emmm... I have to align my understanding of this first, I think one
> example of what you meant is
> 1. allocate an extent for an inode with minleft = 1;
> 2. then do extents-to-btree allocation with one block, even minleft
> was reserved as 1 in the previous allocation but such one-block
> allocation from non-AGFL can cause full bno/cnt splits, which
> could takes xfs_alloc_min_freelist() blocks from AGFL and could
> take up all AGFL blocks?
>
> If my understanding above is like what you said, I think the current
> codebase may also have a chance to eat up all AGFL blocks in the first
> allocation since more agfl blocks are only filled in
> xfs_alloc_fix_freelist(), but later xfs_alloc_ag_vextent() could
> cause full bno/cnt splits as well?
Yes, the second allocation here might only require 1 block, which is
what args->minleft says. But the problem is that nothing is
reserving AGFL blocks for those nested extent allocations...
..... because the assumption is that AGFL blocks come from free
space and so when we are at ENOSPC bno/cnt btrees *do no require
splits* so will not consume extra space. Hence allocation at ENOSPC
doesn't need to take into account AGFL block usage because the AGFL
will not be consumed.
Similarly, if we have enough free space records to split a free
space btree block, we have enough free space to refill the AGFL
multiple times and we don't have to reserve space for them.
IOWs, the allocation code has, historically, never had to care about
AGFL refilling when the AG is near ENOSPC as nothing will consume
AGFL blocks when the AG is near empty.
This is the design assumption that AG reservations broke. This is
why I'm asking you to look into taking blocks that are supposedly
reserved for the AGFL, because as reserved space is used, the
bno/cnt btrees will shrink and return those blocks to free space and
hence they are still available for reserved allocations to use as
the real physical ENOSPC condition approaches.
The more I look at this, the more I think overall answer to this
problem is to allow AGFL refilling to ignore AG reserves rather than
causing ENOSPC....
----
Regardless of the above, answers to the rest of you questions follow.
> Please help correct me if my understanding about your ask is wrong.
>
> >
> > How many blocks does a BMBT split need to allocate?
>
> IMO, a full bmbt split can allocate btree level blocks at maximum,
> but if these block allocation cause bno/cnt btree splits, that
> needs more than such blocks.
And how many individual allocations does that require?
> So I think that's why AGFL is needed
> for XFS. IOWs, that is to prepare enough blocks for bno/cnt splits
> to avoid cyclic dependency.
The AGFL is there to ensure any *one* space allocation succeeds.
> But I'm not sure if the current AGFL reservation works properly
> if multiple allocations must be succeeded in the same AG, see below..)
Right, it does not provide any guarantees across mutliple successive
allocations like an extent + BMBT split chain. That's what
args->minleft is supposed to provide.
However, it does provide the guarantee that when near ENOSPC,
bno/cnt splits and hence AGFL consumption will not occur, thereby
ensuring that if args->minleft is reserved correctly, operation
right up to ENOSPC will work correctly without AGFL reservations
because the AGFL will not be consumed.
Hence my comments above about the problem being the way AG
reservations moved ENOSPC from "AG physically empty" to "AG still
has thousands of free extents but remaining space unavailable to
user data allocation".
> > > > Regardless, I don't see anything wrong with the allocation setup -
> > > > it's telling the allocation code exactly what it needs for
> > > > subsequent BMBT block allocations to succeed (i.e. args->minleft).
> > >
> > > In the long term, I think the main point is that args->minleft doesn't
> > > have the exact meaning. I don't know how many blocks should be counted
> > > by args->minleft or other ways.
> >
> > args->minleft has an *exact* meaning - that the AG must have that
> > many blocks left available for potential btree record insertion
> > allocations after the initial extent is allocated. For inode fork
> > allocations, the BMBT blocks required is defined by
> > xfs_bmapi_minleft(). For inode chunk extent allocation and inobt
> > record insertion, it is defined by the pre-calculated
> > igeo->inobt_maxlevels variable.
> >
> > IOWs, this "postalloc" concept is redundant - minleft already
> > provides the maximum number of single block allocations that need to
> > have space reserved in the AG for the initial extent allocation to
> > succeed. i.e. the allocation setup is already taking into account
> > blocks needed for extra allocations within the AG, but that's not
> > being handled correctly by the AG allocation code.
>
> I don't think it's the case as I described in the patch commit message,
> if we go over the words at the top, the main point is
>
> In the first allocation, minleft = 1, the current allocator assumes
> it can allocate an extent with 27 blocks (the remaining blocks are
> 18276 per-AG reservation, 6 for AGFL reservation, 1 for inode extents
> -to-btree for the following allocation).
>
> But here in order to finish this allocation with 27 blocks, it splits
> cntbt so that it takes another unexpected block from AGFL, and
> that wasn't accounted in minleft (or with any other fields) before.
>
> I don't think it can be directly described by minleft because
> such extra bno/cntbt reservation needs more knowledge of bno/cntbt
> internals (such as current bno/cnt btree levels), so I don't think
> it should belong to BMBT allocation code at least.
>
> So here I introduced another variable to describe the total number
> of post-allocations, I think it's just enough to resolve the inode
> extents-to-btree bno/cntbt reservation issue.
extents-to-btree is the degenerate case of a btree split. It's
moving the in-inode extent block to a single btree root block - it's
the same case as having a multi-level BMBT and splitting a single
leaf block during an xfs_btree_insert() call. Both require a second
discrete allocation to be made in the same transaction from the same
AG.
But if that xfs_btree_insert() call triggers a multi-level btree
split, we've now got more than 1 "post allocation" allocation being
done - there's one allocation for every level that needs to have a
block split. To handle this, we'd need to set up this args.postalloc
variable with the number of allocations a btree split might require.
What I'm trying to tell you is that args->minleft is already
configured with exactly this number of blocks/post-allocations that
the btree split might require, and hence allow the allocation code
to select an AG with the right amount of space needed before it
starts.
> > On review, it is quite possible that args->minleft is not being
> > handled by the BMBT and inobt block allocation code correctly.
> > Shouldn't btree block allocation drop args->minleft by 1 for
> > each block that is allocated?
>
> At least, in order to convert from inode extents-to-btree, we need
> another block for the following allocation, so minleft = 1 here.
>
> if (ifp->if_format != XFS_DINODE_FMT_BTREE)
> return 1;
Yes, as I said above, that's the degenerate case where we only need
to allocate a root block and set the btree level to 1.
> So I guess what you meant is
> return be16_to_cpu(ifp->if_broot->bb_level) + 1; ?
>
> I don't know why it has another 1 here,
It's a btree. What does a full height btree split do?
It adds a block to each existing level, and splits the root block
into two. Which means we need to increase the tree height by 1 and
allocate a new root block. IOWs, the number of allocations/blocks
needed by a full split is (current height + 1).
> yet even if we account an
> extra block here, I think it doesn't have some critical result
> since the worst case is that it just returns -ENOSPC in advance.
>
> But in principle, most users use terabytes XFS, so I think such
> one extra block doesn't matter too much. I will update this if
> such 1 is meaningless, but it doesn't actually contribute to the
> real shutdown issue.
I think you misunderstood what I was asking. Let's unroll the
extent allocation/BMBT record insert loop:
extent allocation
args.minleft = bb_level + 1
xfs_alloc_vextent(args)
bmbt record insert
xfs_btree_insert()
leaf split
xfs_bmbt_alloc_block()
args.minleft = ???
xfs_alloc_vextent(args)
level 1 node split
xfs_bmbt_alloc_block()
args.minleft = ???
xfs_alloc_vextent(args)
level 2 node split
xfs_bmbt_alloc_block()
args.minleft = ???
xfs_alloc_vextent(args)
....
root split
xfs_bmbt_alloc_block()
args.minleft = ???
xfs_alloc_vextent(args)
A BMBT split results in a chain of individual allocations. What
should args.minleft be set to on each of these allocations, and
what context do we have to ensure it is set correctly?
So the question I was asking was whether what we are doing with
args->minleft for each allocation in the chain is correct, and
whether they need modification if we have to take into account the
AGFL block refilling that may need to occur after each BMBT block
allocation?
Indeed, if we get the initial extent allocation reservation correct,
does minleft even matter for the rest of the allocations in the
chain?
Looking at xfs_bmbt_alloc_block(), it sets args.minleft = 0 if there
was a previous allocation in the transaction (i.e. args.fsbno !=
NULLFSBLOCK). It assumes that the original extent reservation set
args.minleft appropriately to reserve enough space for all
subsequent calls to xfs_bmbt_alloc_block() in this transaction to
succeed.
Hence, given the way it is implemented right now, all we need to do
is ensure that the initial allocation has all the space reservation
the entire operation may need and the rest is good, yes?
xfs_bmap_extents_to_btree() also sets args->minleft = 0, so as long
as the first allocation in the transaction has reserved enough
blocks in args->minleft it doesn't need any special help, either.
So, yes, you are right that avoiding ENOSPC when running multiple
allocations in a single transaction is all based on the initial
allocation ensuring there is enough space in the AG for all
subsequent allocations to succeed. But there's a lot more to it than
that....
> > > > The problem here is that the internal allocation code is failing to
> > > > handle the corner case where space is just about gone correctly.
> > > >
> > > > As I pointed out previously - we have a huge amount of reserve space
> > > > available in the AG here, so why not use some of the reserve space
> > > > to get out of this temporary deficit corner case? We can argue that
> > > > it's not really a deficit, either, because moving free blocks to the
> > > > free list still accounts them as unused and free, so could still
> > > > make up part of the unused reservation....
> > > >
> > > > i.e. is the problem here simply that we don't allow AGFL blocks to
> > > > be considered part of the reserved free space?
> > >
> > > I don't know how to simply reuse per-AG reservation blocks for this,
> >
> > I don't know either, which is *why I asked the question*. i.e. I'm
> > asking for you to investigate a potential alternative solution that
> > challenges a design assumption this code makes. i.e. AGBNO and AGCNT
> > btree blocks are considered free space because when we are at ENOSPC
> > they are empty.
> >
> > However, with this ag reservation code, we can be at ENOSPC when
> > there are still tens of thousands of free extents, and hence the
> > AGBNO and AGCNT btree blocks are used space, not free space. The
> > AGFL accounting is based on AGFL blocks being considered free space,
> > which matches the AG btree blocks being considered free space, and
> > so maybe the root of the problem here is the assumption that AG
> > btree blocks and AGFL blocks are accounted as free space rather than
> > part of this new "reserved space"....
>
> I have strong feeling that the current per-AG reservation code (or
> AGFL reservation as in xfs_alloc_min_freelist() ) doesn't work
> properly for multiple allocations in the same AG in order to make
> sure such multiple allocations all succeed.
>
> Also, a wilder question is that I'm not sure why such multiple
> allocations in oneshot _cannot_ be handled with a defer ops as
> some new log intent, so that we don't need to care about minleft
> messy anymore.
We do use intents and deferred ops for BMBT freeing and reflink
based insertion, but those only log changes to individual records in
the btree. They do not record internal btree shape changes at all.
Yes, we could convert normal extent allocation to use these intents
as well, but that doesn't solve the problem of chained allocations
within a single AG.
IOWs, the chain of allocations for a BMBT split I mention above
still exists for record level intents. To handle the btree split
case as a chain of intents involves a whole new level of complexity
and overhead in the btree code, and likely introduces more problems
at ENOSPC than it solves...
Cheers,
Dave.
--
Dave Chinner
david@...morbit.com
Powered by blists - more mailing lists