Message-ID: <20250409004156.GL6307@frogsfrogsfrogs>
Date: Tue, 8 Apr 2025 17:41:56 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: Dave Chinner <david@...morbit.com>
Cc: John Garry <john.g.garry@...cle.com>, brauner@...nel.org, hch@....de,
viro@...iv.linux.org.uk, jack@...e.cz, cem@...nel.org,
linux-fsdevel@...r.kernel.org, dchinner@...hat.com,
linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
ojaswin@...ux.ibm.com, ritesh.list@...il.com,
martin.petersen@...cle.com, linux-ext4@...r.kernel.org,
linux-block@...r.kernel.org, catherine.hoang@...cle.com
Subject: Re: [PATCH v6 11/12] xfs: add xfs_compute_atomic_write_unit_max()
On Wed, Apr 09, 2025 at 08:47:09AM +1000, Dave Chinner wrote:
> On Tue, Apr 08, 2025 at 10:42:08AM +0000, John Garry wrote:
> > Now that CoW-based atomic writes are supported, update the max size of an
> > atomic write for the data device.
> >
> > The limit of a CoW-based atomic write will be the limit of the number of
> > logitems which can fit into a single transaction.
>
> I still think this is the wrong way to define the maximum
> size of a COW-based atomic write because it is going to change from
> filesystem to filesystem and that variability in supported maximum
> length will be exposed to userspace...
>
> i.e. Maximum supported atomic write size really should be defined as
> a well documented fixed size (e.g. 16MB). Then the transaction
> reservations sizes needed to perform that conversion can be
> calculated directly from that maximum size and optimised directly
> for the conversion operation that atomic writes need to perform.
I'll get to this below...
> .....
>
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index b2dd0c0bf509..42b2b7540507 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -615,6 +615,28 @@ xfs_init_mount_workqueues(
> >  	return -ENOMEM;
> >  }
> >  
> > +unsigned int
> > +xfs_atomic_write_logitems(
> > +	struct xfs_mount	*mp)
> > +{
> > +	unsigned int		efi = xfs_efi_item_overhead(1);
> > +	unsigned int		rui = xfs_rui_item_overhead(1);
> > +	unsigned int		cui = xfs_cui_item_overhead(1);
> > +	unsigned int		bui = xfs_bui_item_overhead(1);
> > +	unsigned int		logres = M_RES(mp)->tr_write.tr_logres;
> > +
> > +	/*
> > +	 * Maximum overhead to complete an atomic write ioend in software:
> > +	 * remove data fork extent + remove cow fork extent +
> > +	 * map extent into data fork
> > +	 */
> > +	unsigned int		atomic_logitems =
> > +		(bui + cui + rui + efi) + (cui + rui) + (bui + rui);
>
> This seems wrong. Unmap from the data fork only logs a (bui + cui)
> pair, we don't log a RUI or an EFI until the transaction that
> processes the BUI or CUI actually frees an extent from the BMBT
> or removes a block from the refcount btree.
This is tricky -- the first transaction in the chain creates a BUI and a
CUI and that's all it needs.
Then we roll to finish the BUI. The second transaction needs space for
the BUD, an RUI, and enough space to relog the CUI (== CUI + CUD).
Then we roll again to finish the RUI. This third transaction needs
space for the RUD and space to relog the CUI.
Roll again, fourth transaction needs space for the CUD and possibly a
new EFI.
Roll again, fifth transaction needs space for an EFD.  In code:
	const unsigned int	efi = xfs_efi_log_space(1);
	const unsigned int	efd = xfs_efd_log_space(1);
	const unsigned int	rui = xfs_rui_log_space(1);
	const unsigned int	rud = xfs_rud_log_space();
	const unsigned int	cui = xfs_cui_log_space(1);
	const unsigned int	cud = xfs_cud_log_space();
	const unsigned int	bui = xfs_bui_log_space(1);
	const unsigned int	bud = xfs_bud_log_space();

	/*
	 * Maximum overhead to complete an atomic write ioend in software:
	 * remove data fork extent + remove cow fork extent + map extent into
	 * data fork.
	 *
	 * tx0: Creates a BUI and a CUI and that's all it needs.
	 *
	 * tx1: Roll to finish the BUI.  Need space for the BUD, an RUI, and
	 * enough space to relog the CUI (== CUI + CUD).
	 *
	 * tx2: Roll again to finish the RUI.  Need space for the RUD and
	 * space to relog the CUI.
	 *
	 * tx3: Roll again, need space for the CUD and possibly a new EFI.
	 *
	 * tx4: Roll again, need space for an EFD.
	 */
	const unsigned int	tx0 = bui + cui;
	const unsigned int	tx1 = bud + rui + cui + cud;
	const unsigned int	tx2 = rud + cui + cud;
	const unsigned int	tx3 = cud + efi;
	const unsigned int	tx4 = efd;
	const unsigned int	per_intent = max3(max3(tx0, tx1, tx2), tx3, tx4);
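(For reference, per_intent works out to 208 bytes in all of the xfs_db
runs below; it depends only on the log item sizes, not the filesystem
geometry.)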
> We also need to be able to relog all the intents and everything that
> was modified, so we effectively have at least one
> xfs_allocfree_block_count() reservation needed here as well. Even
> finishing an invalidation BUI can result in BMBT block allocation
> occurring if the operation splits an existing extent record and the
> insert of the new record causes a BMBT block split....
Agreed.
> > +
> > +	/* atomic write limits are always a power-of-2 */
> > +	return rounddown_pow_of_two(logres / (2 * atomic_logitems));
>
> What is the magic 2 in that division?
That's handwaving the lack of a computation involving
xfs_allocfree_block_count. A better version would be to figure out the
log space needed:
	/* Overhead to finish one step of each intent item type */
	const unsigned int	f1 = libxfs_calc_finish_efi_reservation(mp, 1);
	const unsigned int	f2 = libxfs_calc_finish_rui_reservation(mp, 1);
	const unsigned int	f3 = libxfs_calc_finish_cui_reservation(mp, 1);
	const unsigned int	f4 = libxfs_calc_finish_bui_reservation(mp, 1);

	/* We only finish one item per transaction in a chain */
	const unsigned int	step_size = max(f4, max3(f1, f2, f3));
and then you have what you need to figure out the logres needed to
support a given awu_max, or vice versa:
	if (desired_max) {
		dbprintf(
 "desired_max: %u\nstep_size: %u\nper_intent: %u\nlogres: %u\n",
				desired_max, step_size, per_intent,
				(desired_max * per_intent) + step_size);
	} else if (logres) {
		dbprintf(
 "logres: %u\nstep_size: %u\nper_intent: %u\nmax_awu: %u\n",
				logres, step_size, per_intent,
				logres >= step_size ?
					(logres - step_size) / per_intent : 0);
	}
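If the kernel were to compute awu_max the same way, it might look
something like this rough sketch (xfs_atomic_write_fsblocks is a name I
just made up; step_size and per_intent would be computed as above):

	/*
	 * Rough sketch, not the actual patch: how many fsblocks of untorn
	 * write can the tr_write reservation complete, given the per-chain
	 * step_size and per-extent per_intent computed above?
	 */
	static unsigned int
	xfs_atomic_write_fsblocks(
		struct xfs_mount	*mp,
		unsigned int		step_size,
		unsigned int		per_intent)
	{
		unsigned int		logres = M_RES(mp)->tr_write.tr_logres;
		unsigned int		blocks;

		if (logres < step_size)
			return 0;

		blocks = (logres - step_size) / per_intent;
		if (!blocks)
			return 0;

		/* atomic write limits are always a power-of-2 */
		return rounddown_pow_of_two(blocks);
	}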
I hacked this into xfs_db so that I could compute a variety of
scenarios. Let's pretend I have a 10T filesystem with 4k fsblocks and
the default configuration.
# xfs_db /dev/mapper/moo -c logres -c "untorn_max -b $(( (16 * 1048576) / 4096 ))" -c "untorn_max -l 244216"
type 0 logres 244216 logcount 5 flags 0x4
type 1 logres 428928 logcount 5 flags 0x4
<snip>
minlogsize logres 648576 logcount 8
To emulate a 16MB untorn write, you'd need a logres of:
desired_max: 4096
step_size: 107520
per_intent: 208
logres: 959488
959488 > 648576, which would alter the minlogsize calculation.  I know
we banned tiny filesystems a few years ago, but there's still a risk in
increasing the minimum log size on existing filesystems.
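(Sanity check: that logres is just desired_max * per_intent + step_size,
i.e. 4096 * 208 + 107520 = 851968 + 107520 = 959488.)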
For the tr_write logres that John picked, the awu_max we can emulate is:
logres: 244216
step_size: 107520
per_intent: 208
max_awu: 657
That's enough to emulate a 2MB untorn write.
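(Spelling out the arithmetic: (244216 - 107520) / 208 = 657 fsblocks,
and rounding down to a power of two gives 512 4k fsblocks == 2MB.)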
Now let's try again with a realtime filesystem, because the log
reservations are absurdly huge there:
# xfs_db /dev/mapper/moo -c logres -c "untorn_max -b $(( (16 * 1048576) / 4096 ))" -c "untorn_max -l 417400"
type 0 logres 417400 logcount 5 flags 0x4
type 1 logres 772736 logcount 5 flags 0x4
<snip>
minlogsize logres 772736 logcount 5
For 16MB, you'd need a logres of:
desired_max: 4096
step_size: 107520
per_intent: 208
logres: 959488
959488 > 772736, which would alter the minlogsize calculation.
For the tr_write logres that John picked, the awu_max we can emulate is:
logres: 417400
step_size: 107520
per_intent: 208
max_awu: 1489
This is more than enough to handle a 4MB atomic write without affecting
any of the other filesystem geometry.  Now I'll try a 400MB filesystem:
# xfs_db /dev/mapper/moo -c logres -c "untorn_max -b $(( (16 * 1048576) / 4096 ))" -c "untorn_max -l 142840"
type 0 logres 142840 logcount 5 flags 0x4
type 1 logres 226176 logcount 5 flags 0x4
<snip>
minlogsize logres 445824 logcount 8
desired_max: 4096
step_size: 56832
per_intent: 208
logres: 908800
Definitely still above minlogsize.
logres: 142840
step_size: 56832
per_intent: 208
max_awu: 413
We can still simulate a 1MB untorn write even on a tiny filesystem.
On a 1k fsblock 400MB filesystem:
# xfs_db /dev/mapper/moo -c logres -c "untorn_max -b $(( (16 * 1048576) / 4096 ))" -c "untorn_max -l 63352"
type 0 logres 63352 logcount 5 flags 0x4
type 1 logres 103296 logcount 5 flags 0x4
<snip>
minlogsize logres 153984 logcount 8
desired_max: 4096
step_size: 26112
per_intent: 208
logres: 878080
Again we see that we'd have to increase the min logsize to support a
16MB untorn write.
logres: 63352
step_size: 26112
per_intent: 208
max_awu: 179
But we can still emulate a 128K untorn write in software.
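Summarizing the four scenarios, with max_awu rounded down to a power of
two:

	geometry       logres  step_size  per_intent  max_awu  untorn write
	10T, 4k        244216     107520         208      657  2MB
	10T, 4k, rt    417400     107520         208     1489  4MB
	400M, 4k       142840      56832         208      413  1MB
	400M, 1k        63352      26112         208      179  128K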
This is why I don't agree with adding a static 16MB limit -- we clearly
don't need it to emulate current hardware, which can commit up to 64k
atomically. Future hardware can increase that by 64x and we'll still be
ok with using the existing tr_write transaction type.
By contrast, adding a 16MB limit would result in a much larger minimum
log size. If we add that to struct xfs_trans_resv for all filesystems
then we run the risk of some ancient filesystem with a 12M log suddenly
failing to mount on a new kernel.
I don't see the point.
--D
> > +}
>
> Also this function does not belong in xfs_super.c - that file is for
> interfacing with the VFS layer. Calculating log reservation
> constants at mount time is done in xfs_trans_resv.c - I suspect most
> of the code in this patch should probably be moved there and run
> from xfs_trans_resv_calc()...
>
> -Dave.
> --
> Dave Chinner
> david@...morbit.com