linux-kernel - Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zc1GwE/7QJisKZCX@dread.disaster.area>
Date: Thu, 15 Feb 2024 10:03:28 +1100
From: Dave Chinner <david@...morbit.com>
To: John Garry <john.g.garry@...cle.com>
Cc: "Darrick J. Wong" <djwong@...nel.org>, hch@....de,
	viro@...iv.linux.org.uk, brauner@...nel.org, dchinner@...hat.com,
	jack@...e.cz, chandan.babu@...cle.com, martin.petersen@...cle.com,
	linux-kernel@...r.kernel.org, linux-xfs@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, tytso@....edu, jbongio@...gle.com,
	ojaswin@...ux.ibm.com
Subject: Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support

On Wed, Feb 14, 2024 at 11:06:11AM +0000, John Garry wrote:
> 
> > 
> > > Setting the rtvol extsize at mkfs time or enabling atomic writes
> > > FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do in
> > > terms of atomic writes.
> > 
> > Which is wrong. mkfs.xfs gets physical information about the volume
> > from the kernel and makes the filesystem accounting to that
> > information. That's how we do stripe alignment, sector sizing, etc.
> > Atomic write support and setting up alignment constraints should be
> > no different.
> 
> Yes, I was just looking at adding a mkfs option to format for atomic writes,
> which would check physical information about the volume and whether it suits
> rtextsize and then subsequently also set XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.

FWIW, atomic writes need to be implemented in XFS in a way that
isn't specific to the rtdev. There is no reason they cannot be
applied to the data device (via superblock max atomic write size
field or extent size hints and the align flag) so
please don't get hung up on rtextsize as the only thing that atomic
write alignment might apply to.

> > Yes, mkfs allows the user to override the hardware configsi it
> > probes, but it also warns when the override is doing something
> > sub-optimal (like aligning all AG headers to the same disk in a
> > stripe).
> > 
> > IOWs, mkfs should be pulling this atomic write info from the
> > hardware and configuring the filesysetm around that information.
> > That's the target we should be aiming the kernel implementation at
> > and optimising for - a filesystem that is correctly configured
> > according to published hardware capability.
> 
> Right
> 
> So, for example, if the atomic writes option is set and the rtextsize set by
> the user is so much larger than what HW can support in terms of atomic
> writes, then we should let the user know about this.

Well, this is part of the problem I mention above: you're focussing
entirely on the rtdev setup and not the general "atomic writes
require BMBT extent alignment constraints".

So, maybe, yes, we might want to warn that the rtextsize is much
bigger than the atomic write size of that device, but now there's
something else we need to take into account: the rtdev could have a
different atomic write size comxpapred to the data device.

What now?

IOWs, focussing on the rtdev misses key considerations for making
the functionality generic, and we most definitely don't want to have
to rev the on disk format a second time to support atomic writes
for the data device. Hence we likely need two variables for atomic
write sizes in the superblock - one for the rtdev, and one for the
data device. And this then feeds through to Darrick's multi-rtdev
stuff - each rtdev will need to have an attribute that tracks this
information.

Actually, now that I think about it, we may require 3 sizes - I'm in
the process of writing a design doc for an new journal format, and
one of the things I want it to be able to do is use atomic writes to
enable journal writes to be decoupled from device sector sizes. If
we are using atomic writes > sector size for the journal, then we
definitely need to record somewhere in the superblock what the
minimum journal IO size is because it isn't going to be the sector
size anymore...

[...]

> > > If so, then my
> > > concern is that the bdev atomic write upper limit is not fixed. This can
> > > solved, but I would still like to be clear on this max atomic write size.
> > 
> > Right, we haven't clearly defined how XFS is supposed align BMBT
> > operations in a way that is compatible for atomic write operations.
> > 
> > What the patchset does is try to extend and infer things from
> > existing allocation alignment constraints, but then not apply those
> > constraints to critical extent state operations (pure BMBT
> > modifications) that atomic writes also need constrained to work
> > correctly and efficiently.
> 
> Right. Previously I also did mention that we could explicitly request the
> atomic write size per-inode, but a drawback is that this would require an
> on-disk format change.

We're already having to change the on-disk format for this (inode
flag, superblock feature bit), so we really should be trying to make
this generic and properly featured so that it can be used for
anything that requires physical alignment of file data extents, not
just atomic writes...

> > > > i.e. atomic writes need to use max write size granularity for all IO
> > > > operations, not filesystem block granularity.
> > > > 
> > > > And that also means things like rtextsize and extsize hints need to
> > > > match these atomic write requirements, too....
> > > 
> > > As above, I am not 100% sure if you mean these to be the atomic write
> > > maximal value.
> > 
> > Yes, they either need to be the same as the atomic write max value
> > or, much better, once a hint has been set, then resultant size is
> > then aligned up to be compatible with the atomic write constraints.
> > 
> > e.g. set an inode extent size hint of 960kB on a device with 128kB
> > atomic write capability. If the inode has the atomic write flag set,
> > then allocations need to round the extent size up from 960kB to 1MB
> > so that the BMBT extent layout and alignment is compatible with 128kB
> > atomic writes.
> > 
> > At this point, zeroing, punching, unwritten extent conversion, etc
> > also needs to be done on aligned 128kB ranges to be comaptible with
> > atomic writes, rather than filesysetm block boundaries that would
> > normally be used if just the extent size hint was set.
> > 
> > This is effectively what the original "force align" inode flag bit
> > did - it told the inode that all BMBT manipulations need to be
> > extent size hint aligned, not just allocation. It's a generic flag
> > that implements specific extent manipulation constraints that happen
> > to be compatible with atomic writes when the right extent size hint
> > is set.
> > 
> > So at this point, I'm not sure that we should have an "atomic
> > writes" flag in the inode.
> 
> Another motivation for this flag is that we can explicitly enable some
> software-based atomic write support for an inode when the backing device
> does not have HW support.

That's orthogonal to the aligment issue. If the BMBT extents are
always aligned in a way that is compatible with atomic writes, we
don't need and aomtic writes flag to tell the filesystem it should
do an atomic write. That comes from userspace via the IOCB_ATOMIC
flag.

It is the IOCB_ATOMIC that triggers the software fallback when the
hardware doesn't support atomic writes, not an inode flag. All the
filesystem has to do is guarantee all extent manipulations are
correctly aligned and IOCB_ATOMIC can always be executed regardless
of whether it is hardware or software that does it.


> In addition, in this series setting FS_XFLAG_ATOMICWRITES means
> XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something
> similar for other OSes, and for those other OSes it may also mean some other
> special alignment feature enabled. We want a consistent user experience.

I don't care about other OSes - they don't implement the
FS_IOC_FSSETXATTR interfaces, so we just don't care about cross-OS
compatibility for the user API.

Fundamentally, atomic writes are *not a property of the filesystem
on-disk format*. They require extent tracking constraints (i.e.
alignment), and that's the property of the filesystem on-disk format
that we need to manipulate here.

Users are not going to care if the setup ioctl for atomic writes
is to set FS_XFLAG_ALIGN_EXTENTS vs FS_XFLAG_ATOMICWRITES. They
already know they have to align their IO properly for atomic writes,
so it's not like this is something they will be completely
unfamiliar with.

Indeed, FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/ fsx.fsx_extsize
= max_atomic_write_size as a user interface to set up the inode for 
atomic writes is pretty concise and easy to use. It also isn't
specific to atomic writes, and so this fucntionality can be used for
anything that needs aligned extent manipulation for performant
functionality.

> > to behave in a particular way - forced alignment - not that atomic
> > writes are being used in the filesystem....
> > 
> > At this point, the filesystem can do all the extent modification
> > alignment stuff that atomic writes require without caring if the
> > block device supports atomic writes or even if the
> > application is using atomic writes.
> > 
> > This means we can test the BMBT functionality in fstests without
> > requiring hardware (or emulation) that supports atomic writes - all
> > we need to do is set the forced align flag, an extent size hint and
> > go run fsx on it...
> > 
> 
> The current idea was that the forcealign feature would be required for the
> second phase for atomic write support - non-rtvol support. Darrick did send
> that series out separately over the New Year's break.

This is the wrong approach: this needs to be integrated into the
same patchset so we can review the changes for atomic writes as a
whole, not as two separate, independent on-disk format changes. The
on-disk format change that atomic writes need is aligned BMBT extent
manipulations, regardless of whether we are targeting the rtdev or
the datadev, and trying to separate them is leading you down
entirely the wrong path.

-Dave.
-- 
Dave Chinner
david@...morbit.com