linux-kernel - Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZcWCeU0n7zKEPHk5@dread.disaster.area>
Date: Fri, 9 Feb 2024 12:40:09 +1100
From: Dave Chinner <david@...morbit.com>
To: John Garry <john.g.garry@...cle.com>
Cc: "Darrick J. Wong" <djwong@...nel.org>, hch@....de,
	viro@...iv.linux.org.uk, brauner@...nel.org, dchinner@...hat.com,
	jack@...e.cz, chandan.babu@...cle.com, martin.petersen@...cle.com,
	linux-kernel@...r.kernel.org, linux-xfs@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, tytso@....edu, jbongio@...gle.com,
	ojaswin@...ux.ibm.com
Subject: Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support

On Wed, Feb 07, 2024 at 02:13:23PM +0000, John Garry wrote:
> On 07/02/2024 00:06, Dave Chinner wrote:
> > > > We really, really don't want to be doing this during allocation
> > > > unless we can avoid it. If the filesystem block size is 64kB, we
> > > > could be allocating up to 96GB per extent, and that becomes an
> > > > uninterruptable write stream inside a transaction context that holds
> > > > inode metadata locked.
> > > Where does that 96GB figure come from?
> > My inability to do math. The actual number is 128GB.
> > 
> > Max extent size = XFS_MAX_BMBT_EXTLEN * fs block size.
> > 	        = 2^21  * fs block size.
> > 
> > So for a 4kB block size filesystem, that's 8GB max extent length,
> > and that's the most we will allocate in a single transaction (i.e.
> > one new BMBT record).
> > 
> > For 64kB block size, we can get 128GB of space allocated in a single
> > transaction.
> 
> atomic write unit max theoretical upper limit is rounddown_power_of_2(2^32 -
> 1) = 2GB
> 
> So this would be what is expected to be the largest extent size requested
> for atomic writes. I am not saying that 2GB is small, but certainly much
> smaller than 128GB.

*cough*

Extent size hints.

I'm a little disappointed that after all these discussions about how
we decouple extent allocation size and alignment from the user IO
size and alignment with things like extent size hints, force align,
etc that you are still thinking that user IO size and alignment
directly drives extent allocation size and alignment....


> > > > > > Why do we want to write zeroes to the disk if we're allocating space
> > > > > > even if we're not sending an atomic write?
> > > > > > 
> > > > > > (This might want an explanation for why we're doing this at all -- it's
> > > > > > to avoid unwritten extent conversion, which defeats hardware untorn
> > > > > > writes.)
> > > > > It's to handle the scenario where we have a partially written extent, and
> > > > > then try to issue an atomic write which covers the complete extent.
> > > > When/how would that ever happen with the forcealign bits being set
> > > > preventing unaligned allocation and writes?
> > > Consider this scenario:
> > > 
> > > # mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda
> > > # mount /dev/sda mnt -o rtdev=/dev/sdb
> > > # touch  mnt/file
> > > # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic
> > > write, 4096B at pos 0
> > Please don't write one-off custom test programs to issue IO - please
> > use and enhance xfs_io so the test cases can then be put straight
> > into fstests without adding yet another "do some minor IO variant"
> > test program. This also means you don't need a random assortment of
> > other tools.
> > 
> > i.e.
> > 
> > # xfs_io -dc "pwrite -VA 0 4096" /root/mnt/file
> > 
> > Should do an RWF_ATOMIC IO, and
> > 
> > # xfs_io -dc "pwrite -VAD 0 4096" /root/mnt/file
> > 
> > should do an RWF_ATOMIC|RWF_DSYNC IO...
> > 
> > 
> > > # filefrag -v mnt/file
> > xfs_io -c "fiemap" mnt/file
> 
> Fine, but I like using something generic for accessing block devices and
> also other FSes. I didn't think that xfs_io can do that.

Yes, it can. We use it extensively in fstests because it works
for any filesystem, not just XFS.

> Anyway, we can look to add atomic write support to xfs_io and any other
> xfs-progs

Please do, then the support is there for developers, users and
fstests without needing to write their own custom test programs.

> > > Filesystem type is: 58465342
> > > File size of mnt/file is 4096 (1 block of 4096 bytes)
> > >    ext:     logical_offset:        physical_offset: length:   expected:
> > > flags:
> > >      0:        0..       0:         24..        24:      1:
> > > last,eof
> > > mnt/file: 1 extent found
> > > # /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file
> > > wrote -1 bytes at pos 0 write_size=16384
> > > #
> > Whole test as one repeatable command:
> > 
> > # xfs_io -d -c "truncate 0" -c "chattr +r" \
> > 	-c "pwrite -VAD 0 4096" \
> > 	-c "fiemap" \
> > 	-c "pwrite -VAD 0 16384" \
> > 	/mnt/root/file
> > > For the 2nd write, which would cover a 16KB extent, the iomap code will iter
> > > twice and produce 2x BIOs, which we don't want - that's why it errors there.
> > Yes, but I think that's a feature.  You've optimised the filesystem
> > layout for IO that is 64kB sized and aligned IO, but your test case
> > is mixing 4kB and 16KB IO. The filesystem should be telling you that
> > you're doing something that is sub-optimal for it's configuration,
> > and refusing to do weird overlapping sub-rtextsize atomic IO is a
> > pretty good sign that you've got something wrong.
> 
> Then we really end up with a strange behavior for the user. I mean, the user
> may ask - "why did this 16KB atomic write pass and this one fail? I'm
> following all the rules", and then "No one said not to mix write sizes or
> not mix atomic and non-atomic writes, so should be ok. Indeed, that earlier
> 4K write for the same region passed".
> 
> Playing devil's advocate here, at least this behavior should be documented.

That's what man pages are for, yes?

Are you expecting your deployments to be run on highly suboptimal
configurations and so the code needs to be optimised for this
behaviour, or are you expecting them to be run on correctly
configured systems which would never see these issues?


> > The whole reason for rtextsize existing is to optimise the rtextent
> > allocation to the typical minimum IO size done to that volume. If
> > all your IO is sub-rtextsize size and alignment, then all that has
> > been done is forcing the entire rt device IO into a corner it was
> > never really intended nor optimised for.
> 
> Sure, but just because we are optimized for a certain IO write size should
> not mean that other writes are disallowed or quite problematic.

Atomic writes are just "other writes". They are writes that are
*expected to fail* if they cannot be done atomically.

Application writers will quickly learn how to do sane, fast,
reliable atomic write IO if we reject anything that is going to
requires some complex, sub-optimal workaround in the kernel to make
it work. The simplest solution is to -fail the write-, because
userspace *must* be prepared for *any* atomic write to fail.

> > Why should we jump through crazy hoops to try to make filesystems
> > optimised for large IOs with mismatched, overlapping small atomic
> > writes?
> 
> As mentioned, typically the atomic writes will be the same size, but we may
> have other writes of smaller size.

Then we need the tiny write to allocate and zero according to the
maximum sized atomic write bounds. Then we just don't care about
large atomic IO overlapping small IO, because the extent on disk
aligned to the large atomic IO is then always guaranteed to be the
correct size and shape.


> > > With the change in this patch, instead we have something like this after the
> > > first write:
> > > 
> > > # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
> > > wrote 4096 bytes at pos 0 write_size=4096
> > > # filefrag -v mnt/file
> > > Filesystem type is: 58465342
> > > File size of mnt/file is 4096 (1 block of 4096 bytes)
> > >    ext:     logical_offset:        physical_offset: length:   expected:
> > > flags:
> > >      0:        0..       3:         24..        27:      4:
> > > last,eof
> > > mnt/file: 1 extent found
> > > #
> > > 
> > > So the 16KB extent is in written state and the 2nd 16KB write would iter
> > > once, producing a single BIO.
> > Sure, I know how it works. My point is that it's a terrible way to
> > go about allowing that second atomic write to succeed.
> I think 'terrible' is a bit too strong a word here.

Doing it anything in a way that a user can DOS the entire filesystem
is *terrible*. No ifs, buts or otherwise.

> Indeed, you suggest to
> manually zero the file to solve this problem, below, while this code change
> does the same thing automatically.

Yes, but I also outlined a way that it can be done automatically
without being terrible. There are multiple options here, I outlined
two different approaches that are acceptible.

> > > > > In this
> > > > > scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
> > > > > ensure that the extent is completely written whenever we allocate it. At
> > > > > least that is my idea.
> > > > So return an unaligned extent, and then the IOMAP_ATOMIC checks you
> > > > add below say "no" and then the application has to do things the
> > > > slow, safe way....
> > > We have been porting atomic write support to some database apps and they
> > > (database developers) have had to do something like manually zero the
> > > complete file to get around this issue, but that's not a good user
> > > experience.
> > Better the application zeros the file when it is being initialised
> > and doesn't have performance constraints rather than forcing the
> > filesystem to do it in the IO fast path when IO performance and
> > latency actually matters to the application.
> 
> Can't we do both? I mean, the well-informed user can still pre-zero the file
> just to ensure we aren't doing this zero'ing with the extent allocation.

I never said we can't do zeroing. I just said that it's normally
better when the application controls zeroing directly.

> > And therein lies the problem.
> > 
> > If you are doing sub-rtextent IO at all, then you are forcing the
> > filesystem down the path of explicitly using unwritten extents and
> > requiring O_DSYNC direct IO to do journal flushes in IO completion
> > context and then performance just goes down hill from them.
> > 
> > The requirement for unwritten extents to track sub-rtextsize written
> > regions is what you're trying to work around with XFS_BMAPI_ZERO so
> > that atomic writes will always see "atomic write aligned" allocated
> > regions.
> > 
> > Do you see the problem here? You've explicitly told the filesystem
> > that allocation is aligned to 64kB chunks, then because the
> > filesystem block size is 4kB, it's allowed to track unwritten
> > regions at 4kB boundaries. Then you do 4kB aligned file IO, which
> > then changes unwritten extents at 4kB boundaries. Then you do a
> > overlapping 16kB IO that*requires*  16kB allocation alignment, and
> > things go BOOM.
> > 
> > Yes, they should go BOOM.
> > 
> > This is a horrible configuration - it is incomaptible with 16kB
> > aligned and sized atomic IO.
> 
> Just because the DB may do 16KB atomic writes most of the time should not
> disallow it from any other form of writes.

That's not what I said. I said the using sub-rtextsize atomic writes
with single FSB unwritten extent tracking is horrible and
incompatible with doing 16kB atomic writes.

This setup will not work at all well with your patches and should go
BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup
has uncoordinated extent allocation and unwritten conversion
granularity.

That's the fundamental design problem with your approach - it allows
unwritten conversion at *minimum IO sizes* and that does not work
with atomic IOs with larger alignment requirements.

The fundamental design principle is this: for maximally sized atomic
writes to always succeed we require every allocation, zeroing and
unwritten conversion operation to use alignments and sizes that are
compatible with the maximum atomic write sizes being used.

i.e. atomic writes need to use max write size granularity for all IO
operations, not filesystem block granularity.

And that also means things like rtextsize and extsize hints need to
match these atomic write requirements, too....

> > Allocation is aligned to 64kB, written
> > region tracking is aligned to 4kB, and there's nothing to tell the
> > filesystem that it should be maintaining 16kB "written alignment" so
> > that 16kB atomic writes can always be issued atomically.
> > 
> > i.e. if we are going to do 16kB aligned atomic IO, then all the
> > allocation and unwritten tracking needs to be done in 16kB aligned
> > chunks, not 4kB. That means a 4KB write into an unwritten region or
> > a hole actually needs to zero the rest of the 16KB range it sits
> > within.
> > 
> > The direct IO code can do this, but it needs extension of the
> > unaligned IO serialisation in XFS (the alignment checks in
> > xfs_file_dio_write()) and the the sub-block zeroing in
> > iomap_dio_bio_iter() (the need_zeroing padding has to span the fs
> > allocation size, not the fsblock size) to do this safely.
> > 
> > Regardless of how we do it, all IO concurrency on this file is shot
> > if we have sub-rtextent sized IOs being done. That is true even with
> > this patch set - XFS_BMAPI_ZERO is done whilst holding the
> > XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the
> > zeroing is being done.
> > 
> > IOWs, anything to do with sub-rtextent IO really has to be treated
> > like sub-fsblock DIO - i.e. exclusive inode access until the
> > sub-rtextent zeroing has been completed.
> 
> I do understand that this is not perfect that we may have mixed block sizes
> being written, but I don't think that we should disallow it and throw an
> error.

Ummmm, did you read what you quoted?

The above is an outline of the IO path modifications that will allow
mixed IO sizes to be used with atomic writes without requiring the
XFS_BMAPI_ZERO hack. It pushes the sub-atomic write alignment
zeroing out to the existing DIO sub-block zeroing, hence ensuring
that we only ever convert unwritten extents on max sized atomic
write boundaries for atomic write enabled inodes.

At no point have I said "no mixed writes". I've said no to the
XFS_BMAPI_ZERO hack, but then I've explained the fundamental issue
that it works around and given you a decent amount of detail on how
to sanely implementing mixed write support that will work (slowly)
with those configurations and IO patterns.

So it's your choice - you can continue to beleive I don't mixed
writes to work at all, or you can go back and try to understand the
IO path changes I've suggested that will allow mixed atomic writes
to work as well as they possibly can....

-Dave.
-- 
Dave Chinner
david@...morbit.com