[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c0d1d513-991b-4893-a132-66ee02c0c880@oracle.com>
Date: Thu, 15 Feb 2024 09:53:47 +0000
From: John Garry <john.g.garry@...cle.com>
To: Dave Chinner <david@...morbit.com>
Cc: "Darrick J. Wong" <djwong@...nel.org>, hch@....de, viro@...iv.linux.org.uk,
brauner@...nel.org, dchinner@...hat.com, jack@...e.cz,
chandan.babu@...cle.com, martin.petersen@...cle.com,
linux-kernel@...r.kernel.org, linux-xfs@...r.kernel.org,
linux-fsdevel@...r.kernel.org, tytso@....edu, jbongio@...gle.com,
ojaswin@...ux.ibm.com
Subject: Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
>>
>> Yes, I was just looking at adding a mkfs option to format for atomic writes,
>> which would check physical information about the volume and whether it suits
>> rtextsize and then subsequently also set XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.
>
> FWIW, atomic writes need to be implemented in XFS in a way that
> isn't specific to the rtdev. There is no reason they cannot be
> applied to the data device (via superblock max atomic write size
> field or extent size hints and the align flag) so
> please don't get hung up on rtextsize as the only thing that atomic
> write alignment might apply to.
Sure
>
>>> Yes, mkfs allows the user to override the hardware configsi it
>>> probes, but it also warns when the override is doing something
>>> sub-optimal (like aligning all AG headers to the same disk in a
>>> stripe).
>>>
>>> IOWs, mkfs should be pulling this atomic write info from the
>>> hardware and configuring the filesysetm around that information.
>>> That's the target we should be aiming the kernel implementation at
>>> and optimising for - a filesystem that is correctly configured
>>> according to published hardware capability.
>>
>> Right
>>
>> So, for example, if the atomic writes option is set and the rtextsize set by
>> the user is so much larger than what HW can support in terms of atomic
>> writes, then we should let the user know about this.
>
> Well, this is part of the problem I mention above: you're focussing
> entirely on the rtdev setup and not the general "atomic writes
> require BMBT extent alignment constraints".
I'm really just saying what I was considering based on this series only.
>
> So, maybe, yes, we might want to warn that the rtextsize is much
> bigger than the atomic write size of that device, but now there's
> something else we need to take into account: the rtdev could have a
> different atomic write size comxpapred to the data device.
>
> What now?
>
> IOWs, focussing on the rtdev misses key considerations for making
> the functionality generic, and we most definitely don't want to have
> to rev the on disk format a second time to support atomic writes
> for the data device. Hence we likely need two variables for atomic
> write sizes in the superblock - one for the rtdev, and one for the
> data device. And this then feeds through to Darrick's multi-rtdev
> stuff - each rtdev will need to have an attribute that tracks this
> information.
ok
>>>
>>> What the patchset does is try to extend and infer things from
>>> existing allocation alignment constraints, but then not apply those
>>> constraints to critical extent state operations (pure BMBT
>>> modifications) that atomic writes also need constrained to work
>>> correctly and efficiently.
>>
>> Right. Previously I also did mention that we could explicitly request the
>> atomic write size per-inode, but a drawback is that this would require an
>> on-disk format change.
>
> We're already having to change the on-disk format for this (inode
> flag, superblock feature bit), so we really should be trying to make
> this generic and properly featured so that it can be used for
> anything that requires physical alignment of file data extents, not
> just atomic writes...
ok
..
>> Another motivation for this flag is that we can explicitly enable some
>> software-based atomic write support for an inode when the backing device
>> does not have HW support.
>
> That's orthogonal to the aligment issue. If the BMBT extents are
> always aligned in a way that is compatible with atomic writes, we
> don't need and aomtic writes flag to tell the filesystem it should
> do an atomic write.
Any instruction to do an atomic write should be encoded in the userspace
write operation. Or maybe the file open operation in future - I still
get questions about O_ATOMIC.
> That comes from userspace via the IOCB_ATOMIC
> flag.
>
> It is the IOCB_ATOMIC that triggers the software fallback when the
> hardware doesn't support atomic writes, not an inode flag.
To me, any such fallback seems like something which we should be
explicitly enabling.
> All the
> filesystem has to do is guarantee all extent manipulations are
> correctly aligned and IOCB_ATOMIC can always be executed regardless
> of whether it is hardware or software that does it.
>
>
>> In addition, in this series setting FS_XFLAG_ATOMICWRITES means
>> XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something
>> similar for other OSes, and for those other OSes it may also mean some other
>> special alignment feature enabled. We want a consistent user experience.
>
> I don't care about other OSes - they don't implement the
> FS_IOC_FSSETXATTR interfaces, so we just don't care about cross-OS
> compatibility for the user API.
Other FSes need to support FS_IOC_FSSETXATTR for atomic writes. Or at
least should support it....
>
> Fundamentally, atomic writes are *not a property of the filesystem
> on-disk format*. They require extent tracking constraints (i.e.
> alignment), and that's the property of the filesystem on-disk format
> that we need to manipulate here.
>
> Users are not going to care if the setup ioctl for atomic writes
> is to set FS_XFLAG_ALIGN_EXTENTS vs FS_XFLAG_ATOMICWRITES. They
> already know they have to align their IO properly for atomic writes,
> so it's not like this is something they will be completely
> unfamiliar with.
>
> Indeed, FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/ fsx.fsx_extsize
> = max_atomic_write_size as a user interface to set up the inode for
> atomic writes is pretty concise and easy to use. It also isn't
> specific to atomic writes, and so this fucntionality can be used for
> anything that needs aligned extent manipulation for performant
> functionality.
This is where there seems to be a difference between how you would like
atomic writes to be enabled and how Christoph would, judging by this:
https://lore.kernel.org/linux-fsdevel/20240110091929.GA31003@lst.de/
I think that if the FS and the userspace util can between them figure
out what to do, then that is ok. This is something like what I proposed
previously:
xfs_io -c "atomic-writes 64K" mnt/file
where the userspace util would use for the FS_IOC_FSSETXATTR ioctl:
FS_XFLAG_ATOMIC_WRITES | FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/
fsx.fsx_extsize
There is a question on the purpose of the FS_XFLAG_ATOMIC_WRITES flag,
but, to me, it does seem useful for future feature support.
We could just have FS_XFLAG_ATOMIC_WRITES | FS_XFLAG_EXTSIZE w/
fsx.fsx_extsize, and the kernel auto-enables FS_XFLAG_ALIGN_EXTENTS, but
the other way seems better
>
>>> to behave in a particular way - forced alignment - not that atomic
>>> writes are being used in the filesystem....
>>>
>>> At this point, the filesystem can do all the extent modification
>>> alignment stuff that atomic writes require without caring if the
>>> block device supports atomic writes or even if the
>>> application is using atomic writes.
>>>
>>> This means we can test the BMBT functionality in fstests without
>>> requiring hardware (or emulation) that supports atomic writes - all
>>> we need to do is set the forced align flag, an extent size hint and
>>> go run fsx on it...
>>>
>>
>> The current idea was that the forcealign feature would be required for the
>> second phase for atomic write support - non-rtvol support. Darrick did send
>> that series out separately over the New Year's break.
>
> This is the wrong approach: this needs to be integrated into the
> same patchset so we can review the changes for atomic writes as a
> whole, not as two separate, independent on-disk format changes. The
> on-disk format change that atomic writes need is aligned BMBT extent
> manipulations, regardless of whether we are targeting the rtdev or
> the datadev, and trying to separate them is leading you down
> entirely the wrong path.
>
ok, fine.
BTW, going back to the original discussion on the extent zeroing and
your idea to do this in the iomap dio path, my impression is that we
require changes like this:
- in iomap_dio_bio_iter(), we need to zero out the extent according to
extsize (and not just FS block size)
- xfs_dio_write_end_io() -> xfs_iomap_write_unwritten() also needs to
consider this extent being written, and not assume a FS block
- per-inode locking similar to what is done in
xfs_file_dio_write_unaligned() - I need to check that further, though,
as I am not yet sure on how we decide to use this exclusive lock or not.
Does this sound sane?
Thanks,
John
Powered by blists - more mailing lists