linux-kernel - Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <74e13ebf-2bd7-4487-8453-d98d70ba5e68@oracle.com>
Date: Wed, 14 Feb 2024 11:06:11 +0000
From: John Garry <john.g.garry@...cle.com>
To: Dave Chinner <david@...morbit.com>
Cc: "Darrick J. Wong" <djwong@...nel.org>, hch@....de, viro@...iv.linux.org.uk,
        brauner@...nel.org, dchinner@...hat.com, jack@...e.cz,
        chandan.babu@...cle.com, martin.petersen@...cle.com,
        linux-kernel@...r.kernel.org, linux-xfs@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, tytso@....edu, jbongio@...gle.com,
        ojaswin@...ux.ibm.com
Subject: Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support


> 
>> Setting the rtvol extsize at mkfs time or enabling atomic writes
>> FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do in
>> terms of atomic writes.
> 
> Which is wrong. mkfs.xfs gets physical information about the volume
> from the kernel and makes the filesystem accounting to that
> information. That's how we do stripe alignment, sector sizing, etc.
> Atomic write support and setting up alignment constraints should be
> no different.

Yes, I was just looking at adding a mkfs option to format for atomic 
writes, which would check physical information about the volume and 
whether it suits rtextsize and then subsequently also set 
XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.

> 
> Yes, mkfs allows the user to override the hardware configsi it
> probes, but it also warns when the override is doing something
> sub-optimal (like aligning all AG headers to the same disk in a
> stripe).
> 
> IOWs, mkfs should be pulling this atomic write info from the
> hardware and configuring the filesysetm around that information.
> That's the target we should be aiming the kernel implementation at
> and optimising for - a filesystem that is correctly configured
> according to published hardware capability.

Right

So, for example, if the atomic writes option is set and the rtextsize 
set by the user is so much larger than what HW can support in terms of 
atomic writes, then we should let the user know about this.

> 
> Everything else is in the "make it behave correctly, but we don't
> care if it's slow" category.
> 
>> This check is not done as it is not fixed what the bdev can do in terms of
>> atomic writes - or, more specifically, what they request_queue reports is
>> not be fixed. There are things which can change this. For example, a FW
>> update could change all the atomic write capabilities of a disk. Or even if
>> we swapped a SCSI disk into another host the atomic write limits may change,
>> as the atomic write unit max depends on the SCSI HBA DMA limits. Changing
>> BIO_MAX_VECS - which could come from a kernel update - could also change
>> what we report as atomic write limit in the request queue.
> 
> If that sort of thing happens, then that's too bad. We already have
> these sorts of "do not do if you care about performance"
> constraints. e.g. don't do a RAID restripe that changes the
> alignment/size of the RAID device (e.g. add a single disk and make
> the stripe width wider) because the physical filesystem structure
> will no longer be aligned to the underlying hardware. instead, you
> have to grow striped volumes with compatible stripes in compatible
> sizes to ensure the filesystem remains aligned to the storage...
> 
> We don't try to cater for every single possible permutation of
> storage hardware configurations - that way lies madness. Design and
> optimise for the common case of correctly configured and well
> behaved storage, and everything else we just don't care about beyond
> "don't corrupt or lose data".

ok

> 
>>>>> And therein lies the problem.
>>>>>

..

>>
>> That sounds fine.
>>
>> My question then is how we determine this max atomic write size granularity.
>>
>> We don't explicitly tell the FS what atomic write size we want for a file.
>> Rather we mkfs with some extsize value which should match our atomic write
>> maximal value and then tell the FS we want to do atomic writes on a file,
>> and if this is accepted then we can query the atomic write min and max unit
>> size, and this would be [FS block size, min(bdev atomic write limit,
>> rtexsize)].
>>
>> If rtextsize is 16KB, then we have a good idea that we want 16KB atomic
>> writes support. So then we could use rtextsize as this max atomic write
>> size.
> 
> Maybe, but I think continuing to focus on this as 'atomic writes
> requires' is wrong.
> 
> The filesystem does not carea bout atomic writes. What it cares
> about is the allocation constraints that need to be implemented.
> That constraint is that all BMBT extent operations need to be
> aligned to a specific extent size, not filesystem blocks.
> 
> The current extent size hint (and rtextsize) only impact the
> -allocation of extents-. They are not directly placing constraints
> on the BMBT layout, they are placing constraints on the free space
> search that the allocator runs on the BNO/CNT btrees to select an
> extent that is then inserted into the BMBT.
> 
> The problem is that unwritten extent conversion, truncate, hole
> punching, etc also all need to be correctly aligned for files that
> are configured to support atomic writes. These operations place
> constraints on how the BMBT can modify the existing extent list.
> 
> These are different constraints to what rtextsize/extszhint apply,
> and that's the fundamental behavioural difference between existing
> extent size hint behaviour and the behaviour needed by atomic
> writes.
> 
>> But I am not 100% sure that it your idea (apologies if I am wrong - I
>> am sincerely trying to follow your idea), but rather it would be
>> min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and bdev
>> atomic write limit is 16KB, then there is no much point in dealing in 1MB
>> blocks for this unwritten extent conversion alignment.
> 
> Exactly my point - there really is no relationship between rtextsize
> and atomic write constraints and that it is a mistake to use
> rtextsize as it stands as a placeholder for atomic write
> constraints.
> 

ok

>> If so, then my
>> concern is that the bdev atomic write upper limit is not fixed. This can
>> solved, but I would still like to be clear on this max atomic write size.
> 
> Right, we haven't clearly defined how XFS is supposed align BMBT
> operations in a way that is compatible for atomic write operations.
> 
> What the patchset does is try to extend and infer things from
> existing allocation alignment constraints, but then not apply those
> constraints to critical extent state operations (pure BMBT
> modifications) that atomic writes also need constrained to work
> correctly and efficiently.

Right. Previously I also did mention that we could explicitly request 
the atomic write size per-inode, but a drawback is that this would 
require an on-disk format change.

> 
>>> i.e. atomic writes need to use max write size granularity for all IO
>>> operations, not filesystem block granularity.
>>>
>>> And that also means things like rtextsize and extsize hints need to
>>> match these atomic write requirements, too....
>>
>> As above, I am not 100% sure if you mean these to be the atomic write
>> maximal value.
> 
> Yes, they either need to be the same as the atomic write max value
> or, much better, once a hint has been set, then resultant size is
> then aligned up to be compatible with the atomic write constraints.
> 
> e.g. set an inode extent size hint of 960kB on a device with 128kB
> atomic write capability. If the inode has the atomic write flag set,
> then allocations need to round the extent size up from 960kB to 1MB
> so that the BMBT extent layout and alignment is compatible with 128kB
> atomic writes.
> 
> At this point, zeroing, punching, unwritten extent conversion, etc
> also needs to be done on aligned 128kB ranges to be comaptible with
> atomic writes, rather than filesysetm block boundaries that would
> normally be used if just the extent size hint was set.
> 
> This is effectively what the original "force align" inode flag bit
> did - it told the inode that all BMBT manipulations need to be
> extent size hint aligned, not just allocation. It's a generic flag
> that implements specific extent manipulation constraints that happen
> to be compatible with atomic writes when the right extent size hint
> is set.
> 
> So at this point, I'm not sure that we should have an "atomic
> writes" flag in the inode. 

Another motivation for this flag is that we can explicitly enable some 
software-based atomic write support for an inode when the backing device 
does not have HW support.

In addition, in this series setting FS_XFLAG_ATOMICWRITES means 
XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something 
similar for other OSes, and for those other OSes it may also mean some 
other special alignment feature enabled. We want a consistent user 
experience.

> We need to tell BMBT modifications
> to behave in a particular way - forced alignment - not that atomic
> writes are being used in the filesystem....
> 
> At this point, the filesystem can do all the extent modification
> alignment stuff that atomic writes require without caring if the
> block device supports atomic writes or even if the
> application is using atomic writes.
> 
> This means we can test the BMBT functionality in fstests without
> requiring hardware (or emulation) that supports atomic writes - all
> we need to do is set the forced align flag, an extent size hint and
> go run fsx on it...
> 

The current idea was that the forcealign feature would be required for 
the second phase for atomic write support - non-rtvol support. Darrick 
did send that series out separately over the New Year's break.

I think that you wanted to progress the following series first:
https://lore.kernel.org/linux-xfs/20231004001943.349265-1-david@fromorbit.com/

Right?

Thanks,
John