Message-ID: <0c0753fb-8a35-42a6-8698-b141b1e561ca@oracle.com>
Date: Wed, 22 Jan 2025 10:45:34 +0000
From: John Garry <john.g.garry@...cle.com>
To: Christoph Hellwig <hch@....de>, "Darrick J. Wong" <djwong@...nel.org>
Cc: Dave Chinner <david@...morbit.com>, brauner@...nel.org, cem@...nel.org,
dchinner@...hat.com, ritesh.list@...il.com, linux-xfs@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
martin.petersen@...cle.com
Subject: Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes
On 22/01/2025 06:42, Christoph Hellwig wrote:
> On Fri, Jan 17, 2025 at 10:49:34AM -0800, Darrick J. Wong wrote:
>> The trouble is that the br_startoff attribute of cow staging mappings
>> aren't persisted on disk anywhere, which is why exchange-range can't
>> handle the cow fork. You could open an O_TMPFILE and swap between the
>> two files, though that gets expensive per-io unless you're willing to
>> stash that temp file somewhere.
>
> Needing another inode is better than trying to steal ranges from the
> actual inode we're operating on. But we might just need a different
> kind of COW staging for that.
>
>>
>> At this point I think we should slap the usual EXPERIMENTAL warning on
>> atomic writes through xfs and let John land the simplest multi-fsblock
>> untorn write support, which only handles the corner case where all the
>> stars are <cough> aligned; and then make an exchange-range prototype
>> and/or all the other forcealign stuff.
>
> That is the worst of all possible outcomes. Coming up with an
> atomic API that fails for random reasons only on aged file systems
> is literally the worst thing we can do. NAK.
>
>
I did my own quick PoC which uses CoW as the fallback for atomic writes
to misaligned blocks.
I am finding that the block allocator often returns blocks which are
misaligned wrt the atomic write length, like this:
# xfs_bmap -v mnt/file
mnt/file:
 EXT: FILE-OFFSET    BLOCK-RANGE   AG AG-OFFSET       TOTAL FLAGS
   0: [0..20479]:    192..20671     0 (192..20671)    20480 000000
#
#
# xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
# xfs_bmap -v mnt/file
mnt/file:
 EXT: FILE-OFFSET    BLOCK-RANGE   AG AG-OFFSET       TOTAL FLAGS
   0: [0..127]:      20672..20799   0 (20672..20799)    128 000000
   1: [128..20479]:  320..20671     0 (320..20671)    20352 000000
#
# xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
# xfs_bmap -v mnt/file
mnt/file:
 EXT: FILE-OFFSET    BLOCK-RANGE   AG AG-OFFSET       TOTAL FLAGS
   0: [0..127]:      20928..21055   0 (20928..21055)    128 000000
   1: [128..20479]:  320..20671     0 (320..20671)    20352 000000
In this case we would not use HW offload (as no start blocks are
64K-aligned), which will affect performance.
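To make the alignment claim concrete, here is a small sketch that checks the extent start blocks from the listings above against the 64K atomic write length. It assumes xfs_bmap's usual 512-byte block units; the helper name misalignment() is made up for this illustration.

```python
SECTOR_SIZE = 512        # assumption: xfs_bmap -v reports 512-byte blocks
ATOMIC_LEN = 64 * 1024   # atomic write length used in the pwrite above


def misalignment(start_sector, unit_bytes=ATOMIC_LEN):
    """Byte distance of an extent start past the previous unit boundary."""
    return (start_sector * SECTOR_SIZE) % unit_bytes


# Start blocks of every extent shown in the three xfs_bmap listings.
extent_starts = [192, 20672, 320, 20928]

for start in extent_starts:
    off = misalignment(start)
    state = "aligned" if off == 0 else f"misaligned by {off} bytes"
    print(f"start block {start}: {state}")
```

Every start block in the listings comes out 32K past a 64K boundary, so none of them would qualify for the HW offload path.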
Since we are not considering forcealign at the moment, can we still
consider passing some other alignment hint to the block allocator? It
could be handled similarly to stripe alignment.
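Roughly what I mean by a hint, as opposed to a hard requirement, is sketched below. This is not the XFS allocator code; pick_start() and its parameters are hypothetical, and all values are in filesystem blocks. It prefers an aligned start within a free extent and falls back to an unaligned start when the aligned one does not fit.

```python
def pick_start(free_start, free_len, want_len, align):
    """Pick a start block inside a free extent, preferring alignment.

    Returns (start, is_aligned), or None if the extent is too small.
    Hypothetical helper for illustration only.
    """
    # Round the start of the free extent up to the next alignment boundary.
    aligned_start = -(-free_start // align) * align
    free_end = free_start + free_len

    # Best case: the aligned allocation still fits in the free extent.
    if aligned_start + want_len <= free_end:
        return aligned_start, True

    # Hint only: fall back to an unaligned allocation instead of failing.
    if want_len <= free_len:
        return free_start, False

    return None


# A free extent starting halfway into a 16-block alignment unit.
print(pick_start(2584, 100, 16, 16))
print(pick_start(2584, 20, 16, 16))
```

With such a fallback the write still succeeds on a fragmented filesystem; it just loses the HW offload for that extent.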
Some other thoughts:
- I am not sure what atomic write unit max we would now use.
- Anything written back with CoW/exchange range will need FUA to ensure
that the write is fully persisted. Otherwise, I think that the disk
could report the data as written while it is only partially persisted
if power fails later.