linux-kernel - Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes


Message-ID: <0c0753fb-8a35-42a6-8698-b141b1e561ca@oracle.com>
Date: Wed, 22 Jan 2025 10:45:34 +0000
From: John Garry <john.g.garry@...cle.com>
To: Christoph Hellwig <hch@....de>, "Darrick J. Wong" <djwong@...nel.org>
Cc: Dave Chinner <david@...morbit.com>, brauner@...nel.org, cem@...nel.org,
        dchinner@...hat.com, ritesh.list@...il.com, linux-xfs@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        martin.petersen@...cle.com
Subject: Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes

On 22/01/2025 06:42, Christoph Hellwig wrote:
> On Fri, Jan 17, 2025 at 10:49:34AM -0800, Darrick J. Wong wrote:
>> The trouble is that the br_startoff attribute of cow staging mappings
>> isn't persisted on disk anywhere, which is why exchange-range can't
>> handle the cow fork.  You could open an O_TMPFILE and swap between the
>> two files, though that gets expensive per-io unless you're willing to
>> stash that temp file somewhere.
> 
> Needing another inode is better than trying to steal ranges from the
> actual inode we're operating on.  But we might just need a different
> kind of COW staging for that.
> 
>>
>> At this point I think we should slap the usual EXPERIMENTAL warning on
>> atomic writes through xfs and let John land the simplest multi-fsblock
>> untorn write support, which only handles the corner case where all the
>> stars are <cough> aligned; and then make an exchange-range prototype
>> and/or all the other forcealign stuff.
> 
> That is the worst of all possible outcomes.  Coming up with an
> atomic API that fails for random reasons only on aged file systems
> is literally the worst thing we can do.  NAK.
> 
> 

I did my own quick PoC which uses CoW as the fallback for atomic writes 
to misaligned blocks.

I am finding that the block allocator is often giving misaligned blocks 
wrt atomic write length, like this:

# xfs_bmap -v mnt/file
mnt/file:
  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
    0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
#
#
#xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
#xfs_bmap -v mnt/file
mnt/file:
  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
    0: [0..127]:        20672..20799      0 (20672..20799)     128 000000
    1: [128..20479]:    320..20671        0 (320..20671)     20352 000000
#
#xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
#xfs_bmap -v mnt/file
mnt/file:
  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
    0: [0..127]:        20928..21055      0 (20928..21055)     128 000000
    1: [128..20479]:    320..20671        0 (320..20671)     20352 000000

In this case we would not use HW offload (as no start blocks are 
64K-aligned), which will affect performance.
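
To make the alignment claim concrete, here is a small sketch that checks 
the start blocks from the BLOCK-RANGE column above. It assumes the usual 
xfs_bmap convention of 512-byte units, and that the filesystem itself 
starts at a 64K-aligned offset on the device; the helper name is made up 
for illustration.

```python
SECTOR_SIZE = 512  # xfs_bmap reports offsets and blocks in 512-byte units

def is_aligned(start_sector, write_len_bytes):
    """True if an extent's first disk block is a multiple of the write length."""
    sectors_per_write = write_len_bytes // SECTOR_SIZE
    return start_sector % sectors_per_write == 0

# Start blocks copied from the three xfs_bmap listings above.
start_blocks = [192, 20672, 20928, 320]
atomic_write_len = 64 * 1024  # 64K, i.e. 128 sectors

# Distance (in sectors) of each start block past the previous 64K boundary.
misalignment = {s: s % (atomic_write_len // SECTOR_SIZE) for s in start_blocks}

for start, off in misalignment.items():
    state = "aligned" if is_aligned(start, atomic_write_len) else "misaligned"
    print(f"start block {start}: {state}, {off} sectors past a 64K boundary")
```

Every start block lands 64 sectors (32K) past a 64K boundary, so none of 
these extents could be written with a single 64K HW atomic write.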

Since we are not considering forcealign ATM, can we still consider some 
other alignment hint to the block allocator? It could be similar to how 
stripe alignment is handled.
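
As a rough illustration of what I mean by a hint (this is not how the 
XFS allocator is implemented, and the function names are hypothetical): 
given a free extent, round the candidate start up to the atomic write 
length and only use it if the requested length still fits.

```python
def round_up(value, align):
    """Round value up to the next multiple of align."""
    return (value + align - 1) // align * align

def aligned_candidate(free_start, free_len, want_len, align):
    """Return an aligned start inside the free extent
    [free_start, free_start + free_len), or None if nothing fits.
    All values are in the same unit (e.g. 512-byte sectors)."""
    start = round_up(free_start, align)
    if start + want_len <= free_start + free_len:
        return start
    return None  # hint cannot be honoured; caller falls back to any block

# A free extent starting at sector 20672 with room to spare: the hint
# skips 64 sectors to reach the next 64K (128-sector) boundary.
print(aligned_candidate(20672, 1024, 128, 128))
# A free extent of exactly 128 sectors at the same start cannot be aligned.
print(aligned_candidate(20672, 128, 128, 128))
```

Being a hint, failure to find an aligned start would not fail the 
allocation; it would just mean that the write takes the CoW fallback.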

Some other thoughts:
- I am not sure what atomic write unit max we would now use.
- Anything written back with CoW/exchange range will need FUA to ensure 
that the write is fully persisted. Without FUA, I think that the disk 
could report the data as written while only part of it has actually 
been persisted, so a later power fail could leave the write torn.