Message-ID: <20251021174406.GR6178@frogsfrogsfrogs>
Date: Tue, 21 Oct 2025 10:44:06 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc: Brian Foster <bfoster@...hat.com>, John Garry <john.g.garry@...cle.com>,
Zorro Lang <zlang@...hat.com>, fstests@...r.kernel.org,
Ritesh Harjani <ritesh.list@...il.com>, tytso@....edu,
linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-ext4@...r.kernel.org
Subject: Re: [PATCH v7 04/12] ltp/fsx.c: Add atomic writes support to fsx
On Tue, Oct 21, 2025 at 05:28:32PM +0530, Ojaswin Mujoo wrote:
> On Tue, Oct 21, 2025 at 07:30:32AM -0400, Brian Foster wrote:
> > On Tue, Oct 21, 2025 at 03:58:23PM +0530, Ojaswin Mujoo wrote:
> > > On Mon, Oct 20, 2025 at 11:33:40AM +0100, John Garry wrote:
> > > > On 06/10/2025 14:20, Ojaswin Mujoo wrote:
> > > > > Hi Zorro, thanks for checking this. So correct me if I'm wrong, but I
> > > > > understand that you ran this test on an atomic-writes-enabled kernel
> > > > > where the stack also supports atomic writes.
> > > > >
> > > > > Looking at the bad data log:
> > > > >
> > > > > +READ BAD DATA: offset = 0x1c000, size = 0x1803, fname = /mnt/xfstests/test/junk
> > > > > +OFFSET GOOD BAD RANGE
> > > > > +0x1c000 0x0000 0xcdcd 0x0
> > > > > +operation# (mod 256) for the bad data may be 205
> > > > >
> > > > > We see that 0x0000 was expected but we got 0xcdcd. Now the operation
> > > > > that caused this is indicated to be 205, but looking at that operation:
> > > > >
> > > > > +205(205 mod 256): ZERO 0x6dbe6 thru 0x6e6aa (0xac5 bytes)
> > > > >
> > > > > This doesn't even overlap the bad range (0x1c000 to 0x1c00f).
> > > > > In fact, it seems like an unlikely coincidence that the actual data
> > > > > in the bad range is 0xcdcd, which is what xfs_io -c "pwrite" writes
> > > > > by default (fsx writes random data at even offsets and the operation
> > > > > number at odd ones).
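> > > > >
> > > > > (For reference, xfs_io's pwrite fills its write buffer with 0xcd unless
> > > > > you give it an alternate seed with -S, so on a scratch file (the path
> > > > > here is only for illustration):
> > > > >
> > > > >   $ xfs_io -f -c "pwrite 0 4k" /mnt/scratch/foo          # buffer filled with 0xcd
> > > > >   $ xfs_io -f -c "pwrite -S 0xab 0 4k" /mnt/scratch/foo  # alternate pattern
> > > > >
> > > > > which is why 0xcdcd in the bad range looks like data written by some
> > > > > earlier xfs_io pwrite rather than anything fsx itself wrote.)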
> > > > >
> > > > > I am able to replicate this on XFS but not on ext4 (at least not
> > > > > in 20 runs). I'm trying to better understand whether this is a test
> > > > > issue or not. Will keep you updated.
> > > >
> > > >
> > > > Hi Ojaswin,
> > > >
> > > > Sorry for the very slow response.
> > > >
> > > > Are you still checking this issue?
> > > >
> > > > To replicate, should I just take the latest xfs kernel and run this series
> > > > on top of the latest xfstests? Is it 100% reproducible?
> > > >
> > > > Thanks,
> > > > John
> > >
> > > Hi John,
> > >
> > > Yes, I'm looking into it, but I'm now starting to run into some
> > > reflink/CoW concepts that are taking time to understand. Let me share
> > > what I have so far:
> > >
> > > The test.sh that I'm using can be found here [1]; it just uses an fsx
> > > replay file (which replays all the operations) present in the same repo
> > > [2]. If you look at the replay file, there are a bunch of random
> > > operations followed by these last 2 commented-out operations:
> > >
> > > # copy_range 0xd000 0x1000 0x1d800 0x44000 <--- # operations <start> <len> <dest of copy> <filesize (can be ignored)>
> > > # mapread 0x1e000 0x1000 0x1e400 *
> > >
> > > The copy_range here is the one that causes (or exposes) the corruption
> > > at 0x1e800 (the end of the copy_range destination gets corrupted).
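> > >
> > > (In case you want to replay it directly: with the fsx built from this
> > > series it should be roughly
> > >
> > >   $ ./ltp/fsx --replay-ops repro.fsxops /mnt/xfstests/test/junk
> > >
> > > with the paths adjusted to your setup; that's essentially all test.sh
> > > does.)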
> > >
> > > To have more control, I commented out these 2 operations and am doing
> > > them by hand in the test.sh file with xfs_io. I'm also using a device
> > > without atomic write support, so we only exercise the S/W fallback.
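> > >
> > > (Concretely, "by hand" means something along these lines, with the
> > > offsets taken from the two commented-out ops, a plain pread standing in
> > > for the mapread, and the path being whatever file fsx ran against:
> > >
> > >   $ xfs_io -c "copy_range -s 0xd000 -d 0x1d800 -l 0x1000 /mnt/xfstests/test/junk" \
> > >            /mnt/xfstests/test/junk
> > >   $ xfs_io -c "pread -v 0x1e000 0x1000" /mnt/xfstests/test/junk
> > >
> > > and then checking whether that read comes back non-zero.)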
> > >
> > > Now some observations:
> > >
> > > 1. The copy_range operation is actually copying from a hole to a hole,
> > > so we should be reading all 0s. But what I see happening is the following:
> > >
> > > vfs_copy_file_range
> > > do_splice_direct
> > > do_splice_direct_actor
> > > do_splice_read
> > > # Adds the folio at the src offset to the pipe. I confirmed this is all 0x0.
> > > splice_direct_to_actor
> > > direct_splice_actor
> > > do_splice_from
> > > iter_file_splice_write
> > > xfs_file_write_iter
> > > xfs_file_buffered_write
> > > iomap_file_buffered_write
> > > iomap_iter
> > > xfs_buffered_write_iomap_begin
> > > # Here we correctly see that there is nothing at the
> > > # destination in the data fork, but somehow we find a mapped
> > > # extent in the cow fork, which is returned to iomap.
> > > iomap_write_iter
> > > __iomap_write_begin
> > > # Here we notice the folio is not uptodate and call
> > > # iomap_read_folio_range() to read from the cow fork
> > > # mapping we found earlier. This results in the folio having
> > > # incorrect data at offset 0x1e800.
> > >
> > > So it seems like the fsx operations might somehow be corrupting the cow
> > > fork state, leading to stale data exposure.
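> > >
> > > (One way to eyeball that state is to dump the CoW fork extents right
> > > before issuing the copy_range, e.g.
> > >
> > >   $ xfs_io -c "bmap -celpv" /mnt/xfstests/test/junk
> > >
> > > since bmap -c prints the CoW fork mappings instead of the data fork.)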
> > >
> > > 2. If we disable atomic writes, we don't hit the issue.
> > >
> > > 3. If I do a -c pread of the destination range before doing the
> > > copy_range operation then I don't see the corruption any more.
Yeah, I stopped seeing failures after adding -X (verify data after every
operation) to FSX_AVOID.
> > > I'm now trying to figure out why the mapping returned is not IOMAP_HOLE
> > > as it should be. I don't know the COW path in xfs well, so there are some
> > > gaps in my understanding. Let me know if you need any other information,
> > > since I'm able to reliably replicate this on 6.17.0-rc4.
> > >
> >
> > I haven't followed your issue closely, but just on this hole vs. COW
> > thing, XFS has a bit of a quirk where speculative COW fork preallocation
> > can expand out over holes in the data fork. If iomap lookup for buffered
> > write sees COW fork blocks present, it reports those blocks as the
> > primary mapping even if the data fork happens to be a hole (since
> > there's no point in allocating blocks to the data fork when we can just
> > remap).
That sounds like a bug -- if a sub-fsblock write to an uncached file
range has to read data in from disk, then xfs needs to pass the data
fork mapping to iomap even if it's a read.
Can you capture the ftrace output of the iomap_iter_*map tracepoints?
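Something like this should do it, assuming tracefs is mounted in the
usual place (trace-cmd works just as well if you prefer it):

   # cd /sys/kernel/tracing
   # echo 1 > events/iomap/enable
   # echo > trace
   # <run the reproducer>
   # cat trace > /tmp/iomap.trace

That enables the whole iomap event group, which includes the
iomap_iter_*map tracepoints.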
> > Again I've no idea if this relates to your issue or what you're
> > referring to as a hole (i.e. data fork only?), but just pointing it out.
> > The latest iomap/xfs patches I posted a few days ago kind of dance
> > around this a bit, but I was somewhat hoping that maybe the cleanups
> > there would trigger some thoughts on better iomap reporting in that
> > regard.
>
> Hi Brian, thanks for the details, and yes, by "hole" I did mean a hole
> in the data fork only. The part that I'm now confused about is whether
> this sort of preallocated extent holds any valid data. IIUC it should
> not, so I
No. Mappings in the cow fork are not fully written and should never be
used for reads.
> would expect iomap_block_needs_zeroing() to fire and zeroes to be
> written to the folio. Instead, what I see in this issue is that we try
> to do a disk read.
Hrm. Part of the problem here might be that iomap_read_folio_range
ignores iomap_iter::srcmap if its type is IOMAP_HOLE (see
iomap_iter_srcmap), even if the filesystem actually *set* the srcmap to
a hole.
FWIW I see a somewhat different failure -- not data corruption, but
pwrite returning failure:
--- /run/fstests/bin/tests/generic/521.out 2025-07-15 14:45:15.100315255 -0700
+++ /var/tmp/fstests/generic/521.out.bad 2025-10-21 10:33:39.032263811 -0700
@@ -1,2 +1,668 @@
QA output created by 521
+dowrite: write: Input/output error
+LOG DUMP (661 total operations):
+1( 1 mod 256): TRUNCATE UP from 0x0 to 0x1d000
+2( 2 mod 256): DEDUPE 0x19000 thru 0x1bfff (0x3000 bytes) to 0x13000 thru 0x15fff
+3( 3 mod 256): SKIPPED (no operation)
+4( 4 mod 256): PUNCH 0x5167 thru 0x12d1c (0xdbb6 bytes)
+5( 5 mod 256): WRITE 0x79000 thru 0x86fff (0xe000 bytes) HOLE
+6( 6 mod 256): PUNCH 0x32344 thru 0x36faf (0x4c6c bytes)
+7( 7 mod 256): READ 0x0 thru 0xfff (0x1000 bytes)
+8( 8 mod 256): WRITE 0xe000 thru 0x11fff (0x4000 bytes)
+9( 9 mod 256): PUNCH 0x71324 thru 0x86fff (0x15cdc bytes)
+10( 10 mod 256): MAPREAD 0x5b000 thru 0x6d218 (0x12219 bytes)
+11( 11 mod 256): COLLAPSE 0x70000 thru 0x79fff (0xa000 bytes)
+12( 12 mod 256): WRITE 0x41000 thru 0x50fff (0x10000 bytes)
+13( 13 mod 256): INSERT 0x39000 thru 0x4dfff (0x15000 bytes)
+14( 14 mod 256): WRITE 0x34000 thru 0x37fff (0x4000 bytes)
+15( 15 mod 256): MAPREAD 0x55000 thru 0x6ee44 (0x19e45 bytes)
+16( 16 mod 256): READ 0x46000 thru 0x55fff (0x10000 bytes)
+17( 17 mod 256): PUNCH 0x1ccea thru 0x23b2e (0x6e45 bytes)
+18( 18 mod 256): COPY 0x2a000 thru 0x35fff (0xc000 bytes) to 0x52000 thru 0x5dfff
+19( 19 mod 256): SKIPPED (no operation)
+20( 20 mod 256): WRITE 0x10000 thru 0x1ffff (0x10000 bytes)
<snip>
+645(133 mod 256): READ 0x5000 thru 0x16fff (0x12000 bytes)
+646(134 mod 256): PUNCH 0x3a51d thru 0x41978 (0x745c bytes)
+647(135 mod 256): FALLOC 0x47f4c thru 0x54867 (0xc91b bytes) INTERIOR
+648(136 mod 256): WRITE 0xa000 thru 0x1dfff (0x14000 bytes)
+649(137 mod 256): CLONE 0x83000 thru 0x89fff (0x7000 bytes) to 0x4b000 thru 0x51fff
+650(138 mod 256): TRUNCATE DOWN from 0x8bac4 to 0x7e000
+651(139 mod 256): MAPWRITE 0x13000 thru 0x170e6 (0x40e7 bytes)
+652(140 mod 256): XCHG 0x6a000 thru 0x7cfff (0x13000 bytes) to 0x8000 thru 0x1afff
+653(141 mod 256): XCHG 0x35000 thru 0x3cfff (0x8000 bytes) to 0x1b000 thru 0x22fff
+654(142 mod 256): CLONE 0x47000 thru 0x60fff (0x1a000 bytes) to 0x65000 thru 0x7efff
+655(143 mod 256): DEDUPE 0x79000 thru 0x7dfff (0x5000 bytes) to 0x6e000 thru 0x72fff
+656(144 mod 256): XCHG 0x4d000 thru 0x5ffff (0x13000 bytes) to 0x8000 thru 0x1afff
+657(145 mod 256): PUNCH 0x7194f thru 0x7efff (0xd6b1 bytes)
+658(146 mod 256): PUNCH 0x7af7e thru 0x7efff (0x4082 bytes)
+659(147 mod 256): MAPREAD 0x77000 thru 0x7e55d (0x755e bytes)
+660(148 mod 256): READ 0x58000 thru 0x64fff (0xd000 bytes)
+661(149 mod 256): WRITE 0x88000 thru 0x8bfff (0x4000 bytes) HOLE
+Log of operations saved to "/mnt/junk.fsxops"; replay with --replay-ops
+Correct content saved for comparison
+(maybe hexdump "/mnt/junk" vs "/mnt/junk.fsxgood")
Curiously there are no EIO errors logged in dmesg.
--D
> Regards,
> ojaswin
> >
> > Brian
>
> >
> > > [1] https://github.com/OjaswinM/fsx-aw-issue/tree/master
> > >
> > > [2] https://github.com/OjaswinM/fsx-aw-issue/blob/master/repro.fsxops
> > >
> > > regards,
> > > ojaswin
> > >
> >
>