Message-ID: <20251021174406.GR6178@frogsfrogsfrogs>
Date: Tue, 21 Oct 2025 10:44:06 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc: Brian Foster <bfoster@...hat.com>, John Garry <john.g.garry@...cle.com>,
Zorro Lang <zlang@...hat.com>, fstests@...r.kernel.org,
Ritesh Harjani <ritesh.list@...il.com>, tytso@....edu,
linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-ext4@...r.kernel.org
Subject: Re: [PATCH v7 04/12] ltp/fsx.c: Add atomic writes support to fsx
On Tue, Oct 21, 2025 at 05:28:32PM +0530, Ojaswin Mujoo wrote:
> On Tue, Oct 21, 2025 at 07:30:32AM -0400, Brian Foster wrote:
> > On Tue, Oct 21, 2025 at 03:58:23PM +0530, Ojaswin Mujoo wrote:
> > > On Mon, Oct 20, 2025 at 11:33:40AM +0100, John Garry wrote:
> > > > On 06/10/2025 14:20, Ojaswin Mujoo wrote:
> > > > > Hi Zorro, thanks for checking this. So correct me if I'm wrong, but I
> > > > > understand that you ran this test on an atomic-writes-enabled kernel
> > > > > where the stack also supports atomic writes.
> > > > >
> > > > > Looking at the bad data log:
> > > > >
> > > > > +READ BAD DATA: offset = 0x1c000, size = 0x1803, fname = /mnt/xfstests/test/junk
> > > > > +OFFSET GOOD BAD RANGE
> > > > > +0x1c000 0x0000 0xcdcd 0x0
> > > > > +operation# (mod 256) for the bad data may be 205
> > > > >
> > > > > We see that 0x0000 was expected but we got 0xcdcd. Now the operation
> > > > > that caused this is indicated to be 205, but looking at that operation:
> > > > >
> > > > > +205(205 mod 256): ZERO 0x6dbe6 thru 0x6e6aa (0xac5 bytes)
> > > > >
> > > > > This doesn't even overlap the bad range (0x1c000 to 0x1c00f).
> > > > > In fact, it seems like an unlikely coincidence that the actual data
> > > > > in the bad range is 0xcdcd, which is what xfs_io -c "pwrite" writes
> > > > > by default (fsx writes random data at even offsets and the operation
> > > > > number at odd ones).
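> > > > >
> > > > > (For reference, xfs_io's pwrite fills its write buffer with 0xcd unless
> > > > > you give it an alternate seed with -S, so on a scratch file (the path
> > > > > here is only for illustration):
> > > > >
> > > > >   $ xfs_io -f -c "pwrite 0 4k" /mnt/scratch/foo          # buffer filled with 0xcd
> > > > >   $ xfs_io -f -c "pwrite -S 0xab 0 4k" /mnt/scratch/foo  # alternate pattern
> > > > >
> > > > > which is why 0xcdcd in the bad range looks like data written by some
> > > > > earlier xfs_io pwrite rather than anything fsx itself wrote.)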
> > > > >
> > > > > I am able to replicate this on XFS but not on ext4 (at least not
> > > > > in 20 runs). I'm trying to better understand whether this is a test
> > > > > issue or not. Will keep you updated.
> > > >
> > > >
> > > > Hi Ojaswin,
> > > >
> > > > Sorry for the very slow response.
> > > >
> > > > Are you still checking this issue?
> > > >
> > > > To replicate, should I just take the latest xfs kernel and run this series
> > > > on top of the latest xfstests? Is it 100% reproducible?
> > > >
> > > > Thanks,
> > > > John
> > >
> > > Hi John,
> > >
> > > Yes, I'm looking into it, but I'm now starting to run into some
> > > reflink/CoW concepts that are taking time to understand. Let me share
> > > what I have so far:
> > >
> > > The test.sh that I'm using can be found here [1]; it just uses an fsx
> > > replay file (which replays all the operations) present in the same repo
> > > [2]. If you look at the replay file, there are a bunch of random
> > > operations followed by these last 2 commented-out operations:
> > >
> > > # copy_range 0xd000 0x1000 0x1d800 0x44000 <--- # operations <start> <len> <dest of copy> <filesize (can be ignored)>
> > > # mapread 0x1e000 0x1000 0x1e400 *
> > >
> > > The copy_range here is the one that causes (or exposes) the corruption
> > > at 0x1e800 (the end of the copy_range destination gets corrupted).
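> > >
> > > (In case you want to replay it directly: with the fsx built from this
> > > series it should be roughly
> > >
> > >   $ ./ltp/fsx --replay-ops repro.fsxops /mnt/xfstests/test/junk
> > >
> > > with the paths adjusted to your setup; that's essentially all test.sh
> > > does.)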
> > >
> > > To have more control, I commented out these 2 operations and am doing
> > > them by hand in the test.sh file with xfs_io. I'm also using a device
> > > without atomic write support, so we only exercise the S/W fallback.
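> > >
> > > (Concretely, "by hand" means something along these lines, with the
> > > offsets taken from the two commented-out ops, a plain pread standing in
> > > for the mapread, and the path being whatever file fsx ran against:
> > >
> > >   $ xfs_io -c "copy_range -s 0xd000 -d 0x1d800 -l 0x1000 /mnt/xfstests/test/junk" \
> > >            /mnt/xfstests/test/junk
> > >   $ xfs_io -c "pread -v 0x1e000 0x1000" /mnt/xfstests/test/junk
> > >
> > > and then checking whether that read comes back non-zero.)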
> > >
> > > Now some observations:
> > >
> > > 1. The copy_range operation is actually copying from a hole to a hole,
> > > so we should be reading all 0s. But what I see happening is the following:
> > >
> > > vfs_copy_file_range
> > > do_splice_direct
> > > do_splice_direct_actor
> > > do_splice_read
> > > # Adds the folio at the src offset to the pipe. I confirmed this is all 0x0.
> > > splice_direct_to_actor
> > > direct_splice_actor
> > > do_splice_from
> > > iter_file_splice_write
> > > xfs_file_write_iter
> > > xfs_file_buffered_write
> > > iomap_file_buffered_write
> > > iomap_iter
> > > xfs_buffered_write_iomap_begin
> > > # Here we correctly see that there is nothing at the
> > > # destination in the data fork, but somehow we find a mapped
> > > # extent in the cow fork, which is returned to iomap.
> > > iomap_write_iter
> > > __iomap_write_begin
> > > # Here we notice the folio is not uptodate and call
> > > # iomap_read_folio_range() to read from the cow fork
> > > # mapping we found earlier. This results in the folio having
> > > # incorrect data at offset 0x1e800.
> > >
> > > So it seems like the fsx operations might somehow be corrupting the cow
> > > fork state, leading to stale data exposure.
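> > >
> > > (One way to eyeball that state is to dump the CoW fork extents right
> > > before issuing the copy_range, e.g.
> > >
> > >   $ xfs_io -c "bmap -celpv" /mnt/xfstests/test/junk
> > >
> > > since bmap -c prints the CoW fork mappings instead of the data fork.)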
> > >
> > > 2. If we disable atomic writes, we don't hit the issue.
> > >
> > > 3. If I do a -c pread of the destination range before doing the
> > > copy_range operation then I don't see the corruption any more.
Yeah, I stopped seeing failures after adding -X (verify data after every
operation) to FSX_AVOID.
> > > I'm now trying to figure out why the mapping returned is not IOMAP_HOLE
> > > as it should be. I don't know the COW path in xfs well, so there are some
> > > gaps in my understanding. Let me know if you need any other information,
> > > since I'm able to reliably replicate this on 6.17.0-rc4.
> > >
> >
> > I haven't followed your issue closely, but just on this hole vs. COW
> > thing, XFS has a bit of a quirk where speculative COW fork preallocation
> > can expand out over holes in the data fork. If iomap lookup for buffered
> > write sees COW fork blocks present, it reports those blocks as the
> > primary mapping even if the data fork happens to be a hole (since
> > there's no point in allocating blocks to the data fork when we can just
> > remap).
That sounds like a bug -- if a sub-fsblock write to an uncached file
range has to read data in from disk, then xfs needs to pass the data
fork mapping to iomap even if it's a read.
Can you capture the ftrace output of the iomap_iter_*map tracepoints?
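Something like this should do it, assuming tracefs is mounted in the
usual place (trace-cmd works just as well if you prefer it):

   # cd /sys/kernel/tracing
   # echo 1 > events/iomap/enable
   # echo > trace
   # <run the reproducer>
   # cat trace > /tmp/iomap.trace

That enables the whole iomap event group, which includes the
iomap_iter_*map tracepoints.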
> > Again I've no idea if this relates to your issue or what you're
> > referring to as a hole (i.e. data fork only?), but just pointing it out.
> > The latest iomap/xfs patches I posted a few days ago kind of dance
> > around this a bit, but I was somewhat hoping that maybe the cleanups
> > there would trigger some thoughts on better iomap reporting in that
> > regard.
>
> Hi Brian, thanks for the details, and yes, by "hole" I did mean a hole
> in the data fork only. The part that I'm now confused about is whether
> this sort of preallocated extent holds any valid data. IIUC it should
> not, so I
No. Mappings in the cow fork are not fully written and should never be
used for reads.
> would expect iomap_block_needs_zeroing() to fire and zeroes to be
> written to the folio. Instead, what I see in this issue is that we try
> to do a disk read.
Hrm. Part of the problem here might be that iomap_read_folio_range
ignores iomap_iter::srcmap if its type is IOMAP_HOLE (see
iomap_iter_srcmap), even if the filesystem actually *set* the srcmap to
a hole.
FWIW I see a somewhat different failure -- not data corruption, but
pwrite returning failure:
--- /run/fstests/bin/tests/generic/521.out 2025-07-15 14:45:15.100315255 -0700
+++ /var/tmp/fstests/generic/521.out.bad 2025-10-21 10:33:39.032263811 -0700
@@ -1,2 +1,668 @@
QA output created by 521
+dowrite: write: Input/output error
+LOG DUMP (661 total operations):
+1( 1 mod 256): TRUNCATE UP from 0x0 to 0x1d000
+2( 2 mod 256): DEDUPE 0x19000 thru 0x1bfff (0x3000 bytes) to 0x13000 thru 0x15fff
+3( 3 mod 256): SKIPPED (no operation)
+4( 4 mod 256): PUNCH 0x5167 thru 0x12d1c (0xdbb6 bytes)
+5( 5 mod 256): WRITE 0x79000 thru 0x86fff (0xe000 bytes) HOLE
+6( 6 mod 256): PUNCH 0x32344 thru 0x36faf (0x4c6c bytes)
+7( 7 mod 256): READ 0x0 thru 0xfff (0x1000 bytes)
+8( 8 mod 256): WRITE 0xe000 thru 0x11fff (0x4000 bytes)
+9( 9 mod 256): PUNCH 0x71324 thru 0x86fff (0x15cdc bytes)
+10( 10 mod 256): MAPREAD 0x5b000 thru 0x6d218 (0x12219 bytes)
+11( 11 mod 256): COLLAPSE 0x70000 thru 0x79fff (0xa000 bytes)
+12( 12 mod 256): WRITE 0x41000 thru 0x50fff (0x10000 bytes)
+13( 13 mod 256): INSERT 0x39000 thru 0x4dfff (0x15000 bytes)
+14( 14 mod 256): WRITE 0x34000 thru 0x37fff (0x4000 bytes)
+15( 15 mod 256): MAPREAD 0x55000 thru 0x6ee44 (0x19e45 bytes)
+16( 16 mod 256): READ 0x46000 thru 0x55fff (0x10000 bytes)
+17( 17 mod 256): PUNCH 0x1ccea thru 0x23b2e (0x6e45 bytes)
+18( 18 mod 256): COPY 0x2a000 thru 0x35fff (0xc000 bytes) to 0x52000 thru 0x5dfff
+19( 19 mod 256): SKIPPED (no operation)
+20( 20 mod 256): WRITE 0x10000 thru 0x1ffff (0x10000 bytes)
<snip>
+645(133 mod 256): READ 0x5000 thru 0x16fff (0x12000 bytes)
+646(134 mod 256): PUNCH 0x3a51d thru 0x41978 (0x745c bytes)
+647(135 mod 256): FALLOC 0x47f4c thru 0x54867 (0xc91b bytes) INTERIOR
+648(136 mod 256): WRITE 0xa000 thru 0x1dfff (0x14000 bytes)
+649(137 mod 256): CLONE 0x83000 thru 0x89fff (0x7000 bytes) to 0x4b000 thru 0x51fff
+650(138 mod 256): TRUNCATE DOWN from 0x8bac4 to 0x7e000
+651(139 mod 256): MAPWRITE 0x13000 thru 0x170e6 (0x40e7 bytes)
+652(140 mod 256): XCHG 0x6a000 thru 0x7cfff (0x13000 bytes) to 0x8000 thru 0x1afff
+653(141 mod 256): XCHG 0x35000 thru 0x3cfff (0x8000 bytes) to 0x1b000 thru 0x22fff
+654(142 mod 256): CLONE 0x47000 thru 0x60fff (0x1a000 bytes) to 0x65000 thru 0x7efff
+655(143 mod 256): DEDUPE 0x79000 thru 0x7dfff (0x5000 bytes) to 0x6e000 thru 0x72fff
+656(144 mod 256): XCHG 0x4d000 thru 0x5ffff (0x13000 bytes) to 0x8000 thru 0x1afff
+657(145 mod 256): PUNCH 0x7194f thru 0x7efff (0xd6b1 bytes)
+658(146 mod 256): PUNCH 0x7af7e thru 0x7efff (0x4082 bytes)
+659(147 mod 256): MAPREAD 0x77000 thru 0x7e55d (0x755e bytes)
+660(148 mod 256): READ 0x58000 thru 0x64fff (0xd000 bytes)
+661(149 mod 256): WRITE 0x88000 thru 0x8bfff (0x4000 bytes) HOLE
+Log of operations saved to "/mnt/junk.fsxops"; replay with --replay-ops
+Correct content saved for comparison
+(maybe hexdump "/mnt/junk" vs "/mnt/junk.fsxgood")
Curiously there are no EIO errors logged in dmesg.
--D
> Regards,
> ojaswin
> >
> > Brian
>
> >
> > > [1] https://github.com/OjaswinM/fsx-aw-issue/tree/master
> > >
> > > [2] https://github.com/OjaswinM/fsx-aw-issue/blob/master/repro.fsxops
> > >
> > > regards,
> > > ojaswin
> > >
> >
>