Message-ID: <aPd1aKVFImgXpULn@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Tue, 21 Oct 2025 17:28:32 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Brian Foster <bfoster@...hat.com>
Cc: John Garry <john.g.garry@...cle.com>, Zorro Lang <zlang@...hat.com>,
fstests@...r.kernel.org, Ritesh Harjani <ritesh.list@...il.com>,
djwong@...nel.org, tytso@....edu, linux-xfs@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-ext4@...r.kernel.org
Subject: Re: [PATCH v7 04/12] ltp/fsx.c: Add atomic writes support to fsx
On Tue, Oct 21, 2025 at 07:30:32AM -0400, Brian Foster wrote:
> On Tue, Oct 21, 2025 at 03:58:23PM +0530, Ojaswin Mujoo wrote:
> > On Mon, Oct 20, 2025 at 11:33:40AM +0100, John Garry wrote:
> > > On 06/10/2025 14:20, Ojaswin Mujoo wrote:
> > > > Hi Zorro, thanks for checking this. So correct me if I'm wrong, but I
> > > > understand that you have run this test on an atomic-writes-enabled
> > > > kernel where the stack also supports atomic writes.
> > > >
> > > > Looking at the bad data log:
> > > >
> > > > +READ BAD DATA: offset = 0x1c000, size = 0x1803, fname = /mnt/xfstests/test/junk
> > > > +OFFSET GOOD BAD RANGE
> > > > +0x1c000 0x0000 0xcdcd 0x0
> > > > +operation# (mod 256) for the bad data may be 205
> > > >
> > > > We see that 0x0000 was expected but we got 0xcdcd. Now the operation
> > > > that caused this is indicated to be 205, but looking at that operation:
> > > >
> > > > +205(205 mod 256): ZERO 0x6dbe6 thru 0x6e6aa (0xac5 bytes)
> > > >
> > > > This doesn't even overlap the range that is bad (0x1c000 to 0x1c00f).
> > > > In fact, it seems like an unlikely coincidence: fsx guesses the
> > > > operation number from the bad data byte itself, and the bad data here
> > > > is 0xcdcd (0xcd == 205), which is also the default pattern that
> > > > xfs_io -c "pwrite" writes (fsx itself writes random data at even
> > > > offsets and the operation number at odd ones).
> > > >
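> > > > To illustrate, here's a tiny standalone model of that attribution
> > > > (not fsx's actual code; the byte values are taken from the log above):
> > > >
> > > > #include <stdio.h>
> > > >
> > > > int main(void)
> > > > {
> > > >         unsigned char expected = 0x00;  /* hole: should read zeroes */
> > > >         unsigned char actual = 0xcd;    /* stale xfs_io pwrite pattern */
> > > >
> > > >         /* fsx derives the suspected op# from the bad byte itself,
> > > >          * so stale 0xcd data gets reported as "op 205" */
> > > >         if (expected != actual)
> > > >                 printf("operation# (mod 256) for the bad data may be %u\n",
> > > >                        actual & 0xffu);
> > > >         return 0;
> > > > }
> > > >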
> > > > I am able to replicate this on XFS but not on ext4 (at least not in
> > > > 20 runs). I'm trying to better understand whether this is a test
> > > > issue or not. Will keep you updated.
> > >
> > >
> > > Hi Ojaswin,
> > >
> > > Sorry for the very slow response.
> > >
> > > Are you still checking this issue?
> > >
> > > To replicate, should I just take the latest xfs kernel and run this
> > > series on top of the latest xfstests? Is it 100% reproducible?
> > >
> > > Thanks,
> > > John
> >
> > Hi John,
> >
> > Yes, I'm looking into it, but I'm now starting to run into some
> > reflink/COW based concepts that are taking time to understand. Let me
> > share what I have so far:
> >
> > So the test.sh that I'm using can be found here [1]; it just uses an
> > fsx replay file (which replays all the operations) present in the same
> > repo [2]. If you look at the replay file, there is a bunch of random
> > operations followed by the last 2 commented-out operations:
> >
> > # copy_range 0xd000 0x1000 0x1d800 0x44000 <--- # format: <src start> <len> <copy dest> <filesize (can be ignored)>
> > # mapread 0x1e000 0x1000 0x1e400 *
> >
> > The copy_range here is the one that causes (or exposes) the corruption
> > at 0x1e800 (the end of the copy_range destination gets corrupted).
> >
> > To have more control, I commented out these 2 operations and am doing
> > them by hand in the test.sh file with xfs_io. I'm also using a
> > non-atomic-write device, so we only have the S/W fallback.
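> >
> > By hand, that's roughly the following (an untested sketch; the offsets
> > are from the replay file above, $FILE is a placeholder for the fsx test
> > file, and the copy source is the file itself):
> >
> > xfs_io -c "copy_range -s 0xd000 -d 0x1d800 -l 0x1000 $FILE" $FILE
> > xfs_io -c "mmap -r 0x1e000 0x1000" -c "mread -v 0x1e000 0x1000" $FILE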
> >
> > Now some observations:
> >
> > 1. The copy_range operation is actually copying from a hole to a hole,
> > so we should be reading all 0s. But what I see is the following
> > happening:
> >
> > vfs_copy_file_range
> > do_splice_direct
> > do_splice_direct_actor
> > do_splice_read
> > # Adds the folio at src offset to the pipe. I confirmed this is all 0x0.
> > splice_direct_to_actor
> > direct_splice_actor
> > do_splice_from
> > iter_file_splice_write
> > xfs_file_write_iter
> > xfs_file_buffered_write
> > iomap_file_buffered_write
> > iomap_iter
> > xfs_buffered_write_iomap_begin
> > # Here we correctly see that there is nothing at the
> > # destination in the data fork, but somehow we find a mapped
> > # extent in the COW fork, which is returned to iomap.
> > iomap_write_iter
> > __iomap_write_begin
> > # Here we notice the folio is not uptodate and call
> > # iomap_read_folio_range() to read from the COW fork
> > # mapping we found earlier. This results in the folio having
> > # incorrect data at offset 0x1e800.
> >
> > So it seems like the fsx operations might be corrupting the COW fork
> > state somehow, leading to stale data exposure.
> >
> > 2. If we disable atomic writes, we don't hit the issue.
> >
> > 3. If I do a -c pread of the destination range before doing the
> > copy_range operation, then I don't see the corruption any more.
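> >
> > i.e., roughly (same $FILE placeholder as the sketch above):
> >
> > xfs_io -c "pread 0x1d800 0x1000" $FILE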
> >
> > I'm now trying to figure out why the mapping returned is not
> > IOMAP_HOLE, as it should be. I don't know the COW path in XFS well, so
> > there are some gaps in my understanding. Let me know if you need any
> > other information, since I'm reliably able to replicate this on
> > 6.17.0-rc4.
> >
>
> I haven't followed your issue closely, but just on this hole vs. COW
> thing, XFS has a bit of a quirk where speculative COW fork preallocation
> can expand out over holes in the data fork. If iomap lookup for buffered
> write sees COW fork blocks present, it reports those blocks as the
> primary mapping even if the data fork happens to be a hole (since
> there's no point in allocating blocks to the data fork when we can just
> remap).
>
> Again, I've no idea if this relates to your issue or what you're
> referring to as a hole (i.e. data fork only?), but just pointing it out.
> The latest iomap/xfs patches I posted a few days ago kind of dance
> around this a bit, but I was somewhat hoping that maybe the cleanups
> there would trigger some thoughts on better iomap reporting in that
> regard.

Hi Brian, thanks for the details, and yes, by "hole" I did mean a hole in
the data fork only. The part that I'm now confused about is: does this
sort of preallocated extent hold any valid data? IIUC it should not, so I
would expect it to trigger iomap_block_needs_zeroing() and have zeroes
written to the folio. Instead, what I see in the issue is that we attempt
a disk read.
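
For reference, this is the check I'd expect to fire, lightly reformatted
from fs/iomap/buffered-io.c (modulo version drift, so take it as a
sketch):

static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
		loff_t pos)
{
	const struct iomap *srcmap = iomap_iter_srcmap(iter);

	/* zero the folio range instead of reading it from disk when the
	 * source is a hole/unwritten, freshly allocated, or beyond EOF */
	return srcmap->type != IOMAP_MAPPED ||
		(srcmap->flags & IOMAP_F_NEW) ||
		pos >= i_size_read(iter->inode);
}

If that's still accurate, then for a disk read to happen the COW mapping
must be coming back as IOMAP_MAPPED without IOMAP_F_NEW set, which is the
part I need to dig into.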
Regards,
ojaswin
>
> Brian
>
> > [1]
> > https://github.com/OjaswinM/fsx-aw-issue/tree/master
> >
> > [2] https://github.com/OjaswinM/fsx-aw-issue/blob/master/repro.fsxops
> >
> > regards,
> > ojaswin
> >
>