linux-kernel - Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJnrk1aPs2J_EerLROxtiHAKTyU2NHBkRXpS=-yunEsC9epAWw@mail.gmail.com>
Date: Tue, 10 Feb 2026 14:18:06 -0800
From: Joanne Koong <joannelkoong@...il.com>
To: Wei Gao <wegao@...e.com>
Cc: Sasha Levin <sashal@...nel.org>, willy@...radead.org, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate
 and folio_end_read

On Mon, Feb 9, 2026 at 4:40 PM Wei Gao <wegao@...e.com> wrote:
>
> On Mon, Feb 09, 2026 at 04:20:01PM -0800, Joanne Koong wrote:
> > On Mon, Feb 9, 2026 at 4:12 PM Wei Gao <wegao@...e.com> wrote:
> > >
> > > On Mon, Feb 09, 2026 at 11:08:50AM -0800, Joanne Koong wrote:
> > > > On Fri, Feb 6, 2026 at 11:16 PM Wei Gao <wegao@...e.com> wrote:
> > > > >
> > > > > On Tue, Dec 23, 2025 at 08:31:57PM -0500, Sasha Levin wrote:
> > > > > > On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
> > > > > > > On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@...nel.org> wrote:
> > > > > > > >
> > > > > > >
> > > > > > > Hi Sasha,
> > > > > > >
> > > > > > > Thanks for your patch and for the detailed writeup.
> > > > > >
> > > > > > Thanks for looking into this!
> > > > > >
> > > > > > > > When iomap uses large folios, per-block uptodate tracking is managed via
> > > > > > > > iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
> > > > > > > > to become inconsistent with the folio's uptodate flag.
> > > > > > > >
> > > > > > > > The race occurs because folio_end_read() uses XOR semantics to atomically
> > > > > > > > set the uptodate bit and clear the locked bit:
> > > > > > > >
> > > > > > > >   Thread A (read completion):          Thread B (concurrent write):
> > > > > > > >   --------------------------------     --------------------------------
> > > > > > > >   iomap_finish_folio_read()
> > > > > > > >     spin_lock(state_lock)
> > > > > > > >     ifs_set_range_uptodate() -> true
> > > > > > > >     spin_unlock(state_lock)
> > > > > > > >                                        iomap_set_range_uptodate()
> > > > > > > >                                          spin_lock(state_lock)
> > > > > > > >                                          ifs_set_range_uptodate() -> true
> > > > > > > >                                          spin_unlock(state_lock)
> > > > > > > >                                          folio_mark_uptodate(folio)
> > > > > > > >     folio_end_read(folio, true)
> > > > > > > >       folio_xor_flags()  // XOR CLEARS uptodate!
> > > > > > >
> > > > > > > The part I'm confused about here is how this can happen between a
> > > > > > > concurrent read and write. My understanding is that the folio is
> > > > > > > locked when the read occurs and locked when the write occurs and both
> > > > > > > locks get dropped only when the read or write finishes. Looking at
> > > > > > > iomap code, I see iomap_set_range_uptodate() getting called in
> > > > > > > __iomap_write_begin() and __iomap_write_end() for the writes, but in
> > > > > > > both those places the folio lock is held while this is called. I'm not
> > > > > > > seeing how the read and write race in the diagram can happen, but
> > > > > > > maybe I'm missing something here?
> > > > > >
> > > > > > Hmm, you're right... The folio lock should prevent concurrent read/write
> > > > > > access. Looking at this again, I suspect that FUSE was calling
> > > > > > folio_clear_uptodate() and folio_mark_uptodate() directly without updating the
> > > > > > ifs bits. For example, in fuse_send_write_pages() on write error, it calls
> > > > > > folio_clear_uptodate(folio) which clears the folio flag but leaves ifs still
> > > > > > showing all blocks uptodate?
> > > > >
> > > > > Hi Sasha
> > > > > On PowerPC with 64KB page size, msync04 fails with SIGBUS on NTFS-FUSE. The issue stems from a state inconsistency between
> > > > > the iomap_folio_state (ifs) bitmap and the folio's Uptodate flag.
> > > > > tst_test.c:1985: TINFO: === Testing on ntfs ===
> > > > > tst_test.c:1290: TINFO: Formatting /dev/loop0 with ntfs opts='' extra opts=''
> > > > > Failed to set locale, using default 'C'.
> > > > > The partition start sector was not specified for /dev/loop0 and it could not be obtained automatically.  It has been set to 0.
> > > > > The number of sectors per track was not specified for /dev/loop0 and it could not be obtained automatically.  It has been set to 0.
> > > > > The number of heads was not specified for /dev/loop0 and it could not be obtained automatically.  It has been set to 0.
> > > > > To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
> > > > > Windows will not be able to boot from this device.
> > > > > tst_test.c:1302: TINFO: Mounting /dev/loop0 to /tmp/LTP_msy3ljVxi/msync04 fstyp=ntfs flags=0
> > > > > tst_test.c:1302: TINFO: Trying FUSE...
> > > > > tst_test.c:1953: TBROK: Test killed by SIGBUS!
> > > > >
> > > > > Root Cause Analysis: When a page fault triggers fuse_read_folio, the iomap_read_folio_iter handles the request. For a 64KB page,
> > > > > after fetching 4KB via fuse_iomap_read_folio_range_async, the remaining 60KB (61440 bytes) is zero-filled via iomap_block_needs_zeroing,
> > > > > then iomap_set_range_uptodate marks the folio as Uptodate globally, after folio_xor_flags folio's uptodate become 0 again, finally trigger
> > > > > an SIGBUS issue in filemap_fault.
> > > >
> > > > Hi Wei,
> > > >
> > > > Thanks for your report. afaict, this scenario occurs only if the
> > > > server is a fuseblk server with a block size different from the memory
> > > > page size and if the file size is less than the size of the folio
> > > > being read in.
> > > Thanks for checking this and give quick feedback :)
> > > >
> > > > Could you verify that this snippet from Sasha's patch fixes the issue?:
> > > Yes, Sasha's patch can fixes the issue.
> >
> > I think just those lines I pasted from Sasha's patch is the relevant
> > fix. Could you verify that just those lines (without the changes
> > from the rest of his patch) fixes the issue?
> Yes, i just add two lines change in iomap_set_range_uptodate can fixes
> the issue.

Great, thank you for confirming.

Sasha, would you mind submitting this snippet of your patch as the fix
for the EOF zeroing issue? I think it could be restructured to

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1fe19b4ee2f4..412e661871f8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -87,7 +87,16 @@ static void iomap_set_range_uptodate(struct folio
*folio, size_t off,

        if (ifs) {
                spin_lock_irqsave(&ifs->state_lock, flags);
-               uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
+               /*
+                * If a read is in progress, we must NOT call
folio_mark_uptodate.
+                * The read completion path (iomap_finish_folio_read or
+                * iomap_read_end) will call folio_end_read() which uses XOR
+                * semantics to set the uptodate bit. If we set it here, the XOR
+                * in folio_end_read() will clear it, leaving the folio not
+                * uptodate.
+                */
+               uptodate = ifs_set_range_uptodate(folio, ifs, off, len) &&
+                       !ifs->read_bytes_pending;
                spin_unlock_irqrestore(&ifs->state_lock, flags);
        }

to be a bit more concise.

If you're busy and don't have the bandwidth, I'm happy to forward the
patch on your behalf with your Signed-off-by / authorship.

Thanks,
Joanne
> +               if (uptodate && ifs->read_bytes_pending)
> +                       uptodate = false;
> >
> > Thanks,
> > Joanne
> >
> >
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index e5c1ca440d93..7ceda24cf6a7 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -87,12 +86,50 @@ static void iomap_set_range_uptodate(struct folio
> > > > *folio, size_t off,
> > > >   if (ifs) {
> > > >           spin_lock_irqsave(&ifs->state_lock, flags);
> > > >           uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
> > > >           + /*
> > > >           + * If a read is in progress, we must NOT call folio_mark_uptodate
> > > >           + * here. The read completion path (iomap_finish_folio_read or
> > > >           + * iomap_read_end) will call folio_end_read() which uses XOR
> > > >           + * semantics to set the uptodate bit. If we set it here, the XOR
> > > >           + * in folio_end_read() will clear it, leaving the folio not
> > > >           + * uptodate while the ifs says all blocks are uptodate.
> > > >           + */
> > > >          + if (uptodate && ifs->read_bytes_pending)
> > > >                    + uptodate = false;
> > > >         spin_unlock_irqrestore(&ifs->state_lock, flags);
> > > >   }
> > > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > >
> > > > > So your iomap_set_range_uptodate patch can fix above failed case since it block mark folio's uptodate to 1.
> > > > > Hope my findings are helpful.
> > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks,
> > > > > > Sasha
> > > > > >