[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aPiPG1-VDV7ZV2_F@dread.disaster.area>
Date: Wed, 22 Oct 2025 19:00:27 +1100
From: Dave Chinner <david@...morbit.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Kiryl Shutsemau <kirill@...temov.name>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>,
Matthew Wilcox <willy@...radead.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org,
Suren Baghdasaryan <surenb@...gle.com>
Subject: Re: [PATCH] mm/filemap: Implement fast short reads
On Tue, Oct 21, 2025 at 06:25:30PM -1000, Linus Torvalds wrote:
> On Tue, 21 Oct 2025 at 13:39, Dave Chinner <david@...morbit.com> wrote:
> >
> > > > > 1. Locate a folio in XArray.
> > > > > 2. Obtain a reference on the folio using folio_try_get().
> > > > > 3. If successful, verify that the folio still belongs to
> > > > > the mapping and has not been truncated or reclaimed.
> >
> > What about if it has been hole-punched?
>
> The sequence number check should take care of anything like that. Do
> you have any reason to believe it doesn't?
Invalidation doing partial folio zeroing isn't covered by the page
cache delete sequence number.
> Yes, you can get the "before or after or between" behavior, but you
> can get that with perfectly regular reads that take the refcount on
> the page.
Yes, and it is the "in between" behaviour that is the problem here.
Hole punching (and all the other fallocate() operations) are
supposed to be atomic w.r.t. user IO. i.e. you should see either the
non-punched data or the punched data, never a mix of the two. A mix
of the two is a transient data corruption event....
This invalidation race does not exist on XFS, even on this
new fast path. We protect all buffered reads with the
inode->i_rwsem, so we guarantee they can't race
with fallocate() operations performing invalidations because
fallocate() takes the i_rwsem exclusively. This IO exclusion model
was baked into the XFS locking architecture over 30 years ago.
The problem is the other filesystems don't use this sort of IO
exclusion model (ext4, btrfs, etc) but instead use the page cache
folio locking to only avoid concurrent modification to the same file
range.
Hence they are exposed to this transient state because they rely on
folio locks for arbitrating concurrent access to the page cache and
buffered reads run completely unlocked. i.e. because....
> Reads have never taken the page lock, and have never been serialized that way.
... they are exposed to transient data states in the page cache
during invalidation operations. The i_size checks avoid these
transient states for truncation, but there are no checks that can be
made to avoid them for other sources of invalidation operations.
The mapping->invalidate_lock only prevents page cache instantiation
from occurring, allowing filesystems to create a barrier that
prevents page cache repopulation after invalidation until the
invalidate lock is dropped. This allows them to complete the
fallocate() operation before exposing the result to users.
However, it does not prevent buffered read cache hits racing with
overlapping invalidation operations, and that's the problem I'm
pointing out that this new fast path will still hit, even though it
tries to bounce-buffer it's way around transient states.
So, yes, you are right when you say:
> So the fast case changes absolutely nothing in this respect that I can see.
But that does not make the existing page cache behaviour desirable.
Reading corrupt data faster is still reading corrupt data :/
Really, though, I'm only mentioning this stuff beacuse both the
author of the patch and the reviewer did not seem to know how i_size
is used in this code to avoid truncate races. truncate races are the
simple part of the problem, hole punching is a lot harder to get
right. If the author hasn't thought about it, it is likely there
are subtle bugs lurking....
-Dave.
--
Dave Chinner
david@...morbit.com
Powered by blists - more mailing lists