Message-ID: <aPneVmBuuTHGQBgl@dread.disaster.area>
Date: Thu, 23 Oct 2025 18:50:46 +1100
From: Dave Chinner <david@...morbit.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Kiryl Shutsemau <kirill@...temov.name>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>,
Matthew Wilcox <willy@...radead.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org,
Suren Baghdasaryan <surenb@...gle.com>
Subject: Re: [PATCH] mm/filemap: Implement fast short reads
On Wed, Oct 22, 2025 at 05:31:12AM -1000, Linus Torvalds wrote:
> On Tue, 21 Oct 2025 at 22:00, Dave Chinner <david@...morbit.com> wrote:
> >
> > On Tue, Oct 21, 2025 at 06:25:30PM -1000, Linus Torvalds wrote:
> > >
> > > The sequence number check should take care of anything like that. Do
> > > you have any reason to believe it doesn't?
> >
> > Invalidation doing partial folio zeroing isn't covered by the page
> > cache delete sequence number.
>
> Correct - but neither is it covered by anything else in the *regular* read path.
>
> So the sequence number protects against the same case that the
> reference count protects against: hole punching removing the whole
> page.
>
> Partial page hole-punching will fundamentally show half-way things.
Only when you have a busted implementation of the spec.
Think about it: if I said "partial page truncation will
fundamentally show half-way things", you would shout at me that
truncate must -never- expose half-way things to buffered reads.
This is how truncate is specified to behave, and we don't violate
the spec just because it is hard to implement it.
We've broken truncate repeatedly over the past 20+ years in ways
that have exposed stale data to users. This is always considered a
critical bug that needs to be fixed ASAP.
Hole punching is documented to require the same behaviour as
truncate and, like truncate, any breakage of that is a potential
stale kernel or physical data exposure event.
Why will we not tolerate data corruption bugs in truncate due to
page cache races, yet we are told we should just ignore those same
data corruption races during fallocate operations?
This seems like a double standard to me.
Also, keep in mind that XFS isn't exposed to these bugs, so I have
no real skin in the game here. People need to be made aware of
data integrity issues when they are noticed, not have them swept
under the rug with dubious reasoning.
> > > Yes, you can get the "before or after or between" behavior, but you
> > > can get that with perfectly regular reads that take the refcount on
> > > the page.
> >
> > Yes, and it is the "in between" behaviour that is the problem here.
> >
> > Hole punching (and all the other fallocate() operations) are
> > supposed to be atomic w.r.t. user IO. i.e. you should see either the
> > non-punched data or the punched data, never a mix of the two. A mix
> > of the two is a transient data corruption event....
>
> That "supposed" comes from documentation that has never been true and
> as such is just a bedtime story.
I'll give you the benefit of the doubt here, because you are not
an expert in the field.
That is, I think you are conflating POSIX buffered read/write
syscall atomicity with fallocate() requirements. They are not the
same thing. The fallocate() atomicity requirements come from the
low level data coherency/integrity requirements of the operations
being performed, not from some POSIX spec.
e.g. FALLOC_FL_ZERO_RANGE turns a range of a file to zeros. The
implementation of this can vary. The filesystem can:
- write zeros through the page cache. Users have asked us
not to do this but to return -EOPNOTSUPP instead, so they
can choose the fallback mechanism themselves.
- use block device offloads like WRITE_SAME(0) to have the
device physically zero the range.
- use DISCARD blockdev operations if the device guarantees
discarded regions return zeros.
- punch out all the extents, leaving a hole.
- punch out all the extents, then preallocate new unwritten
extents thereby defragmenting the range at the same time.
- preallocate new unwritten extents over holes and convert
existing written extents to unwritten.
All of these are valid implementations, and they are all multi-step
operations that could expose partial completion to userspace if
there is no IO serialisation against the ZERO_RANGE operation.
Hence there is really only one behaviour that is required: whilst
the low level operation is taking place, no external IO (read,
write, discard, etc) can be performed over that range of the file
being zeroed because the data and/or metadata is not stable until the
whole operation is completed by the filesystem.
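
To illustrate the multi-step nature (this is a made-up example, not
any real filesystem's code; both myfs_* helpers are hypothetical), a
"punch then preallocate unwritten" ZERO_RANGE looks something like
this, and the comment marks the window where racing IO would see
unstable state if it were not excluded:

static int myfs_zero_range(struct inode *inode, loff_t offset, loff_t len)
{
        int error;

        /* Step 1: punch out all existing extents over the range. */
        error = myfs_punch_extents(inode, offset, len);
        if (error)
                return error;

        /*
         * Window: the extent map over the range is not stable here.
         * A buffered write landing in this gap can allocate blocks
         * that step 2 then turns into unwritten extents, and any
         * read/discard sees an intermediate layout. IO must stay
         * excluded across both steps, not just within each one.
         */

        /* Step 2: preallocate unwritten extents over the same range. */
        return myfs_alloc_unwritten_extents(inode, offset, len);
}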
Now, this doesn't obviously bear on the initial invalidation races
that are being discussed here, because zeros written by invalidation
could be considered "valid" for hole punch, zero range, etc.
However, consider COLLAPSE_RANGE. Page cache invalidation
writing zeros and reads racing with that is a problem, because
the old data at a given offset is non-zero, whilst the new data at
the same offset is also non-zero.
Hence if we allow the initial page cache invalidation to race with
buffered reads, there is the possibility of random zeros appearing
in the data being read. Because this is not old or new data, it is
-corrupt- data.
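
If it helps to make the race concrete, here's a rough sketch of a
userspace check. The filename, sizes and offsets are arbitrary, it
needs a filesystem that supports COLLAPSE_RANGE, and whether it ever
trips depends entirely on timing - but the invariant it checks is
exactly the one above: neither the old nor the new data at any
offset is ever zero, so a zero byte in a read is transient
corruption:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK   (1024 * 1024)
#define NCHUNKS 16

static int fd;

/* Neither old nor new data is ever zero, so a zero byte is corruption. */
static void *reader(void *arg)
{
        static char buf[CHUNK];

        for (;;) {
                ssize_t got = pread(fd, buf, sizeof(buf), 0);

                for (ssize_t i = 0; i < got; i++) {
                        if (buf[i] == 0) {
                                fprintf(stderr, "zero at %zd: transient corruption\n", i);
                                exit(1);
                        }
                }
        }
        return NULL;
}

int main(void)
{
        char pattern[CHUNK];
        pthread_t tid;

        memset(pattern, 0xab, sizeof(pattern));
        fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (int i = 0; i < NCHUNKS; i++)
                pwrite(fd, pattern, CHUNK, (off_t)i * CHUNK);

        pthread_create(&tid, NULL, reader, NULL);

        /* Collapse the file away from the front while reads race. */
        while (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, CHUNK) == 0)
                ;
        return 0;
}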
Put simply, these fallocate operations should *never* see partial
invalidation data, and so the "old or new data" rule *must* apply to
the initial page cache invalidation these fallocate() operations do.
Hence various fallocate() operations need to act as a full IO
barrier. Buffered IO, page faults and direct IO all must be blocked
and drained before the invalidation of the range begins, and must
not be allowed to start again until after the whole operation
completes.
IOWs, IO exclusion is a data integrity requirement for fallocate()
operations to prevent users from being exposed to transient data
corruption races whilst the fallocate() operation is being
performed. It has nothing to do with POSIX in any way...
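
For illustration only, that pattern maps roughly onto existing
kernel helpers as below. myfs_do_fallocate_op() is a hypothetical
stand-in for the actual extent manipulation, and note that fully
excluding buffered reads also requires the read path to take the
same exclusion shared (which is what XFS does with its IOLOCK;
generic filesystems don't):

static int myfs_fallocate(struct inode *inode, int mode,
                          loff_t offset, loff_t len)
{
        struct address_space *mapping = inode->i_mapping;
        int error;

        inode_lock(inode);                      /* block new writers/DIO submitters */
        filemap_invalidate_lock(mapping);       /* block page faults instantiating folios */
        inode_dio_wait(inode);                  /* drain in-flight direct IO */

        /* Flush dirty data, then toss the cache over the affected range. */
        error = filemap_write_and_wait_range(mapping, offset, offset + len - 1);
        if (error)
                goto out;
        truncate_pagecache_range(inode, offset, offset + len - 1);

        /* The multi-step extent manipulation runs with IO excluded. */
        error = myfs_do_fallocate_op(inode, mode, offset, len);
out:
        filemap_invalidate_unlock(mapping);
        inode_unlock(inode);
        return error;
}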
Also, keep in mind that fallocate() is only one vector to this race
condition. Multiple filesystems have multiple invalidation vectors
to this race condition. There are similar issues with custom extent
swap ioctls (e.g. for online defragmentation), ensuring
->remap_file_range() and ->copy_file_range() implementations have
exclusive access to the file data and/or metadata they are
manipulating (e.g. offload to hardware copy engines), and so on.
fallocate() is just a convenient example to use because multiple
filesystems implement it, lots of userspace applications use it and
lots of people know about it. It is not niche functionality anymore,
so people should be paying close attention when potential data
corruption vectors are identified in these operations...
> And no, iI'd argue that it's not even documenting desirable behavior,
> because that bedtime story has never been true because it's
> prohibitively expensive.
I disagree, and so do the numbers - IO path exclusion is pretty
cheap for the most part, and not a performance limiting requirement.
e.g. over a range of different benchmarks on 6.15:
https://www.phoronix.com/review/linux-615-filesystems/6
i.e. XFS is 27% faster overall than ext4 and btrfs on a modern NVMe
SSD. Numbers don't lie, and these numbers indicate that the IO path
exclusion that XFS implements to avoid invalidation races doesn't
have any impact on typical production workloads...
That said, I am most definitely not suggesting that the XFS IO path
exclusion is the solution for the specific buffered read vs cache
invalidation race I pointed out. It's just one example of how it can
be solved with little in the way of runtime overhead.
I suspect the invalidation race could be detected at the page cache
layer with a mark-and-sweep invalidation algorithm using xarray tags
kinda like writeback already does. Those tags could be detected in
the read path at little extra cost (the xarray node is already hot
in cache), which could then punt the folio to a slow path that does
the right thing. If there's no invalidation tag present, then the
fast path can safely keep doing its fast path things...
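
Very roughly, and entirely hypothetically - PAGECACHE_TAG_INVALIDATE
doesn't exist, and the page cache already uses all three xarray marks
for dirty/writeback/towrite, so a real implementation would need to
find another bit - the shape of it would be something like:

/* Mark phase: tag every folio in the range before invalidation starts. */
static void mark_range_invalidating(struct address_space *mapping,
                                    pgoff_t start, pgoff_t end)
{
        XA_STATE(xas, &mapping->i_pages, start);
        struct folio *folio;

        xas_lock_irq(&xas);
        xas_for_each(&xas, folio, end) {
                if (xa_is_value(folio))
                        continue;       /* skip shadow/DAX entries */
                /* PAGECACHE_TAG_INVALIDATE is hypothetical, see above. */
                xas_set_mark(&xas, PAGECACHE_TAG_INVALIDATE);
        }
        xas_unlock_irq(&xas);
}

/*
 * Read fast path: the mark lives in the xarray node we just walked,
 * so checking it is nearly free; punt to a slow path if it is set.
 */
static bool folio_needs_slow_read(struct address_space *mapping, pgoff_t index)
{
        return xa_get_mark(&mapping->i_pages, index, PAGECACHE_TAG_INVALIDATE);
}

The invalidation path would then clear the marks over the range once
the whole operation completes (the sweep), re-enabling the fast path.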
> I think it would be much better to fix the documentation, but that's
> generally out of our hands.
We control the fallocate() specification and documentation
ourselves. But that's not the problem here; the problem is that the
cached data invalidation mechanism used by fallocate operations does
not prevent buffered reads from racing against invalidation and
exposing invalid transient data to users...
-Dave.
--
Dave Chinner
david@...morbit.com