linux-kernel - Re: [PATCH] mm/filemap: Implement fast short reads

Message-ID: <dupgze7vl2vvndyasmm34ebhzxzumv3sz425qvbquruzvqgf4r@q66h2eeaxs7h>
Date: Thu, 23 Oct 2025 11:37:24 +0200
From: Jan Kara <jack@...e.cz>
To: Dave Chinner <david@...morbit.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>, 
	Kiryl Shutsemau <kirill@...temov.name>, Andrew Morton <akpm@...ux-foundation.org>, 
	David Hildenbrand <david@...hat.com>, Matthew Wilcox <willy@...radead.org>, 
	Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, 
	linux-mm@...ck.org, linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Suren Baghdasaryan <surenb@...gle.com>
Subject: Re: [PATCH] mm/filemap: Implement fast short reads

On Thu 23-10-25 18:50:46, Dave Chinner wrote:
> On Wed, Oct 22, 2025 at 05:31:12AM -1000, Linus Torvalds wrote:
> > On Tue, 21 Oct 2025 at 22:00, Dave Chinner <david@...morbit.com> wrote:
> > >
> > > On Tue, Oct 21, 2025 at 06:25:30PM -1000, Linus Torvalds wrote:
> > > >
> > > > The sequence number check should take care of anything like that. Do
> > > > you have any reason to believe it doesn't?
> > >
> > > Invalidation doing partial folio zeroing isn't covered by the page
> > > cache delete sequence number.
> > 
> > Correct - but neither is it covered by anything else in the *regular* read path.
> > 
> > So the sequence number protects against the same case that the
> > reference count protects against: hole punching removing the whole
> > page.
> > 
> > Partial page hole-punching will fundamentally show half-way things.
> 
> Only when you have a busted implementation of the spec.
> 
> Think about it: if I said "partial page truncation will
> fundamentally show half-way things", you would shout at me that
> truncate must -never- expose half-way things to buffered reads.
> This is how truncate is specified to behave, and we don't violate
> the spec just because it is hard to implement it.

Well, as a matter of fact we can expose part-way results of truncate on
ext4 and similar filesystems that do not serialize reads against truncate
with the inode lock. In particular, for ext4 there's the i_size check in
filemap_read(), but if that check passes before the truncate starts, the
code copying data out of the pages can race with truncate zeroing out the
tail of the last page.
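
To spell out the interleaving, here is a toy userspace model in Python. It
is only a sketch, not kernel code: ToyFile and its methods are made-up
names, a single bytearray stands in for the last page of the file, and the
race is replayed as a fixed sequence of steps instead of real threads.

```python
PAGE_SIZE = 4096  # assumed page size for the toy model


class ToyFile:
    """Hypothetical stand-in for an inode with one cached page."""

    def __init__(self, data: bytes):
        self.i_size = len(data)
        self.page = bytearray(data.ljust(PAGE_SIZE, b"\0"))

    def read_check(self, count: int) -> int:
        # Models the i_size check: clamp the request to the current size.
        return min(count, self.i_size)

    def read_copy(self, nbytes: int) -> bytes:
        # Models copying data out of the page after the check has passed.
        return bytes(self.page[:nbytes])

    def truncate(self, new_size: int) -> None:
        # Models a shrinking truncate: update i_size, then zero the page tail.
        self.i_size = new_size
        self.page[new_size:] = bytes(PAGE_SIZE - new_size)


f = ToyFile(b"A" * PAGE_SIZE)
n = f.read_check(PAGE_SIZE)   # reader: size check passes with the old i_size
f.truncate(1024)              # truncate runs between the check and the copy
data = f.read_copy(n)         # reader: copies from the already zeroed page

old = b"A" * PAGE_SIZE
print(len(data), data == old, data[1024:].count(0))
```

In this replay the reader gets the full 4096 bytes it asked for, but only
the first 1024 are old data and the remaining 3072 are zeros written by the
truncate. Every byte was in the file at some point; the read as a whole
matches neither the old nor the new state.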

> We've broken truncate repeatedly over the past 20+ years in ways
> that have exposed stale data to users. This is always considered a
> critical bug that needs to be fixed ASAP.

Exposing data that was never in the file is certainly a critical bug.
Showing a mix of old and new data is not great but less severe, and it seems
that over the years userspace on Linux has learned to live with it and reap
the performance benefit (e.g. for mixed read-write workloads to one file)...

<snip>

> Hence there is really only one behaviour that is required: whilst
> the low level operation is taking place, no external IO (read,
> write, discard, etc) can be performed over that range of the file
> being zeroed because the data and/or metadata is not stable until the
> whole operation is completed by the filesystem.
> 
> Now, this doesn't obviously read on the initial invalidation races
> that are the issue being discussed here because zeros written by
> invalidation could be considered "valid" for hole punch, zero range,
> etc.
> 
> However, consider COLLAPSE_RANGE.  Page cache invalidation
> writing zeros and reads racing with that is a problem, because
> the old data at a given offset is non-zero, whilst the new data at
> the same offset is also non-zero.
> 
> Hence if we allow the initial page cache invalidation to race with
> buffered reads, there is the possibility of random zeros appearing
> in the data being read. Because this is not old or new data, it is
> -corrupt- data.

Well, reasons like this are why for operations like COLLAPSE_RANGE ext4
invalidates the whole range of the page cache from the first affected folio
to the end of the file. So again the user will see either the old data (if
the read got hold of the page before we invalidated the page cache) or the
new data (when the read has to go to disk, which is properly synchronized
with COLLAPSE_RANGE through invalidate_lock). I don't see these speculative
accesses changing anything in this case either.
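
A toy Python model of that "old or new, never zeros" behaviour, under loud
assumptions: ToyInode is a made-up class, pages are 4 bytes, the "disk" is a
bytearray, and there is no real concurrency, so invalidate_lock appears only
as a comment marking where the kernel would serialize the cache fill.

```python
PAGE = 4  # toy page size in bytes


class ToyInode:
    """Hypothetical inode: backing store plus a page cache keyed by index."""

    def __init__(self, data: bytes):
        assert len(data) % PAGE == 0
        self.disk = bytearray(data)
        self.cache = {}  # page index -> immutable page contents

    def read_page(self, idx: int) -> bytes:
        page = self.cache.get(idx)
        if page is None:
            # Cache miss: the real fill would be serialized against
            # COLLAPSE_RANGE by invalidate_lock.
            page = bytes(self.disk[idx * PAGE:(idx + 1) * PAGE])
            self.cache[idx] = page
        return page

    def collapse_range(self, offset: int, length: int) -> None:
        assert offset % PAGE == 0 and length % PAGE == 0
        first = offset // PAGE
        # Drop every cached page from the first affected one to the end.
        for idx in [i for i in self.cache if i >= first]:
            del self.cache[idx]
        # Shift the data on "disk"; nothing is zeroed in place.
        del self.disk[offset:offset + length]


inode = ToyInode(b"AAAABBBBCCCC")
old = inode.read_page(1)      # reader got the page before invalidation
inode.collapse_range(4, 4)    # removes "BBBB", "CCCC" moves down
new = inode.read_page(1)      # miss, refilled from the shifted disk data
print(old, new)
```

The first read of page 1 returns the old contents and the second returns
the new ones. Because the pages are removed and never zeroed in place, there
is no step in this model at which a reader could see zeros at that offset.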
 
> Put simply, these fallocate operations should *never* see partial
> invalidation data, and so the "old or new data" rule *must* apply to
> the initial page cache invalidation these fallocate() operations do.
> 
> Hence various fallocate() operations need to act as a full IO
> barrier. Buffered IO, page faults and direct IO all must be blocked
> and drained before the invalidation of the range begins, and must
> not be allowed to start again until after the whole operation
> completes.

Hum, I'm not sure I follow you correctly, but what you describe doesn't seem
like how ext4 works. There are two different things: zeroing of the partial
folios affected by truncate, hole punch, and zero range (other fallocate
operations don't zero out), and invalidation of the page cache folios. For
ext4 it is actually the removal of folios from the page cache during
invalidation plus holding invalidate_lock that synchronizes with reads.
As such, zeroing of partial folios *can* actually race with reads within
these partial folios, and so you can get a mix of zeros and old data from
reads.
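
The difference between the two mechanisms can be sketched with another toy
Python model. Again everything here is an assumption made for illustration:
punch_hole() is a made-up helper, pages are 8 bytes, the page cache is a
dict of bytearrays, and the racing reader is replayed step by step.

```python
PAGE = 8  # toy page size in bytes


def punch_hole(cache: dict, start: int, end: int) -> None:
    """Toy hole punch over the byte range [start, end)."""
    for idx in list(cache):
        lo, hi = idx * PAGE, (idx + 1) * PAGE
        if start <= lo and hi <= end:
            # Fully covered page: removed from the cache.
            del cache[idx]
        elif lo < end and start < hi:
            # Partially covered page: zeroed in place, stays in the cache.
            s, e = max(start, lo) - lo, min(end, hi) - lo
            cache[idx][s:e] = bytes(e - s)


cache = {0: bytearray(b"A" * PAGE), 1: bytearray(b"B" * PAGE)}

page = cache[0]               # reader found the page and holds a reference
out = bytes(page[:4])         # reader copies the first half
punch_hole(cache, 2, 16)      # hole punch zeroes page 0 from byte 2 on
out += bytes(page[4:])        # reader copies the second half
print(out, 1 in cache)
```

The reader ends up with four bytes of old data followed by four zeros. That
matches neither the old page (all "A") nor the new one ("AA" followed by six
zeros). Page 1, which was fully covered, is simply gone from the cache, so
a later reader has to refill it.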

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR