linux-kernel - Re: [PATCH] mm/filemap: Implement fast short reads

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <aPgZthYaP7Flda0z@dread.disaster.area>
Date: Wed, 22 Oct 2025 10:39:34 +1100
From: Dave Chinner <david@...morbit.com>
To: Kiryl Shutsemau <kirill@...temov.name>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	David Hildenbrand <david@...hat.com>,
	Matthew Wilcox <willy@...radead.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
	linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Suren Baghdasaryan <surenb@...gle.com>
Subject: Re: [PATCH] mm/filemap: Implement fast short reads

On Mon, Oct 20, 2025 at 12:33:08PM +0100, Kiryl Shutsemau wrote:
> On Sun, Oct 19, 2025 at 09:53:28PM -0700, Andrew Morton wrote:
> > On Fri, 17 Oct 2025 15:15:36 +0100 Kiryl Shutsemau <kirill@...temov.name> wrote:
> > 
> > > From: Kiryl Shutsemau <kas@...nel.org>
> > > 
> > > The protocol for page cache lookup is as follows:
> > > 
> > >   1. Locate a folio in XArray.
> > >   2. Obtain a reference on the folio using folio_try_get().
> > >   3. If successful, verify that the folio still belongs to
> > >      the mapping and has not been truncated or reclaimed.

What about if it has been hole-punched?

The i_size checks after testing the folio is up to date catch races
with truncate down.  This "works" because truncate_setsize() changes
i_size before we invalidate the mapping and so we don't try to
access folios that have pending invalidation.

It also catches the case where the invalidation is only a partial
EOF folio zero (e.g. truncate down within the same EOF folio). In
this case, the deletion sequence number won't catch the invalidation
because no pages are freed from the page cache. Hence reads need to
check i_size to detect this case.

However, fallocate() operations such as hole punching and extent
shifting have exactly the same partial folio invalidation problems
as truncate but they don't change i_size like truncate does (both at
the front and rear edges of the ranges being operated on)

Hence invalidation races with fallocate() operations cannot be
detected via i_size checks and we have to serialise them differently.
fallocate() also requires barriers prevent new page cache operations
whilst the filesystem operation is in progress, so we actually need
the invalidation serialisation to also act as a page cache instantiation
barrier. This is what the mapping->invalidate_lock provides, and I
suspect that this new read fast path doesn't actually work correctly
w.r.t. fallocate() based invalidation because it can't detect races
with partial folio invalidations that are pending nor does it take
the mapping->invalidate_lock....

I also wonder if there might be other subtle races with
->remap_file_range based operations, because they also run
invalidations and need page cache instatiation barriers whilst the
operations run.  At least with XFS, remap operations hold both the
inode->i_rwsem and the mapping->invalidate_lock so nobody can access
the page cache across the destination range being operated on whilst
the extent mapping underlying the file is in flux.

Given these potential issues, I really wonder if this niche fast
path is really worth the potential pain racing against these sorts
of operations could bring us. It also increases the cognitive
load for anyone trying to understand how buffered reads interact
with everything else (i.e. yet another set of race conditions we
have to worry about when thinking about truncate!), and it is not
clear to me that it is (or can be made) safe w.r.t. more complex
invalidation interactions that filesystem have to handle these days.

So: is the benefit for this niche workload really worth the
additional complexity it adds to what is already a very complex set
of behaviours and interactions?

> > > +	if (!folio_test_referenced(folio))
> > > +		return 0;
> > > +
> > > +	/* i_size check must be after folio_test_uptodate() */
> > 
> > why?
> 
> There is comment for i_size_read() in slow path that inidicates that it
> is required, but, to be honest, I don't fully understand interaction
> uptodate vs i_size here.

As per above, it's detecting a race with a concurrent truncate that
is about to invalidate the folio but hasn't yet got to that folio in
the mapping.

This is where we'd also need to detect pending fallocate() or other
invalidations that are in progress, but there's no way to do that
easily....

-Dave.
-- 
Dave Chinner
david@...morbit.com