Message-ID: <44ubh4cybuwsb4b6na3m4h3yrjbweiso5pafzgf57a4wgzd235@pgl54elpqgxa>
Date: Mon, 20 Oct 2025 12:33:08 +0100
From: Kiryl Shutsemau <kirill@...temov.name>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: David Hildenbrand <david@...hat.com>, 
	Matthew Wilcox <willy@...radead.org>, Linus Torvalds <torvalds@...ux-foundation.org>, 
	Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, 
	linux-mm@...ck.org, linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Suren Baghdasaryan <surenb@...gle.com>
Subject: Re: [PATCH] mm/filemap: Implement fast short reads

On Sun, Oct 19, 2025 at 09:53:28PM -0700, Andrew Morton wrote:
> On Fri, 17 Oct 2025 15:15:36 +0100 Kiryl Shutsemau <kirill@...temov.name> wrote:
> 
> > From: Kiryl Shutsemau <kas@...nel.org>
> > 
> > The protocol for page cache lookup is as follows:
> > 
> >   1. Locate a folio in XArray.
> >   2. Obtain a reference on the folio using folio_try_get().
> >   3. If successful, verify that the folio still belongs to
> >      the mapping and has not been truncated or reclaimed.
> >   4. Perform operations on the folio, such as copying data
> >      to userspace.
> >   5. Release the reference.
> > 
> > For short reads, the overhead of atomic operations on reference
> > manipulation can be significant, particularly when multiple tasks access
> > the same folio, leading to cache line bouncing.
> > 
> > To address this issue, introduce i_pages_delete_seqcnt, which increments
> > each time a folio is deleted from the page cache and implement a modified
> > page cache lookup protocol for short reads:
> > 
> >   1. Locate a folio in XArray.
> >   2. Take note of the i_pages_delete_seqcnt.
> >   3. Copy the data to a local buffer on the stack.
> >   4. Verify that the i_pages_delete_seqcnt has not changed.
> >   5. Copy the data from the local buffer to the iterator.
> > 
> > If any issues arise in the fast path, fallback to the slow path that
> > relies on the refcount to stabilize the folio.
> 
> Well this is a fun patch.

Yes! :P
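
For context, the refcount step that makes the same-file case bounce a
cache line is folio_try_get() in the existing lookup. Simplified from
filemap_get_entry() (xas setup and shadow-entry handling omitted), it is
roughly:

	rcu_read_lock();
repeat:
	folio = xas_load(&xas);
	if (xas_retry(&xas, folio))
		goto repeat;
	/* Step 2: the atomic every reader of this folio hits */
	if (!folio_try_get(folio))
		goto repeat;
	/* Step 3: raced with truncate/reclaim? Drop the ref and retry */
	if (unlikely(folio != xas_reload(&xas))) {
		folio_put(folio);
		goto repeat;
	}
	rcu_read_unlock();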

> > The new approach requires a local buffer in the stack. The size of the
> > buffer determines which read requests are served by the fast path. Set
> > the buffer to 1k. This seems to be a reasonable amount of stack usage
> > for the function at the bottom of the call stack.
> 
> A use case for alloca() or equiv.  That would improve the average-case
> stack depth but not the worst-case.

__kstack_alloca()/__builtin_alloca() would work, and it bypasses the
-Wframe-larger-than warning.

But I don't see any real users.

My understanding is that alloca() is similar in semantics to the VLAs
that we eliminated from the kernel. I am not sure it is a good idea.

Another option is to have a per-CPU buffer, but it is less cache friendly.
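
A minimal sketch of that alternative (struct/variable names hypothetical),
just to show the shape:

	/*
	 * Hypothetical per-CPU bounce buffer instead of the on-stack one.
	 * It has to stay private across both filemap_read_fast_rcu() and
	 * the later copy_to_iter(), which can fault, so the preemption
	 * handling gets awkward -- and the buffer is more likely to be
	 * cold in cache than a just-touched stack frame.
	 */
	struct fast_read_buf {
		char data[FAST_READ_BUF_SIZE];
	};
	static DEFINE_PER_CPU(struct fast_read_buf, filemap_fast_read_buf);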

> The 1k guess-or-giggle is crying out for histogramming - I bet 0.1k
> covers the great majority.  I suspect it wouldn't be hard to add a new
> ad-hoc API into memory allocation profiling asking it to profile
> something like this for us, given an explicit request to do so.

Let me see what I can do.

> Is there really no way to copy the dang thing straight out to
> userspace, skip the bouncing?

I don't see a way to do it in a safe manner.

In case of a race with folio deletion we risk leaking data to userspace:
the folio we are reading from can be freed and re-allocated from under
us to some other random user. The bounce buffer lets us stabilize the
data before exposing it to userspace.

> > The fast read approach demonstrates significant performance
> > improvements, particularly in contended cases.
> > 
> > 16 threads, reads from 4k file(s), mean MiB/s (StdDev)
> > 
> >  -------------------------------------------------------------
> > | Block |  Baseline  |  Baseline   |  Patched   |  Patched    |
> > | size  |  same file |  diff files |  same file | diff files  |
> >  -------------------------------------------------------------
> > |     1 |    10.96   |     27.56   |    30.42   |     30.4    |
> > |       |    (0.497) |     (0.114) |    (0.130) |     (0.158) |
> > |    32 |   350.8    |    886.2    |   980.6    |    981.8    |
> > |       |   (13.64)  |     (2.863) |    (3.361) |     (1.303) |
> > |   256 |  2798      |   7009.6    |  7641.4    |   7653.6    |
> > |       |  (103.9)   |    (28.00)  |   (33.26)  |    (25.50)  |
> 
> tl;dr, 256-byte reads from the same file nearly tripled.

Yep.

> > |  1024 | 10780      |  27040      | 29280      |  29320      |
> > |       |  (389.8)   |    (89.44)  |  (130.3)   |    (83.66)  |
> > |  4096 | 43700      | 103800      | 48420      | 102000      |
> > |       | (1953)     |    (447.2)  | (2012)     |     (0)     |
> >  -------------------------------------------------------------
> > 
> > 16 threads, reads from 1M file(s), mean MiB/s (StdDev)
> > 
> >  --------------------------------------------------------------
> > | Block |  Baseline   |  Baseline   |  Patched    |  Patched   |
> > | size  |  same file  |  diff files |  same file  | diff files |
> >  --------------------------------------------------------------
> > |     1 |     26.38   |     27.34   |     30.38   |    30.36   |
> > |       |     (0.998) |     (0.114) |     (0.083) |    (0.089) |
> > |    32 |    824.4    |    877.2    |    977.8    |   975.8    |
> > |       |    (15.78)  |     (3.271) |     (2.683) |    (1.095) |
> > |   256 |   6494      |   6992.8    |   7619.8    |   7625     |
> > |       |   (116.0)   |    (32.39)  |    (10.66)  |    (28.19) |
> > |  1024 |  24960      |  26840      |  29100      |  29180     |
> > |       |   (606.6)   |   (151.6)   |   (122.4)   |    (83.66) |
> > |  4096 |  94420      | 100520      |  95260      |  99760     |
> > |       |  (3144)     |   (672.3)   |  (2874)     |   (134.1)  |
> > | 32768 | 386000      | 402400      | 368600      | 397400     |
> > |       | (36599)     | (10526)     | (47188)     |  (6107)    |
> >  --------------------------------------------------------------
> > 
> > There's also improvement on kernel build:
> > 
> > Base line: 61.3462 +- 0.0597 seconds time elapsed  ( +-  0.10% )
> > Patched:   60.6106 +- 0.0759 seconds time elapsed  ( +-  0.13% )
> > 
> > ...
> >
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> >
> > ...
> >
> > -ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
> > -		ssize_t already_read)
> > +static inline unsigned long filemap_read_fast_rcu(struct address_space *mapping,
> > +						  loff_t pos, char *buffer,
> > +						  size_t size)
> > +{
> > +	XA_STATE(xas, &mapping->i_pages, pos >> PAGE_SHIFT);
> > +	struct folio *folio;
> > +	loff_t file_size;
> > +	unsigned int seq;
> > +
> > +	lockdep_assert_in_rcu_read_lock();
> > +
> > +	/* Give up and go to slow path if raced with page_cache_delete() */
> > +	if (!raw_seqcount_try_begin(&mapping->i_pages_delete_seqcnt, seq))
> > +		return false;
> 
> 	return 0;

Ack.

> > +
> > +	folio = xas_load(&xas);
> > +	if (xas_retry(&xas, folio))
> > +		return 0;
> > +
> > +	if (!folio || xa_is_value(folio))
> > +		return 0;
> > +
> > +	if (!folio_test_uptodate(folio))
> > +		return 0;
> > +
> > +	/* No fast-case if readahead is supposed to started */
> 
> Please expand this comment.  "explain why, not what".  Why do we care
> if it's under readahead?  It's uptodate, so just grab it??

Readahead pages require kicking the readahead machinery with
filemap_readahead(), which is not appropriate for the fast path.

Will rewrite the comment.
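
Perhaps something along these lines:

	/*
	 * PG_readahead means we are expected to kick the readahead
	 * machinery via filemap_readahead() to keep the window going.
	 * That is too much work for the lockless fast path, so punt to
	 * the slow path instead of starving readahead.
	 */
	if (folio_test_readahead(folio))
		return 0;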

> > +	if (folio_test_readahead(folio))
> > +		return 0;
> > +	/* .. or mark it accessed */
> 
> This comment disagrees with the code which it is commenting.

It is a continuation of the comment for the readahead check above. Will
rewrite to make it clear.

Similar to readahead, we don't want to call folio_mark_accessed() from
the fast path.
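
Maybe:

	/*
	 * Likewise, folio_mark_accessed() is more than we want to do
	 * from the lockless fast path, so only serve folios that are
	 * already marked referenced and punt the rest.
	 */
	if (!folio_test_referenced(folio))
		return 0;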

> > +	if (!folio_test_referenced(folio))
> > +		return 0;
> > +
> > +	/* i_size check must be after folio_test_uptodate() */
> 
> why?

There is a comment for i_size_read() in the slow path that indicates it
is required but, to be honest, I don't fully understand the interaction
between uptodate and i_size here.

> > +	file_size = i_size_read(mapping->host);
> > +	if (unlikely(pos >= file_size))
> > +		return 0;
> > +	if (size > file_size - pos)
> > +		size = file_size - pos;
> 
> min() is feeling all sad?

Will make it happy. :)
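
Probably something like:

	size = min_t(size_t, size, file_size - pos);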

> > +	/* Do the data copy */
> 
> We can live without this comment ;)

:P

> > +	size = memcpy_from_file_folio(buffer, folio, pos, size);
> > +	if (!size)
> > +		return 0;
> > +
> > +	/* Give up and go to slow path if raced with page_cache_delete() */
> 
> I wonder if truncation is all we need to worry about here.  For
> example, direct-io does weird stuff.

Direct I/O does page cache invalidation, which also goes via
page_cache_delete().

Reclaim also terminates in page_cache_delete() via
__filemap_remove_folio().
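
So a single bump of the counter in page_cache_delete() covers truncation,
invalidation and reclaim alike. Not quoting the actual hunk here, but the
shape is roughly (the write side is already serialized by the i_pages
lock):

	write_seqcount_begin(&mapping->i_pages_delete_seqcnt);
	xas_store(&xas, shadow);
	write_seqcount_end(&mapping->i_pages_delete_seqcnt);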

> > +	if (read_seqcount_retry(&mapping->i_pages_delete_seqcnt, seq))
> > +		return 0;
> > +
> > +	return size;
> > +}
> > +
> > +#define FAST_READ_BUF_SIZE 1024
> > +
> > +static noinline bool filemap_read_fast(struct kiocb *iocb, struct iov_iter *iter,
> > +				       ssize_t *already_read)
> > +{
> > +	struct address_space *mapping = iocb->ki_filp->f_mapping;
> > +	struct file_ra_state *ra = &iocb->ki_filp->f_ra;
> > +	char buffer[FAST_READ_BUF_SIZE];
> > +	size_t count;
> > +
> > +	if (ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE)
> > +		return false;
> 
> why?  (comment please)

I don't understand the implications of flush_dcache_folio() for
serialization here. Will read up.

> > +	if (iov_iter_count(iter) > sizeof(buffer))
> > +		return false;
> > +
> > +	count = iov_iter_count(iter);
> 
> It'd be a tiny bit tidier to swap the above to avoid evaluating
> iov_iter_count() twice.  Yes, iov_iter_count() happens to be fast, but
> we aren't supposed to know that here.

Okay.
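
I.e.:

	count = iov_iter_count(iter);
	if (count > sizeof(buffer))
		return false;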

> > +	/* Let's see if we can just do the read under RCU */
> > +	rcu_read_lock();
> > +	count = filemap_read_fast_rcu(mapping, iocb->ki_pos, buffer, count);
> > +	rcu_read_unlock();
> > +
> > +	if (!count)
> > +		return false;
> > +
> > +	count = copy_to_iter(buffer, count, iter);
> > +	if (unlikely(!count))
> > +		return false;
> > +
> > +	iocb->ki_pos += count;
> > +	ra->prev_pos = iocb->ki_pos;
> > +	file_accessed(iocb->ki_filp);
> > +	*already_read += count;
> > +
> > +	return !iov_iter_count(iter);
> > +}
> > +
> > +static noinline ssize_t filemap_read_slow(struct kiocb *iocb,
> > +					  struct iov_iter *iter,
> > +					  ssize_t already_read)
> >  {
> >  	struct file *filp = iocb->ki_filp;
> >  	struct file_ra_state *ra = &filp->f_ra;
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
