Message-ID: <dcdfb58c-5ba7-4015-9446-09d98449f022@redhat.com>
Date: Wed, 22 Oct 2025 19:28:27 +0200
From: David Hildenbrand <david@...hat.com>
To: Kiryl Shutsemau <kirill@...temov.name>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Matthew Wilcox <willy@...radead.org>,
 Linus Torvalds <torvalds@...ux-foundation.org>,
 Alexander Viro <viro@...iv.linux.org.uk>,
 Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>
Cc: linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, Kiryl Shutsemau <kas@...nel.org>
Subject: Re: [PATCH] mm/filemap: Implement fast short reads

On 17.10.25 16:15, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@...nel.org>
> 
> The protocol for page cache lookup is as follows:
> 
>    1. Locate a folio in XArray.
>    2. Obtain a reference on the folio using folio_try_get().
>    3. If successful, verify that the folio still belongs to
>       the mapping and has not been truncated or reclaimed.
>    4. Perform operations on the folio, such as copying data
>       to userspace.
>    5. Release the reference.
> 
> For short reads, the overhead of atomic operations on reference
> manipulation can be significant, particularly when multiple tasks access
> the same folio, leading to cache line bouncing.
> 
> To address this issue, introduce i_pages_delete_seqcnt, which increments
> each time a folio is deleted from the page cache and implement a modified
> page cache lookup protocol for short reads:
> 
>    1. Locate a folio in XArray.
>    2. Take note of the i_pages_delete_seqcnt.
>    3. Copy the data to a local buffer on the stack.
>    4. Verify that the i_pages_delete_seqcnt has not changed.
>    5. Copy the data from the local buffer to the iterator.
> 
> If any issues arise in the fast path, fall back to the slow path that
> relies on the refcount to stabilize the folio.
> 
> The new approach requires a local buffer on the stack. The size of the
> buffer determines which read requests are served by the fast path. Set
> the buffer to 1k. This seems to be a reasonable amount of stack usage
> for the function at the bottom of the call stack.
> 
> The fast read approach demonstrates significant performance
> improvements, particularly in contended cases.
> 
> 16 threads, reads from 4k file(s), mean MiB/s (StdDev)
> 
>   -------------------------------------------------------------
> | Block |  Baseline  |  Baseline   |  Patched   |  Patched    |
> | size  |  same file |  diff files |  same file | diff files  |
>   -------------------------------------------------------------
> |     1 |    10.96   |     27.56   |    30.42   |     30.4    |
> |       |    (0.497) |     (0.114) |    (0.130) |     (0.158) |
> |    32 |   350.8    |    886.2    |   980.6    |    981.8    |
> |       |   (13.64)  |     (2.863) |    (3.361) |     (1.303) |
> |   256 |  2798      |   7009.6    |  7641.4    |   7653.6    |
> |       |  (103.9)   |    (28.00)  |   (33.26)  |    (25.50)  |
> |  1024 | 10780      |  27040      | 29280      |  29320      |
> |       |  (389.8)   |    (89.44)  |  (130.3)   |    (83.66)  |
> |  4096 | 43700      | 103800      | 48420      | 102000      |
> |       | (1953)     |    (447.2)  | (2012)     |     (0)     |
>   -------------------------------------------------------------
> 
> 16 threads, reads from 1M file(s), mean MiB/s (StdDev)
> 
>   --------------------------------------------------------------
> | Block |  Baseline   |  Baseline   |  Patched    |  Patched   |
> | size  |  same file  |  diff files |  same file  | diff files |
>   --------------------------------------------------------------
> |     1 |     26.38   |     27.34   |     30.38   |    30.36   |
> |       |     (0.998) |     (0.114) |     (0.083) |    (0.089) |
> |    32 |    824.4    |    877.2    |    977.8    |   975.8    |
> |       |    (15.78)  |     (3.271) |     (2.683) |    (1.095) |
> |   256 |   6494      |   6992.8    |   7619.8    |   7625     |
> |       |   (116.0)   |    (32.39)  |    (10.66)  |    (28.19) |
> |  1024 |  24960      |  26840      |  29100      |  29180     |
> |       |   (606.6)   |   (151.6)   |   (122.4)   |    (83.66) |
> |  4096 |  94420      | 100520      |  95260      |  99760     |
> |       |  (3144)     |   (672.3)   |  (2874)     |   (134.1)  |
> | 32768 | 386000      | 402400      | 368600      | 397400     |
> |       | (36599)     | (10526)     | (47188)     |  (6107)    |
>   --------------------------------------------------------------
> 
> There's also improvement on kernel build:
> 
> Baseline:  61.3462 +- 0.0597 seconds time elapsed  ( +-  0.10% )
> Patched:   60.6106 +- 0.0759 seconds time elapsed  ( +-  0.13% )
> 
> Co-developed-by: Linus Torvalds <torvalds@...ux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>
> Signed-off-by: Kiryl Shutsemau <kas@...nel.org>
> ---
>   fs/inode.c         |   2 +
>   include/linux/fs.h |   1 +
>   mm/filemap.c       | 150 ++++++++++++++++++++++++++++++++++++++-------
>   3 files changed, 130 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index ec9339024ac3..52163d28d630 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -482,6 +482,8 @@ EXPORT_SYMBOL(inc_nlink);
>   static void __address_space_init_once(struct address_space *mapping)
>   {
>   	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
> +	seqcount_spinlock_init(&mapping->i_pages_delete_seqcnt,
> +			       &mapping->i_pages.xa_lock);
>   	init_rwsem(&mapping->i_mmap_rwsem);
>   	INIT_LIST_HEAD(&mapping->i_private_list);
>   	spin_lock_init(&mapping->i_private_lock);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c895146c1444..c9588d555f73 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -523,6 +523,7 @@ struct address_space {
>   	struct list_head	i_private_list;
>   	struct rw_semaphore	i_mmap_rwsem;
>   	void *			i_private_data;
> +	seqcount_spinlock_t	i_pages_delete_seqcnt;
>   } __attribute__((aligned(sizeof(long)))) __randomize_layout;
>   	/*
>   	 * On most architectures that alignment is already the case; but
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 13f0259d993c..51689c4f3773 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -138,8 +138,10 @@ static void page_cache_delete(struct address_space *mapping,
>   
>   	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>   
> +	write_seqcount_begin(&mapping->i_pages_delete_seqcnt);
>   	xas_store(&xas, shadow);
>   	xas_init_marks(&xas);
> +	write_seqcount_end(&mapping->i_pages_delete_seqcnt);
>   
>   	folio->mapping = NULL;
>   	/* Leave folio->index set: truncation lookup relies upon it */
> @@ -2695,21 +2697,98 @@ static void filemap_end_dropbehind_read(struct folio *folio)
>   	}
>   }
>   
> -/**
> - * filemap_read - Read data from the page cache.
> - * @iocb: The iocb to read.
> - * @iter: Destination for the data.
> - * @already_read: Number of bytes already read by the caller.
> - *
> - * Copies data from the page cache.  If the data is not currently present,
> - * uses the readahead and read_folio address_space operations to fetch it.
> - *
> - * Return: Total number of bytes copied, including those already read by
> - * the caller.  If an error happens before any bytes are copied, returns
> - * a negative error number.
> - */
> -ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
> -		ssize_t already_read)
> +static inline unsigned long filemap_read_fast_rcu(struct address_space *mapping,
> +						  loff_t pos, char *buffer,
> +						  size_t size)
> +{
> +	XA_STATE(xas, &mapping->i_pages, pos >> PAGE_SHIFT);
> +	struct folio *folio;
> +	loff_t file_size;
> +	unsigned int seq;
> +
> +	lockdep_assert_in_rcu_read_lock();
> +
> +	/* Give up and go to slow path if raced with page_cache_delete() */
> +	if (!raw_seqcount_try_begin(&mapping->i_pages_delete_seqcnt, seq))
> +		return 0;
> +
> +	folio = xas_load(&xas);
> +	if (xas_retry(&xas, folio))
> +		return 0;
> +
> +	if (!folio || xa_is_value(folio))
> +		return 0;
> +
> +	if (!folio_test_uptodate(folio))
> +		return 0;
> +
> +	/* No fast-case if readahead is supposed to be started */
> +	if (folio_test_readahead(folio))
> +		return 0;
> +	/* .. or mark it accessed */
> +	if (!folio_test_referenced(folio))
> +		return 0;
> +
> +	/* i_size check must be after folio_test_uptodate() */
> +	file_size = i_size_read(mapping->host);
> +	if (unlikely(pos >= file_size))
> +		return 0;
> +	if (size > file_size - pos)
> +		size = file_size - pos;
> +
> +	/* Do the data copy */
> +	size = memcpy_from_file_folio(buffer, folio, pos, size);

We can be racing with folio splitting or freeing+reallocation, right? So 
our folio might suddenly be a tail page, I assume? Or it might change 
from small to large, or from large to small. Or a large folio can change 
its size?

In that case things like folio_test_uptodate() won't be happy, because 
they will bail out if called on a tail page (see PF_NO_TAIL).


But also, are we sure memcpy_from_file_folio() is safe to use in that 
racy context?


memcpy_from_file_folio() relies on offset_in_folio(), which depends on 
folio_size(). When racing with concurrent folio reuse, 
folio_size()->folio_order() can be tricked into returning a wrong 
number, I guess, making the result of offset_in_folio() rather bogus.

At least that's my understanding after taking a quick look.

The concerns I would have in that case are:
(A) the page pointer we calculate in kmap_local_folio() could be garbage
(B) the "from" pointer we return could be garbage

"garbage" as in pointing at something without a direct map, something 
that's protected differently (MTE? weird CoCo protection?) or even worse 
MMIO with undesired read-effects.


I'll note that some of these concerns would be gone once folios are 
allocated separately: we would not suddenly see a tail page.

But concurrent folio splitting or size changes could still be an issue, 
IIUC what Willy told me today when we discussed the plans about "struct 
folio" freeing.

Maybe I'm wrong, just wanted to raise that these functions are not 
prepared to be called when we are not holding a refcount or cannot 
otherwise guarantee that the folio cannot concurrently be 
freed/reallocated/merged/split.

-- 
Cheers

David / dhildenb

