Message-ID: <ZeoFQnVYLLBLNL6J@casper.infradead.org>
Date: Thu, 7 Mar 2024 18:19:46 +0000
From: Matthew Wilcox <willy@...radead.org>
To: Jan Kara <jack@...e.cz>
Cc: "Yin, Fengwei" <fengwei.yin@...el.com>, Yujie Liu <yujie.liu@...el.com>,
	Oliver Sang <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
	lkp@...el.com, linux-kernel@...r.kernel.org,
	Andrew Morton <akpm@...ux-foundation.org>,
	Guo Xuenan <guoxuenan@...wei.com>, linux-fsdevel@...r.kernel.org,
	ying.huang@...el.com, feng.tang@...el.com
Subject: Re: [linus:master] [readahead] ab4443fe3c: vm-scalability.throughput
 -21.4% regression

On Thu, Mar 07, 2024 at 10:23:08AM +0100, Jan Kara wrote:
> Thanks for testing! This is an interesting result and certainly
> unexpected to me. The readahead code allocates naturally aligned pages,
> so based on the distribution of allocations it seems that before commit
> ab4443fe3ca6 the readahead window was aligned to at least 32 pages
> (128KB) and so we allocated order-5 pages. After the commit, the
> readahead window somehow ended up aligned to only 20 modulo 32. To
> follow natural alignment and fill the 128KB readahead window, we
> allocated an order-2 page (getting us to offset 24 modulo 32), then an
> order-3 page (getting us to offset 0 modulo 32), then an order-4 page
> (anything larger would no longer fit in the 128KB readahead window),
> and finally an order-2 page to finish filling the readahead window.
> 
> Now I'm not 100% sure why the readahead window alignment changed with
> the different rounding when placing the readahead mark - that's
> probably an artifact of the readahead window being tiny in the
> beginning, before we scale it up (I'll verify by tracing whether
> everything ends up looking correct with the current code). So I don't
> expect this is a problem in ab4443fe3ca6 as such, but it exposes the
> issue that the readahead page insertion code should perhaps strive for
> better alignment of the readahead window with the logical file offset,
> even at the cost of occasionally performing somewhat shorter readahead.
> I'll look into this once I dig out of the huge heap of email after
> vacation...
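
To make the arithmetic above concrete, here is a small userspace sketch
(plain C, not kernel code; ffs_order() and the window constants are
illustrative stand-ins for __ffs() and the real window bounds) that
mimics natural-alignment order selection for a 32-page window starting
at offset 20 mod 32:

#include <stdio.h>

/* Largest naturally aligned order at this index (count trailing zeros;
 * GCC/Clang builtin, never called with index == 0 here). */
static unsigned int ffs_order(unsigned long index)
{
	return __builtin_ctzl(index);
}

int main(void)
{
	unsigned long index = 20;	/* window starts at 20 mod 32 */
	unsigned long end = 52;		/* 32 pages = 128KB window */
	unsigned int max_order = 5;	/* order-5 folio = 32 pages */

	while (index < end) {
		unsigned int order = max_order;

		/* A folio must be naturally aligned to its size... */
		if (index & ((1UL << order) - 1))
			order = ffs_order(index);
		/* ...and must not extend past the window. */
		while (index + (1UL << order) > end)
			order--;

		printf("index %2lu: order %u\n", index, order);
		index += 1UL << order;
	}
	return 0;
}

This prints orders 2, 3, 4, 2 - exactly the distribution described above.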

I was surprised by what you said here, so I went and re-read the code
and it doesn't work the way I thought it did.  So I had a good long think
about how it _should_ work, and I looked for some more corner conditions,
and this is what I came up with.

The first thing I've done is separate out the two limits.  The EOF is
a hard limit; we will not allocate pages beyond EOF.  The ra->size is
a soft limit; we will allocate pages beyond ra->size, but not too far.

The second thing I noticed is that index + ra->size could wrap.  So add
a check for that, and clamp the limit to ULONG_MAX.  index + ra->size -
ra->async_size could also wrap, but this is harmless.  We certainly don't
want to kick off any more readahead in this circumstance, so leaving
'mark' outside the range [index..ULONG_MAX] is just fine.
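
For illustration, the wrap check amounts to this (a minimal userspace
sketch; pgoff_t is just unsigned long here and the values are made up):

#include <assert.h>
#include <limits.h>

typedef unsigned long pgoff_t;

int main(void)
{
	pgoff_t index = ULONG_MAX - 10;	/* window starts near the top */
	pgoff_t size = 32;		/* 32-page readahead window */
	pgoff_t limit = index + size;	/* unsigned wrap: 21 */

	/* After wrapping, the end of the window appears before its start. */
	if (limit < index)
		limit = ULONG_MAX;	/* clamp instead of wrapping */

	assert(limit == ULONG_MAX);
	return 0;
}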

The third thing is that we could allocate a folio which contains a page
at ULONG_MAX.  We don't really want that in the page cache; it makes
filesystems more complicated if they have to check for that, and we
don't allow an order-0 folio at ULONG_MAX, so there's no need for it.
This _should_ already be prohibited by the "Don't allocate pages past EOF"
check, but let's explicitly prohibit it.
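
Again for illustration, the wrap-avoidance check in the patch below
catches exactly the case where a folio would end at page ULONG_MAX
(userspace sketch, made-up values):

#include <assert.h>
#include <limits.h>

int main(void)
{
	unsigned long index = ULONG_MAX - 3;	/* order-2 folio would cover
						 * pages up to ULONG_MAX */
	unsigned int order = 2;

	/* index + folio size wraps to 0 iff the folio contains ULONG_MAX. */
	if (index + (1UL << order) == 0)
		order--;		/* shrink; now ends at ULONG_MAX - 2 */

	assert(order == 1 && index + (1UL << order) != 0);
	return 0;
}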

Compile tested only.

diff --git a/mm/readahead.c b/mm/readahead.c
index 130c0e7df99f..742e1f39035b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -488,7 +488,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
 {
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t index = readahead_index(ractl);
-	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+	pgoff_t last = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+	pgoff_t limit = index + ra->size;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
@@ -496,23 +497,26 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	if (!mapping_large_folio_support(mapping) || ra->size < 4)
 		goto fallback;
 
-	limit = min(limit, index + ra->size - 1);
-
 	if (new_order < MAX_PAGECACHE_ORDER) {
 		new_order += 2;
 		new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
 		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
 	}
 
+	if (limit < index)
+		limit = ULONG_MAX;
 	filemap_invalidate_lock_shared(mapping);
-	while (index <= limit) {
+	while (index < limit) {
 		unsigned int order = new_order;
 
 		/* Align with smaller pages if needed */
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
+		/* Avoid wrap */
+		if (index + (1UL << order) == 0)
+			order--;
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (index + (1UL << order) - 1 > last)
 			order--;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
