Message-ID: <ZeoFQnVYLLBLNL6J@casper.infradead.org>
Date: Thu, 7 Mar 2024 18:19:46 +0000
From: Matthew Wilcox <willy@...radead.org>
To: Jan Kara <jack@...e.cz>
Cc: "Yin, Fengwei" <fengwei.yin@...el.com>, Yujie Liu <yujie.liu@...el.com>,
Oliver Sang <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
lkp@...el.com, linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Guo Xuenan <guoxuenan@...wei.com>, linux-fsdevel@...r.kernel.org,
ying.huang@...el.com, feng.tang@...el.com
Subject: Re: [linus:master] [readahead] ab4443fe3c: vm-scalability.throughput -21.4% regression

On Thu, Mar 07, 2024 at 10:23:08AM +0100, Jan Kara wrote:
> Thanks for testing! This is an interesting result and certainly unexpected
> to me. The readahead code allocates naturally aligned pages, so based on
> the distribution of allocations it seems that before commit ab4443fe3ca6
> the readahead window was aligned to at least 32 pages (128KB) and so we
> allocated order-5 pages. After the commit, the readahead window somehow
> ended up aligned only to 20 modulo 32. To follow natural alignment and
> fill the 128KB readahead window, we allocated an order-2 page (getting us
> to offset 24 modulo 32), then an order-3 page (getting us to offset 0
> modulo 32), then an order-4 page (larger would no longer fit in the 128KB
> readahead window), and finally an order-2 page to finish filling the
> readahead window.
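
That sequence is easy to check in userspace. A quick sketch (mine, not
kernel code; __builtin_ctzl() stands in for the kernel's __ffs()) of the
natural-alignment order selection for a 32-page window starting at
offset 20 modulo 32:

#include <stdio.h>

int main(void)
{
	unsigned long index = 20, end = 20 + 32;	/* window [20, 52) */

	while (index < end) {
		unsigned int order = 5;			/* 32 pages = order 5 */

		/* natural alignment: order limited by low bits of index */
		if (index & ((1UL << order) - 1))
			order = __builtin_ctzl(index);
		/* don't run past the end of the window */
		while (index + (1UL << order) > end)
			order--;
		printf("offset %lu mod 32: order %u\n", index % 32, order);
		index += 1UL << order;
	}
	return 0;	/* prints orders 2, 3, 4, 2 */
}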
>
> Now I'm not 100% sure why the readahead window alignment changed with the
> different rounding when placing the readahead mark - probably that's some
> artifact of the readahead window being tiny in the beginning, before we
> scale it up (I'll verify by tracing whether everything ends up looking
> correct with the current code). So I don't expect this is a problem in
> ab4443fe3ca6 as such, but it exposes the issue that the readahead page
> insertion code should perhaps strive for better alignment of the readahead
> window with the logical file offset, even at the cost of occasionally
> performing somewhat shorter readahead. I'll look into this once I dig out
> of the huge heap of email after my vacation...

I was surprised by what you said here, so I went and re-read the code,
and it doesn't work the way I thought it did. So I had a good long think
about how it _should_ work, looked for some more corner cases, and this
is what I came up with.

The first thing I've done is separate out the two limits. EOF is a hard
limit: we will not allocate pages beyond it. ra->size is a soft limit:
we will allocate pages beyond ra->size, but not too far.
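
To see the difference with some made-up numbers: take a 32-page window
starting at page 30 with EOF in page 100. The soft limit is page 62;
the last folio we allocate may run past it, but no folio may contain a
page past 100. A userspace sketch of the new loop shape:

#include <stdio.h>

int main(void)
{
	unsigned long index = 30, last = 100;	/* last page before EOF */
	unsigned long limit = index + 32;	/* soft limit: 62 */

	while (index < limit) {
		unsigned int order = 5;		/* ilog2(32) */

		if (index & ((1UL << order) - 1))
			order = __builtin_ctzl(index);
		/* hard clamp: never allocate past EOF */
		while (index + (1UL << order) - 1 > last)
			order--;
		printf("folio [%lu, %lu], order %u\n",
		       index, index + (1UL << order) - 1, order);
		index += 1UL << order;
	}
	return 0;	/* last folio is [32, 63]: past 62, within EOF */
}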

The second thing I noticed is that index + ra->size could wrap. So add
a check for that, and set the limit to ULONG_MAX. index + ra->size -
ra->async_size could also wrap, but that is harmless. We certainly don't
want to kick off any more readahead in this circumstance, so leaving
'mark' outside the range [index..ULONG_MAX] is just fine.
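
With made-up numbers: index = ULONG_MAX - 10 and ra->size = 32 gives
index + ra->size == 21, which is less than index, so the new check
resets the limit:

#include <stdio.h>
#include <limits.h>

int main(void)
{
	unsigned long index = ULONG_MAX - 10;
	unsigned long limit = index + 32;	/* wraps to 21 */

	if (limit < index)			/* wrap detected */
		limit = ULONG_MAX;
	printf("limit = %lu\n", limit);		/* ULONG_MAX */
	return 0;
}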

The third thing is that we could allocate a folio which contains a page
at index ULONG_MAX. We don't really want that in the page cache; it
makes filesystems more complicated if they have to check for it, and we
don't allow an order-0 folio at ULONG_MAX, so there's no need for it.
This _should_ already be prohibited by the "Don't allocate pages past
EOF" check, but let's prohibit it explicitly.
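
For example, an order-2 folio at index ULONG_MAX - 3 would contain the
page at ULONG_MAX; index + (1UL << order) wraps to 0, which is exactly
what the new check catches:

#include <stdio.h>
#include <limits.h>

int main(void)
{
	unsigned long index = ULONG_MAX - 3;
	unsigned int order = 2;

	/* Avoid wrap: the folio would contain the page at ULONG_MAX */
	if (index + (1UL << order) == 0)
		order--;
	printf("order = %u\n", order);	/* 1: folio ends at ULONG_MAX - 2 */
	return 0;
}
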
Compile tested only.

diff --git a/mm/readahead.c b/mm/readahead.c
index 130c0e7df99f..742e1f39035b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -488,7 +488,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
 {
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t index = readahead_index(ractl);
-	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+	pgoff_t last = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+	pgoff_t limit = index + ra->size;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
@@ -496,23 +497,26 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	if (!mapping_large_folio_support(mapping) || ra->size < 4)
 		goto fallback;
 
-	limit = min(limit, index + ra->size - 1);
-
 	if (new_order < MAX_PAGECACHE_ORDER) {
 		new_order += 2;
 		new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
 		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
 	}
 
+	if (limit < index)
+		limit = ULONG_MAX;
 	filemap_invalidate_lock_shared(mapping);
-	while (index <= limit) {
+	while (index < limit) {
 		unsigned int order = new_order;
 
 		/* Align with smaller pages if needed */
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
+		/* Avoid wrap */
+		if (index + (1UL << order) == 0)
+			order--;
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (index + (1UL << order) - 1 > last)
 			order--;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)