linux-kernel - Re: [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <mq7vno6v7mrrquya4kogseej4fasfyq574ersgdxdhateho7md@bvmy6y4ccgyz>
Date: Tue, 6 May 2025 13:29:13 +0200
From: Jan Kara <jack@...e.cz>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Jan Kara <jack@...e.cz>, Andrew Morton <akpm@...ux-foundation.org>, 
	"Matthew Wilcox (Oracle)" <willy@...radead.org>, Alexander Viro <viro@...iv.linux.org.uk>, 
	Christian Brauner <brauner@...nel.org>, David Hildenbrand <david@...hat.com>, 
	Dave Chinner <david@...morbit.com>, Catalin Marinas <catalin.marinas@....com>, 
	Will Deacon <will@...nel.org>, Kalesh Singh <kaleshsingh@...gle.com>, Zi Yan <ziy@...dia.com>, 
	linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org, 
	linux-mm@...ck.org
Subject: Re: [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on
 natural boundary

On Tue 06-05-25 10:28:11, Ryan Roberts wrote:
> On 05/05/2025 10:37, Jan Kara wrote:
> > On Mon 05-05-25 11:13:26, Jan Kara wrote:
> >> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> >>> Previously asynchonous readahead would read ra_pages (usually 128K)
> >>> directly after the end of the synchonous readahead and given the
> >>> synchronous readahead portion had no alignment guarantees (beyond page
> >>> boundaries) it is possible (and likely) that the end of the initial 128K
> >>> region would not fall on a natural boundary for the folio size being
> >>> used. Therefore smaller folios were used to align down to the required
> >>> boundary, both at the end of the previous readahead block and at the
> >>> start of the new one.
> >>>
> >>> In the worst cases, this can result in never properly ramping up the
> >>> folio size, and instead getting stuck oscillating between order-0, -1
> >>> and -2 folios. The next readahead will try to use folios whose order is
> >>> +2 bigger than the folio that had the readahead marker. But because of
> >>> the alignment requirements, that folio (the first one in the readahead
> >>> block) can end up being order-0 in some cases.
> >>>
> >>> There will be 2 modifications to solve this issue:
> >>>
> >>> 1) Calculate the readahead size so the end is aligned to a folio
> >>>    boundary. This prevents needing to allocate small folios to align
> >>>    down at the end of the window and fixes the oscillation problem.
> >>>
> >>> 2) Remember the "preferred folio order" in the ra state instead of
> >>>    inferring it from the folio with the readahead marker. This solves
> >>>    the slow ramp up problem (discussed in a subsequent patch).
> >>>
> >>> This patch addresses (1) only. A subsequent patch will address (2).
> >>>
> >>> Worked example:
> >>>
> >>> The following shows the previous pathalogical behaviour when the initial
> >>> synchronous readahead is unaligned. We start reading at page 17 in the
> >>> file and read sequentially from there. I'm showing a dump of the pages
> >>> in the page cache just after we read the first page of the folio with
> >>> the readahead marker.
> > 
> > <snip>
> > 
> >> Looks good. When I was reading this code some time ago, I also felt we
> >> should rather do some rounding instead of creating small folios so thanks
> >> for working on this. Feel free to add:
> >>
> >> Reviewed-by: Jan Kara <jack@...e.cz>
> > 
> > But now I've also remembered why what you do here isn't an obvious win.
> > There are storage devices (mostly RAID arrays) where optimum read size
> > isn't a power of 2. Think for example a RAID-0 device composed from three
> > disks. It will have max_pages something like 384 (512k * 3). Suppose we are
> > on x86 and max_order is 9. Then previously (if we were lucky with
> > alignment) we were alternating between order 7 and order 8 pages in the
> > page cache and do optimally sized IOs od 1536k. 
> 
> Sorry I'm struggling to follow some of this, perhaps my superficial
> understanding of all the readahead subtleties is starting to show...
> 
> How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
> bdi->ra_pages would remain the usual 32 (128K)?

Sorry, I have been probably too brief in my previous message :)
bdi->ra_pages is actually set based on optimal IO size reported by the
hardware (see blk_apply_bdi_limits() and how its callers are filling in
lim->io_opt). The 128K you speak about is just a last-resort value if
hardware doesn't provide one. And some storage devices do report optimal IO
size that is not power of two.

Also note that bdi->ra_pages can be tuned in sysfs and a lot of users
actually do this (usually from their udev rules). We don't have to perform
well when some odd value gets set but you definitely cannot assume
bdi->ra_pages is 128K :).

> In which case, for mmap, won't
> we continue to be limited by ra_pages and will never get beyond order-5? (for
> mmap req_size is always set to ra_pages IIRC, so ractl_max_pages() always just
> returns ra_pages). Or perhaps ra_pages is set to 384 somewhere, but I'm not
> spotting it in the code...
> 
> I guess you are also implicitly teaching me something about how the block layer
> works here too... if there are 2 read requests for an order-7 and order-8, then
> the block layer will merge those to a single read (upto the 384 optimal size?)

Correct. In fact readahead code will already perform this merging when
submitting the IO.

> but if there are 2 reads of order-8 then it won't merge because it would be
> bigger than the optimal size and it won't split the second one at the optimal
> size either? Have I inferred that correctly?

With the code as you modify it, you would round down ra->size from 384 to
256 and submit only one 1MB sized IO (with one order-8 page). And this will
cause regression in read throughput for such devices because they now don't
get buffer large enough to run at full speed.

> > Now you will allocate all
> > folios of order 8 (nice) but reads will be just 1024k and you'll see
> > noticeable drop in read throughput (not nice). Note that this is not just a
> > theoretical example but a real case we have hit when doing performance
> > testing of servers and for which I was tweaking readahead code in the past.
> > 
> > So I think we need to tweak this logic a bit. Perhaps we should round_down
> > end to the minimum alignment dictated by 'order' and maxpages? Like:
> > 
> > 1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
> 
> Sorry I'm staring at this and struggling to understand the "PAGE_SHIFT -
> 1" part?

My bad. It should have been:

1 << min(order, ffs(max_pages) - 1)

> I think what you are suggesting is that the patch becomes something like
> this:
> 
> ---8<---
> +	end = ra->start + ra->size;
> +	aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));

Not quite. ilog2() returns the most significant bit set but we really want
to align to the least significant bit set. So when max_pages is 384, we
want to align to at most order-7 (aligning the end more does not make sense
when you want to do IO 384 pages large). That's why I'm using ffs() and not
ilog2().

> +	if (aligned_end > ra->start)
> +		ra->size -= end - aligned_end;
> +	ra->async_size = ra->size;
> ---8<---
> 
> So if max_pages=384, then aligned_end will be aligned down to a maximum
> of the previous 1MB boundary?

No, it needs to be aligned only to previous 512K boundary because we want
to do IOs 3*512K large.

Hope things are a bit clearer now :)

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR