Message-ID: <cbfad787-fcb8-43ce-8fd9-e9495116534d@arm.com>
Date: Tue, 6 May 2025 10:28:11 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: Jan Kara <jack@...e.cz>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, David Hildenbrand
<david@...hat.com>, Dave Chinner <david@...morbit.com>,
Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
Kalesh Singh <kaleshsingh@...gle.com>, Zi Yan <ziy@...dia.com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on
natural boundary
On 05/05/2025 10:37, Jan Kara wrote:
> On Mon 05-05-25 11:13:26, Jan Kara wrote:
>> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
>>> Previously, asynchronous readahead would read ra_pages (usually 128K)
>>> directly after the end of the synchronous readahead, and since the
>>> synchronous readahead portion had no alignment guarantees (beyond page
>>> boundaries) it was possible (and likely) that the end of the initial 128K
>>> region would not fall on a natural boundary for the folio size being
>>> used. Therefore smaller folios were used to align down to the required
>>> boundary, both at the end of the previous readahead block and at the
>>> start of the new one.
>>>
>>> In the worst cases, this can result in never properly ramping up the
>>> folio size, and instead getting stuck oscillating between order-0, -1
>>> and -2 folios. The next readahead will try to use folios whose order is
>>> +2 bigger than the folio that had the readahead marker. But because of
>>> the alignment requirements, that folio (the first one in the readahead
>>> block) can end up being order-0 in some cases.
>>>
>>> There will be 2 modifications to solve this issue:
>>>
>>> 1) Calculate the readahead size so the end is aligned to a folio
>>> boundary. This prevents needing to allocate small folios to align
>>> down at the end of the window and fixes the oscillation problem.
>>>
>>> 2) Remember the "preferred folio order" in the ra state instead of
>>> inferring it from the folio with the readahead marker. This solves
>>> the slow ramp up problem (discussed in a subsequent patch).
>>>
>>> This patch addresses (1) only. A subsequent patch will address (2).
>>>
>>> Worked example:
>>>
>>> The following shows the previous pathological behaviour when the initial
>>> synchronous readahead is unaligned. We start reading at page 17 in the
>>> file and read sequentially from there. I'm showing a dump of the pages
>>> in the page cache just after we read the first page of the folio with
>>> the readahead marker.
>
> <snip>
>
>> Looks good. When I was reading this code some time ago, I also felt we
>> should do some rounding rather than creating small folios, so thanks
>> for working on this. Feel free to add:
>>
>> Reviewed-by: Jan Kara <jack@...e.cz>
>
> But now I've also remembered why what you do here isn't an obvious win.
> There are storage devices (mostly RAID arrays) where the optimum read size
> isn't a power of 2. Think, for example, of a RAID-0 device composed of three
> disks. It will have max_pages of something like 384 (512k * 3). Suppose we are
> on x86 and max_order is 9. Then previously (if we were lucky with
> alignment) we were alternating between order 7 and order 8 pages in the
> page cache and doing optimally sized IOs of 1536k.
Sorry, I'm struggling to follow some of this; perhaps my superficial
understanding of all the readahead subtleties is starting to show...
Where does the 384 figure come from? I'd guess it's bdi->io_pages, with
bdi->ra_pages remaining the usual 32 (128K)? In which case, for mmap, won't
we continue to be limited by ra_pages and never get beyond order-5? (For
mmap, req_size is always set to ra_pages IIRC, so ractl_max_pages() always
just returns ra_pages.) Or perhaps ra_pages is set to 384 somewhere, but I'm
not spotting it in the code...
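
For reference, this is roughly the logic I'm looking at; I'm quoting from
memory, so treat the exact shape (and the parameter name) as an
approximation of ractl_max_pages() rather than gospel:

static unsigned long ractl_max_pages(struct readahead_control *ractl,
		unsigned long req_size)
{
	struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
	unsigned long max_pages = ractl->ra->ra_pages;

	/*
	 * Only exceed ra_pages when the caller asked for more than the
	 * readahead window and the device advertises a larger optimal IO
	 * size.
	 */
	if (req_size > max_pages && bdi->io_pages > max_pages)
		max_pages = min(req_size, bdi->io_pages);
	return max_pages;
}

i.e. unless the caller's request exceeds ra_pages we never even look at
io_pages, which is why I can't see how the mmap path gets to 384.
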
I guess you are also implicitly teaching me something about how the block
layer works here too... If there are two read requests, one order-7 and one
order-8, the block layer will merge them into a single read (up to the 384-page
optimal size?), but if there are two order-8 reads it won't merge them, because
the result would be bigger than the optimal size, and it won't split the second
one at the optimal size either? Have I inferred that correctly?
> Now you will allocate all
> folios of order 8 (nice) but reads will be just 1024k and you'll see a
> noticeable drop in read throughput (not nice). Note that this is not just a
> theoretical example but a real case we have hit when doing performance
> testing of servers and for which I was tweaking readahead code in the past.
>
> So I think we need to tweak this logic a bit. Perhaps we should round_down
> end to the minimum alignment dictated by 'order' and maxpages? Like:
>
> 1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
Sorry, I'm staring at this and struggling to understand the "PAGE_SHIFT - 1" part?
I think what you are suggesting is that the patch becomes something like this:
---8<---
+	end = ra->start + ra->size;
+	aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));
+	if (aligned_end > ra->start)
+		ra->size -= end - aligned_end;
+	ra->async_size = ra->size;
---8<---
So if max_pages=384, then aligned_end would at worst be rounded down to the
previous 1MB boundary?
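
(Spelling out my arithmetic, assuming 4K pages and that I've read ilog2()
correctly:

	max_pages     = 384
	ilog2(384)    = 8		/* 2^8 = 256 <= 384 < 2^9 */
	min(order, 8) <= 8
	1UL << 8      = 256 pages	/* = 1MB */

so the largest granule we would ever round end down to is 256 pages, i.e. at
worst back to the previous 1MB boundary.)
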
Thanks,
Ryan
>
> If you set a badly aligned readahead size manually, you will get small pages
> in the page cache, but that's just you being stupid. In practice,
> hardware-induced readahead sizes need not be powers of 2, but they are *sane* :).
>
> Honza
>
>>> diff --git a/mm/readahead.c b/mm/readahead.c
>>> index 8bb316f5a842..82f9f623f2d7 100644
>>> --- a/mm/readahead.c
>>> +++ b/mm/readahead.c
>>> @@ -625,7 +625,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>>  	unsigned long max_pages;
>>>  	struct file_ra_state *ra = ractl->ra;
>>>  	pgoff_t index = readahead_index(ractl);
>>> -	pgoff_t expected, start;
>>> +	pgoff_t expected, start, end, aligned_end;
>>>  	unsigned int order = folio_order(folio);
>>>
>>> /* no readahead */
>>> @@ -657,7 +657,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>>  		 * the readahead window.
>>>  		 */
>>>  		ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
>>> -		ra->async_size = ra->size;
>>>  		goto readit;
>>>  	}
>>>
>>> @@ -678,9 +677,13 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>>  	ra->size = start - index;	/* old async_size */
>>>  	ra->size += req_count;
>>>  	ra->size = get_next_ra_size(ra, max_pages);
>>> -	ra->async_size = ra->size;
>>>  readit:
>>>  	order += 2;
>>> +	end = ra->start + ra->size;
>>> +	aligned_end = round_down(end, 1UL << order);
>>> +	if (aligned_end > ra->start)
>>> +		ra->size -= end - aligned_end;
>>> +	ra->async_size = ra->size;
>>>  	ractl->_index = ra->start;
>>>  	page_cache_ra_order(ractl, ra, order);
>>>  }
>>> --
>>> 2.43.0
>>>
>> --
>> Jan Kara <jack@...e.com>
>> SUSE Labs, CR