[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <48341947-dd12-4a89-870d-fb73f5121888@linux.intel.com>
Date: Fri, 7 Nov 2025 18:28:00 +0800
From: Aubrey Li <aubrey.li@...ux.intel.com>
To: Jan Kara <jack@...e.cz>, Andrew Morton <akpm@...ux-foundation.org>
Cc: Matthew Wilcox <willy@...radead.org>, Nanhai Zou <nanhai.zou@...el.com>,
Gang Deng <gang.deng@...el.com>, Tianyou Li <tianyou.li@...el.com>,
Vinicius Gomes <vinicius.gomes@...el.com>,
Tim Chen <tim.c.chen@...ux.intel.com>, Chen Yu <yu.c.chen@...el.com>,
linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Roman Gushchin <roman.gushchin@...ux.dev>
Subject: Re: [PATCH] mm/readahead: Skip fully overlapped range
Really sorry for the late, too. Thunderbird collapsed this thread, but didn't
highlight it as unread, I thought no one response, :(
On 10/17/25 12:21 AM, Jan Kara wrote:
> Sorry for not replying earlier. I wanted make up my mind about this and
> other stuff was keeping preempting me...
>
> On Sat 11-10-25 15:20:42, Andrew Morton wrote:
>> On Tue, 30 Sep 2025 13:35:43 +0800 Aubrey Li <aubrey.li@...ux.intel.com> wrote:
>>
>>> file_ra_state is considered a performance hint, not a critical correctness
>>> field. The race conditions on file's readahead state don't affect the
>>> correctness of file I/O because later the page cache mechanisms ensure data
>>> consistency, it won't cause wrong data to be read. I think that's why we do
>>> not lock file_ra_state today, to avoid performance penalties on this hot path.
>>>
>>> That said, this patch didn't make things worse, and it does take a risk but
>>> brings the rewards of RocksDB's readseq benchmark.
>>
>> So if I may summarize:
>>
>> - you've identifed and addressed an issue with concurrent readahead
>> against an fd
>
> Right but let me also note that the patch modifies only
> force_page_cache_ra() which is a pretty peculiar function. It's used at two
> places:
> 1) When page_cache_sync_ra() decides it isn't worth to do a proper
> readahead and just wants to read that one one.
>
> 2) From POSIX_FADV_WILLNEED - I suppose this is Aubrey's case.
>
> As such it seems to be fixing mostly a "don't do it when it hurts" kind of
> load from the benchmark than a widely used practical case since I'm not
> sure many programs call POSIX_FADV_WILLNEED from many threads in parallel
> for the same range.
>
>> - Jan points out that we don't properly handle concurrent access to a
>> file's ra_state. This is somewhat offtopic, but we should address
>> this sometime anyway. Then we can address the RocksDB issue later.
>>
>> Another practicality: improving a benchmark is nice, but do we have any
>> reasons to believe that this change will improve any real-world
>> workload? If so, which and by how much?
I only have RocksDB on my side, but this isn't a lab case but a real case.
It's an issue reported by a customer. They use this case to stress test the
system under high-concurrency data workloads, it could have business impact.
>
> The problem I had with the patch is that it adds more racy updates & checks
> for the shared ra state so it's kind of difficult to say whether some
> workload will not now more often clobber the ra state resulting in poor
> readahead behavior. Also as I looked into the patch now another objection I
> have is that force_page_cache_ra() previously didn't touch the ra state at
> all, it just read the requested pages. After the patch
> force_page_cache_ra() will destroy the readahead state completely. This is
> definitely something we don't want to do.
This is also something I worried about, so I added two trace points at the
entry and exit of force_page_cache_ra(), and I got all ZEROs.
test-9858 [018] ..... 554.352691: force_page_cache_ra: force_page_cache_ra entry: ra->start = 0, ra->size = 0
test-9858 [018] ..... 554.352695: force_page_cache_ra: force_page_cache_ra exit: ra->start = 0, ra->size = 0
test-9855 [009] ..... 554.352701: force_page_cache_ra: force_page_cache_ra entry: ra->start = 0, ra->size = 0
test-9855 [009] ..... 554.352705: force_page_cache_ra: force_page_cache_ra exit: ra->start = 0, ra->size = 0
I think for this code path, my patch doesn't break anything. Do we have any
other code paths I can check?
Anyway, thanks Andrew and Jan for the detailed feedback and discussion. if
we later plan to make file_ra_state concurrency-safe first, I'd be happy to
help test or rebase this optimization on top of that work.
Thanks,
-Aubrey
Powered by blists - more mailing lists