Message-ID: <875xffsxj4.fsf@linux.dev>
Date: Fri, 25 Jul 2025 15:42:07 -0700
From: Roman Gushchin <roman.gushchin@...ux.dev>
To: Jan Kara <jack@...e.cz>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Matthew Wilcox
<willy@...radead.org>, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Liu Shixin <liushixin2@...wei.com>
Subject: Re: [PATCH] mm: consider disabling readahead if there are signs of
thrashing

Jan Kara <jack@...e.cz> writes:
> On Thu 10-07-25 12:52:32, Roman Gushchin wrote:
>> We've noticed in production that under a very heavy memory pressure
>> the readahead behavior becomes unstable causing spikes in memory
>> pressure and CPU contention on zone locks.
>>
>> The current mmap_miss heuristic considers minor pagefaults as a
>> good reason to decrease mmap_miss and conditionally start async
>> readahead. This creates a vicious cycle: asynchronous readahead
>> loads more pages, which in turn causes more minor pagefaults.
>> This problem is especially pronounced when multiple threads of
>> an application fault on consecutive pages of an evicted executable,
>> aggressively lowering the mmap_miss counter and preventing readahead
>> from being disabled.
>
> I think you're talking about filemap_map_pages() logic of handling
> mmap_miss. It would be nice to mention it in the changelog. There's one
> thing that doesn't quite make sense to me: When there's memory pressure,
> I'd expect the pages to be reclaimed from memory and not just unmapped.
> Also, given that your solution checks for !uptodate folios, it suggests the
> pages were actually fully reclaimed and the problem really is that
> filemap_map_pages() treats what is in fact a major page fault (i.e., cache
> miss) as a minor page fault (i.e., cache hit)?
>
> Actually, now that I dug deeper I've remembered that based on Liu
> Shixin's report
> (https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/)
> which sounds a lot like what you're reporting, we eventually merged his
> fixes (ended up as commits 0fd44ab213bc ("mm/readahead: break read-ahead
> loop if filemap_add_folio return -ENOMEM"), 5c46d5319bde ("mm/filemap:
> don't decrease mmap_miss when folio has workingset flag")). Did you test a
> kernel with these fixes (6.10 or later)? In particular after these fixes
> the !folio_test_workingset() check in filemap_map_folio_range() and
> filemap_map_order0_folio() should make sure we don't decrease mmap_miss
> when faulting in fresh pages. Or was the page in your case evicted so long
> ago that the workingset bit is already clear?
>
> Once we better understand the situation, let me also mention that I have
> two patches which I originally proposed to fix Liu's problems. They didn't
> quite fix them so his patches got merged in the end but the problems
> described there are still somewhat valid:

Ok, I got a better understanding of the situation now. Basically we have
a multi-threaded application which is under very heavy memory pressure.
If multiple threads are faulting simultaneously on the same page,
do_async_mmap_readahead() can be called multiple times for that page.
This creates a negative pressure on the mmap_miss counter, which can't be
matched by do_sync_mmap_readahead(), which is called only once for every
page. This basically keeps readahead on, despite the heavy memory
pressure.
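
For reference, this is roughly the mmap_miss accounting I'm talking about
(paraphrased from mm/filemap.c, not verbatim): do_sync_mmap_readahead()
bumps the counter once per major fault, while do_async_mmap_readahead()
drops it for every fault that finds the folio in the page cache, so
concurrent faults on the same folio can decrement it several times:

	/* do_sync_mmap_readahead(): one increment per major fault */
	mmap_miss = READ_ONCE(ra->mmap_miss);
	if (mmap_miss < MMAP_LOTSAMISS * 10)
		WRITE_ONCE(ra->mmap_miss, ++mmap_miss);

	/* do_async_mmap_readahead(): one decrement per fault that finds
	 * the folio in the page cache, even if it's still under I/O */
	mmap_miss = READ_ONCE(ra->mmap_miss);
	if (mmap_miss)
		WRITE_ONCE(ra->mmap_miss, --mmap_miss);
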
The following patch solves the problem, at least in my test scenario.
Wdyt?

Thanks!
--
@@ -3323,6 +3323,10 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 	if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
 		return fpin;
 
+	/* We're likely racing against another fault, bail out */
+	if (folio_test_locked(folio) && !folio_test_uptodate(folio))
+		return fpin;
+
 	mmap_miss = READ_ONCE(ra->mmap_miss);
 	if (mmap_miss)
 		WRITE_ONCE(ra->mmap_miss, --mmap_miss);
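
The idea behind the check is that a folio which is present in the page
cache but locked and not yet uptodate is most likely being read in by
another fault that has already done its own mmap_miss accounting, so
decrementing the counter again here would double-count the same access.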