Message-ID: <87jz3vdf9e.fsf@linux.dev>
Date: Fri, 25 Jul 2025 16:25:49 -0700
From: Roman Gushchin <roman.gushchin@...ux.dev>
To: Jan Kara <jack@...e.cz>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Matthew Wilcox
<willy@...radead.org>, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Liu Shixin <liushixin2@...wei.com>
Subject: Re: [PATCH] mm: consider disabling readahead if there are signs of
thrashing
Roman Gushchin <roman.gushchin@...ux.dev> writes:
> Jan Kara <jack@...e.cz> writes:
>
>> On Thu 10-07-25 12:52:32, Roman Gushchin wrote:
>>> We've noticed in production that under a very heavy memory pressure
>>> the readahead behavior becomes unstable causing spikes in memory
>>> pressure and CPU contention on zone locks.
>>>
>>> The current mmap_miss heuristics considers minor pagefaults as a
>>> good reason to decrease mmap_miss and conditionally start async
>>> readahead. This creates a vicious cycle: asynchronous readahead
>>> loads more pages, which in turn causes more minor pagefaults.
>>> This problem is especially pronounced when multiple threads of
>>> an application fault on consecutive pages of an evicted executable,
>>> aggressively lowering the mmap_miss counter and preventing readahead
>>> from being disabled.
>>
>> I think you're talking about filemap_map_pages() logic of handling
>> mmap_miss. It would be nice to mention it in the changelog. There's one
>> thing that doesn't quite make sense to me: When there's memory pressure,
>> I'd expect the pages to be reclaimed from memory and not just unmapped.
>> Also given your solution uses !uptodate folios suggests the pages were
>> actually fully reclaimed and the problem really is that filemap_map_pages()
>> treats as minor page fault (i.e., cache hit) what is in fact a major page
>> fault (i.e., cache miss)?
>>
>> Actually, now that I digged deeper I've remembered that based on Liu
>> Shixin's report
>> (https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/)
>> which sounds a lot like what you're reporting, we have eventually merged his
>> fixes (ended up as commits 0fd44ab213bc ("mm/readahead: break read-ahead
>> loop if filemap_add_folio return -ENOMEM"), 5c46d5319bde ("mm/filemap:
>> don't decrease mmap_miss when folio has workingset flag")). Did you test a
>> kernel with these fixes (6.10 or later)? In particular after these fixes
>> the !folio_test_workingset() check in filemap_map_folio_range() and
>> filemap_map_order0_folio() should make sure we don't decrease mmap_miss
>> when faulting fresh pages. Or was in your case page evicted so long ago
>> that workingset bit is already clear?
>>
>> Once we better understand the situation, let me also mention that I have
>> two patches which I originally proposed to fix Liu's problems. They didn't
>> quite fix them so his patches got merged in the end but the problems
>> described there are still somewhat valid:
>
> Ok, I got a better understanding of the situation now. Basically we have
> a multi-threaded application which is under very heavy memory pressure.
> If multiple threads are faulting simultaneously on the same page,
> do_async_mmap_readahead() can be called multiple times for that page.
> This creates negative pressure on the mmap_miss counter, which can't be
> matched by do_sync_mmap_readahead(), which is called only once
> for every page. This effectively keeps readahead enabled despite the
> heavy memory pressure.
>
> The following patch solves the problem, at least in my test scenario.
> Wdyt?

Actually, a better version is below. We don't have to avoid the actual
readahead, just avoid decreasing mmap_miss if the folio is locked.
--
diff --git a/mm/filemap.c b/mm/filemap.c
index 0d0369fb5fa1..1756690dd275 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3323,9 +3323,15 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
return fpin;
- mmap_miss = READ_ONCE(ra->mmap_miss);
- if (mmap_miss)
- WRITE_ONCE(ra->mmap_miss, --mmap_miss);
+	/* If the folio is locked, we're likely racing with another fault;
+	 * don't decrease the mmap_miss counter, to avoid decreasing it
+	 * multiple times for the same page and breaking the balance.
+	 */
+ if (likely(!folio_test_locked(folio))) {
+ mmap_miss = READ_ONCE(ra->mmap_miss);
+ if (mmap_miss)
+ WRITE_ONCE(ra->mmap_miss, --mmap_miss);
+ }
if (folio_test_readahead(folio)) {
fpin = maybe_unlock_mmap_for_io(vmf, fpin);