Message-ID: <khtydy2lpwip3cysfjiw7sa7effaagpx7tcbvgopqhsdb2h3fc@4xveh6pjxuoq>
Date: Mon, 28 Jul 2025 11:16:40 +0200
From: Jan Kara <jack@...e.cz>
To: Roman Gushchin <roman.gushchin@...ux.dev>
Cc: Jan Kara <jack@...e.cz>, Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Liu Shixin <liushixin2@...wei.com>
Subject: Re: [PATCH] mm: consider disabling readahead if there are signs of
thrashing
On Fri 25-07-25 16:25:49, Roman Gushchin wrote:
> Roman Gushchin <roman.gushchin@...ux.dev> writes:
> > Jan Kara <jack@...e.cz> writes:
> >> On Thu 10-07-25 12:52:32, Roman Gushchin wrote:
> >>> We've noticed in production that under a very heavy memory pressure
> >>> the readahead behavior becomes unstable causing spikes in memory
> >>> pressure and CPU contention on zone locks.
> >>>
> >>> The current mmap_miss heuristic considers minor pagefaults as a
> >>> good reason to decrease mmap_miss and conditionally start async
> >>> readahead. This creates a vicious cycle: asynchronous readahead
> >>> loads more pages, which in turn causes more minor pagefaults.
> >>> This problem is especially pronounced when multiple threads of
> >>> an application fault on consecutive pages of an evicted executable,
> >>> aggressively lowering the mmap_miss counter and preventing readahead
> >>> from being disabled.
> >>
> >> I think you're talking about filemap_map_pages() logic of handling
> >> mmap_miss. It would be nice to mention it in the changelog. There's one
> >> thing that doesn't quite make sense to me: When there's memory pressure,
> >> I'd expect the pages to be reclaimed from memory and not just unmapped.
> >> Also given your solution uses !uptodate folios suggests the pages were
> >> actually fully reclaimed, and the problem really is that
> >> filemap_map_pages() treats what is in fact a major page fault (i.e.,
> >> cache miss) as a minor page fault (i.e., cache hit)?
> >>
> >> Actually, now that I've dug deeper, I remembered that based on Liu
> >> Shixin's report
> >> (https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/),
> >> which sounds a lot like what you're reporting, we eventually merged his
> >> fixes (ended up as commits 0fd44ab213bc ("mm/readahead: break read-ahead
> >> loop if filemap_add_folio return -ENOMEM"), 5c46d5319bde ("mm/filemap:
> >> don't decrease mmap_miss when folio has workingset flag")). Did you test a
> >> kernel with these fixes (6.10 or later)? In particular after these fixes
> >> the !folio_test_workingset() check in filemap_map_folio_range() and
> >> filemap_map_order0_folio() should make sure we don't decrease mmap_miss
> >> when faulting in fresh pages. Or was the page in your case evicted so
> >> long ago that the workingset bit is already clear?
> >>
> >> Once we better understand the situation, let me also mention that I have
> >> two patches which I originally proposed to fix Liu's problems. They didn't
> >> quite fix them, so his patches got merged in the end, but the problems
> >> described there are still somewhat valid:
> >
> > Ok, I got a better understanding of the situation now. Basically we have
> > a multi-threaded application which is under very heavy memory pressure.
> > If multiple threads are faulting simultaneously into the same page,
> > do_async_mmap_readahead() can be called multiple times for the same page.
> > This creates negative pressure on the mmap_miss counter, which can't be
> > matched by do_sync_mmap_readahead(), which is called only once
> > for every page. This basically keeps readahead on, despite the heavy
> > memory pressure.
> >
> > The following patch solves the problem, at least in my test scenario.
> > Wdyt?
>
> Actually, a better version is below. We don't have to avoid the actual
> readahead, just not decrease mmap_miss if the page is locked.
>
> --
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0d0369fb5fa1..1756690dd275 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3323,9 +3323,15 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
> if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
> return fpin;
>
> - mmap_miss = READ_ONCE(ra->mmap_miss);
> - if (mmap_miss)
> - WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> +	/* If the folio is locked, we're likely racing against another
> +	 * fault; don't decrease the mmap_miss counter to avoid decrementing
> +	 * it multiple times for the same page and breaking the balance.
> +	 */
> + if (likely(!folio_test_locked(folio))) {
I like this, although even more understandable to me would be to have
if (likely(folio_test_uptodate(folio)))
which should be more or less equivalent for your situation but would better
express whether this is indeed a cache hit or not. But I can live with
either variant.
Honza
> + mmap_miss = READ_ONCE(ra->mmap_miss);
> + if (mmap_miss)
> + WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> + }
>
> if (folio_test_readahead(folio)) {
> fpin = maybe_unlock_mmap_for_io(vmf, fpin);
--
Jan Kara <jack@...e.com>
SUSE Labs, CR