Message-ID: <khtydy2lpwip3cysfjiw7sa7effaagpx7tcbvgopqhsdb2h3fc@4xveh6pjxuoq>
Date: Mon, 28 Jul 2025 11:16:40 +0200
From: Jan Kara <jack@...e.cz>
To: Roman Gushchin <roman.gushchin@...ux.dev>
Cc: Jan Kara <jack@...e.cz>, Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Liu Shixin <liushixin2@...wei.com>
Subject: Re: [PATCH] mm: consider disabling readahead if there are signs of
thrashing
On Fri 25-07-25 16:25:49, Roman Gushchin wrote:
> Roman Gushchin <roman.gushchin@...ux.dev> writes:
> > Jan Kara <jack@...e.cz> writes:
> >> On Thu 10-07-25 12:52:32, Roman Gushchin wrote:
> >>> We've noticed in production that under a very heavy memory pressure
> >>> the readahead behavior becomes unstable causing spikes in memory
> >>> pressure and CPU contention on zone locks.
> >>>
> >>> The current mmap_miss heuristic considers minor pagefaults as a
> >>> good reason to decrease mmap_miss and conditionally start async
> >>> readahead. This creates a vicious cycle: asynchronous readahead
> >>> loads more pages, which in turn causes more minor pagefaults.
> >>> This problem is especially pronounced when multiple threads of
> >>> an application fault on consecutive pages of an evicted executable,
> >>> aggressively lowering the mmap_miss counter and preventing readahead
> >>> from being disabled.
> >>
> >> I think you're talking about filemap_map_pages() logic of handling
> >> mmap_miss. It would be nice to mention it in the changelog. There's one
> >> thing that doesn't quite make sense to me: When there's memory pressure,
> >> I'd expect the pages to be reclaimed from memory and not just unmapped.
> >> Also given your solution uses !uptodate folios suggests the pages were
> >> actually fully reclaimed, and the problem really is that
> >> filemap_map_pages() treats what is in fact a major page fault (i.e.,
> >> cache miss) as a minor page fault (i.e., cache hit)?
> >>
> >> Actually, now that I've dug deeper, I remembered that based on Liu
> >> Shixin's report
> >> (https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/),
> >> which sounds a lot like what you're reporting, we eventually merged his
> >> fixes (ended up as commits 0fd44ab213bc ("mm/readahead: break read-ahead
> >> loop if filemap_add_folio return -ENOMEM"), 5c46d5319bde ("mm/filemap:
> >> don't decrease mmap_miss when folio has workingset flag")). Did you test a
> >> kernel with these fixes (6.10 or later)? In particular after these fixes
> >> the !folio_test_workingset() check in filemap_map_folio_range() and
> >> filemap_map_order0_folio() should make sure we don't decrease mmap_miss
> >> when faulting in fresh pages. Or was the page in your case evicted so
> >> long ago that the workingset bit is already clear?
> >>
> >> Once we better understand the situation, let me also mention that I have
> >> two patches which I originally proposed to fix Liu's problems. They didn't
> >> quite fix them, so his patches got merged in the end, but the problems
> >> described there are still somewhat valid:
> >
> > Ok, I got a better understanding of the situation now. Basically we have
> > a multi-threaded application which is under very heavy memory pressure.
> > If multiple threads are faulting simultaneously into the same page,
> > do_async_mmap_readahead() can be called multiple times for the same page.
> > This creates negative pressure on the mmap_miss counter, which can't be
> > matched by do_sync_mmap_readahead(), which is called only once
> > for every page. This basically keeps readahead on, despite the heavy
> > memory pressure.
> >
> > The following patch solves the problem, at least in my test scenario.
> > Wdyt?
>
> Actually, a better version is below. We don't have to avoid the actual
> readahead, just not decrease mmap_miss if the page is locked.
>
> --
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0d0369fb5fa1..1756690dd275 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3323,9 +3323,15 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
> if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
> return fpin;
>
> - mmap_miss = READ_ONCE(ra->mmap_miss);
> - if (mmap_miss)
> - WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> +	/* If the folio is locked, we're likely racing against another
> +	 * fault; don't decrease the mmap_miss counter to avoid decrementing
> +	 * it multiple times for the same page and breaking the balance.
> +	 */
> + if (likely(!folio_test_locked(folio))) {
I like this, although even more understandable to me would be to have
if (likely(folio_test_uptodate(folio)))
which should be more or less equivalent for your situation but would better
express whether this is indeed a cache hit or not. But I can live with
either variant.
Honza
> + mmap_miss = READ_ONCE(ra->mmap_miss);
> + if (mmap_miss)
> + WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> + }
>
> if (folio_test_readahead(folio)) {
> fpin = maybe_unlock_mmap_for_io(vmf, fpin);
--
Jan Kara <jack@...e.com>
SUSE Labs, CR