Message-ID: <20250714152247.GB991@cmpxchg.org>
Date: Mon, 14 Jul 2025 11:22:47 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Roman Gushchin <roman.gushchin@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Shakeel Butt <shakeel.butt@...ux.dev>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Michal Hocko <mhocko@...nel.org>,
David Hildenbrand <david@...hat.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon

On Fri, Jul 11, 2025 at 10:55:48AM -0700, Roman Gushchin wrote:
> Johannes Weiner <hannes@...xchg.org> writes:
> > The caveat with this patch is that, aside from the static noswap
> > scenario, modes can switch back and forth abruptly or even overlap.
> >
> > So if you leave a pressure scenario and go back to cache trimming, you
> > will no longer age the cost information. The next spike could
> > be starting out with potentially quite stale information.
> >
> > Or say proactive reclaim recently already targeted anon, and there
> > were rotations and pageouts; that would be useful data for a reactive
> > reclaimer doing work at around the same time, or shortly thereafter.
>
> Agreed, but at the same time it's possible to come up with a scenario
> where it's not good.
>
>       A
>      / \
>     B   C  memory.max=X
>        / \
>       D   E
>
> Let's say we have a cgroup structure like this and we apply a lot
> of proactive anon pressure on E; then the pressure on D from
> C's limit will be biased towards file without a good reason.
No, this is on purpose. D and E are not independent. They're in the
same memory domain, C. So if you want to reclaim C, and a subset of
its anon has already been pressured to resistance, then a larger part
of the reclaim candidates in C will need to come from file.
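
A rough user-space sketch of why anon cost recorded anywhere in C's domain
shifts C-level scanning toward file. The formula is an assumption, loosely
modeled on the cost balancing in get_scan_count(); struct scan_fractions and
balance() are hypothetical names, not kernel API.

```c
/* Hypothetical container for the relative scan weights of the two LRU types. */
struct scan_fractions {
	unsigned long anon;	/* relative weight for anon scanning */
	unsigned long file;	/* relative weight for file scanning */
};

/*
 * Assumed formula: each type's weight is inversely proportional to its
 * share of the recorded cost, scaled by swappiness (0..200 here).
 * The "+ 1" terms only avoid division by zero.
 */
static struct scan_fractions balance(unsigned long anon_cost,
				     unsigned long file_cost,
				     unsigned long swappiness)
{
	struct scan_fractions f;
	unsigned long total = anon_cost + file_cost;
	unsigned long acost = total + anon_cost;
	unsigned long fcost = total + file_cost;

	total = acost + fcost;
	f.anon = swappiness * (total + 1) / (acost + 1);
	f.file = (200 - swappiness) * (total + 1) / (fcost + 1);
	return f;
}
```

With no recorded cost, balance(0, 0, 100) weights both types equally at 100.
After heavy anon cost, balance(1000, 0, 100) gives anon=149 and file=299, so
roughly two thirds of the scan effort in the domain goes to file.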
> Or as in my case, if a cgroup has memory.memsw.limit set and is
> thrashing, does it make sense to bias the rest of the system
> into anon reclaim? The recorded cost can be really large.
>
> >
> > So for everything but the static noswap case, the patch makes me
> > nervous. And I'm not sure it actually helps in the cases where it
> > would matter the most.
>
> I understand, but do you think it's acceptable with some additional
> conditions: e.g. narrow it down to only very high scanning priorities?
> Or !sc.may_swap case?
>
> In the end, we have the following code in get_scan_count(), so at
> least on priority 0 we ignore all costs anyway.
> 	if (!sc->priority && swappiness) {
> 		scan_balance = SCAN_EQUAL;
> 		goto out;
> 	}
>
> Wdyt?
I think relitigating a proven aging mechanism after half a decade in
production is going to be tough and require extensive testing.

If your primary problem is the cost of the locking, I'd focus on that.
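
For reference, a simplified sketch of the aging in question. It assumes both
costs are halved once their sum exceeds a quarter of the LRU size. struct
lru_cost and note_cost() are hypothetical stand-ins, and the locking the real
path takes is left out.

```c
/* Hypothetical, simplified stand-in for the per-lruvec cost fields. */
struct lru_cost {
	unsigned long anon_cost;
	unsigned long file_cost;
};

/*
 * Record new cost for one LRU type, then age both totals.
 * Assumption: aging halves both counters once their sum exceeds
 * lrusize / 4, so recent events outweigh old ones. Note that aging
 * only runs when cost is recorded; skipping the call skips the aging.
 */
static void note_cost(struct lru_cost *c, int file,
		      unsigned long cost, unsigned long lrusize)
{
	if (file)
		c->file_cost += cost;
	else
		c->anon_cost += cost;

	if (c->file_cost + c->anon_cost > lrusize / 4) {
		c->file_cost /= 2;
		c->anon_cost /= 2;
	}
}
```

Because the decay is driven by the recording call itself, a reclaim mode that
stops recording also stops decaying, which is how the information goes stale.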
> > It might make more sense to look into the cost (ha) of the cost
> > recording itself. Can we turn it into a vmstat item? That would make
> > it lockless, would get rstat batching up the cgroup tree etc. This
> > doesn't need to be 100% precise and race free after all.
>
> Idk, maybe yes, but rstat flushing was a source of the issues as well
> and now it's mostly ratelimited, so I'm concerned that because of that
> we'll have sudden changes in the reclaim behavior every 2 seconds.
That's not a new hazard, though. prepare_scan_control() decisions are
already subject to this, as is the lru cost aging itself.
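
To make the tradeoff concrete, here is a minimal single-threaded sketch of a
batched counter. It is not the vmstat or rstat implementation; NR_CPUS, BATCH,
struct batched_cost and the cost_*() helpers are all hypothetical. Writers
touch only their own slot and fold into the shared total once per batch, so
a cheap read can lag until the pending deltas are folded in.

```c
#include <stdatomic.h>

#define NR_CPUS	4	/* hypothetical CPU count for the sketch */
#define BATCH	32	/* hypothetical per-CPU fold threshold */

/* Hypothetical counter: a shared total plus unlocked per-CPU deltas. */
struct batched_cost {
	atomic_long total;	/* folded, globally visible value */
	long delta[NR_CPUS];	/* pending contributions, one slot per CPU */
};

static void cost_init(struct batched_cost *c)
{
	atomic_init(&c->total, 0);
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		c->delta[cpu] = 0;
}

/*
 * Record non-negative cost on one CPU. The shared total is touched only
 * when the local delta reaches BATCH. Assumes each slot is written by
 * its own CPU only.
 */
static void cost_add(struct batched_cost *c, int cpu, long cost)
{
	c->delta[cpu] += cost;
	if (c->delta[cpu] >= BATCH) {
		atomic_fetch_add(&c->total, c->delta[cpu]);
		c->delta[cpu] = 0;
	}
}

/* Cheap read: can lag the true value by up to NR_CPUS * (BATCH - 1). */
static long cost_read_fast(struct batched_cost *c)
{
	return atomic_load(&c->total);
}

/* Precise read: adds pending deltas, analogous to an explicit flush. */
static long cost_read_flushed(struct batched_cost *c)
{
	long sum = atomic_load(&c->total);

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->delta[cpu];
	return sum;
}
```

The gap between cost_read_fast() and cost_read_flushed() is the staleness
being discussed: a reader that only sees flushed values at intervals will
observe the cost move in steps rather than continuously.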