linux-kernel - Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon


Message-ID: <20250714152247.GB991@cmpxchg.org>
Date: Mon, 14 Jul 2025 11:22:47 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Roman Gushchin <roman.gushchin@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	Shakeel Butt <shakeel.butt@...ux.dev>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	Michal Hocko <mhocko@...nel.org>,
	David Hildenbrand <david@...hat.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon

On Fri, Jul 11, 2025 at 10:55:48AM -0700, Roman Gushchin wrote:
> Johannes Weiner <hannes@...xchg.org> writes:
> > The caveat with this patch is that, aside from the static noswap
> > scenario, modes can switch back and forth abruptly or even overlap.
> >
> > So if you leave a pressure scenario and go back to cache trimming, you
> > will no longer age the cost information. The next spike could
> > be starting out with potentially quite stale information.
> >
> > Or say proactive reclaim recently already targeted anon, and there
> > were rotations and pageouts; that would be useful data for a reactive
> > reclaimer doing work at around the same time, or shortly thereafter.
> 
> Agree, but at the same time it's possible to come up with a scenario
> where it's not good.
>   A
>  / \
> B  C  memory.max=X
>   / \
>  D   E
> 
> Let's say we have a cgroup structure like this and we apply a lot
> of proactive anon pressure on E. Then the pressure on D from
> C's limit will be biased towards file without a good reason.

No, this is on purpose. D and E are not independent. They're in the
same memory domain, C. So if you want to reclaim C, and a subset of
its anon has already been pressured to resistance, then a larger part
of the reclaim candidates in C will need to come from file.
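
To illustrate the point about D and E sharing a memory domain, here is
a minimal userspace sketch in C. It assumes that a recorded cost is
charged to the cgroup where it occurred and to every ancestor, so that
reclaim targeting C sees the anon cost accumulated under E. The names
(struct node, note_cost, file_share_percent) and the inverse-cost
weighting are hypothetical simplifications, not the kernel's code or
its exact formula.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-cgroup lruvec: only the cost counters. */
struct node {
	struct node *parent;
	unsigned long anon_cost;
	unsigned long file_cost;
};

/* Charge a reclaim cost to a node and to every ancestor up to the root. */
static void note_cost(struct node *n, int file, unsigned long cost)
{
	assert(n != NULL);
	for (; n != NULL; n = n->parent) {
		if (file)
			n->file_cost += cost;
		else
			n->anon_cost += cost;
	}
}

/*
 * Percentage of scan pressure aimed at file pages when reclaiming on
 * behalf of `target`. Simplified inverse-cost weighting (an assumption
 * of this sketch): the more anon reclaim has cost so far, the larger
 * the share of pressure that goes to file. Equal costs give 50.
 */
static unsigned int file_share_percent(const struct node *target)
{
	unsigned long total = target->anon_cost + target->file_cost;

	return (unsigned int)((target->anon_cost + 1) * 100 / (total + 2));
}
```

With 1000 units of anon cost charged to E, C and A each carry the same
1000 while D and B stay at zero, and file_share_percent() for C rises
from 50 to 99. Reclaim driven by C's limit therefore leans on file for
the whole domain, D included.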

> Or as in my case, if a cgroup has memory.memsw.limit set and is
> thrashing, does it make sense to bias the rest of the system
> into anon reclaim? The recorded cost can be really large.
> 
> >
> > So for everything but the static noswap case, the patch makes me
> > nervous. And I'm not sure it actually helps in the cases where it
> > would matter the most.
> 
> I understand, but do you think it's acceptable with some additional
> conditions: e.g. narrow it down to only very high scanning priorities?
> Or !sc.may_swap case?
> 
> In the end, we have the following code in get_scan_count(), so at
> least on priority 0 we ignore all costs anyway.
>         if (!sc->priority && swappiness) {
>                 scan_balance = SCAN_EQUAL;
>                 goto out;
>         }
> 
> Wdyt?

I think relitigating a proven aging mechanism after half a decade in
production is going to be tough and require extensive testing.

If your primary problem is the cost of the locking, I'd focus on that.

> > It might make more sense to look into the cost (ha) of the cost
> > recording itself. Can we turn it into a vmstat item? That would make
> > it lockless, would get rstat batching up the cgroup tree etc. This
> > doesn't need to be 100% precise and race free after all.
> 
> Idk, maybe yes, but rstat flushing was a source of the issues as well
> and now it's mostly ratelimited, so I'm concerned that because of that
> we'll have sudden changes in the reclaim behavior every 2 seconds.

That's not a new hazard, though. prepare_scan_control() decisions are
already subject to this, as is the lru cost aging itself.
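
The trade-off between lockless, batched recording and stale reads can
be sketched as follows. This is a userspace illustration using C11
atomics, not the vmstat or rstat implementation; struct cost_counter,
struct cost_local, cost_add(), cost_read() and the COST_BATCH threshold
are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define COST_BATCH 64	/* hypothetical flush threshold */

/* Shared total that readers consult; changes only when a batch is folded in. */
struct cost_counter {
	atomic_long total;
};

/* Pending delta owned by a single CPU (here: a single caller), so no lock. */
struct cost_local {
	long pending;
};

/* Accumulate locally and fold into the shared total once the batch fills. */
static void cost_add(struct cost_counter *c, struct cost_local *l, long delta)
{
	assert(c != NULL && l != NULL);
	l->pending += delta;
	if (l->pending >= COST_BATCH) {
		atomic_fetch_add_explicit(&c->total, l->pending,
					  memory_order_relaxed);
		l->pending = 0;
	}
}

/* Readers see only what has been folded in; the result may lag behind. */
static long cost_read(struct cost_counter *c)
{
	return atomic_load_explicit(&c->total, memory_order_relaxed);
}
```

A reader calling cost_read() between folds gets a value that is behind
by at most the pending deltas. That is the imprecision accepted above
in exchange for dropping the lock.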