Message-ID: <8734b2vcgr.fsf@linux.dev>
Date: Fri, 11 Jul 2025 10:55:48 -0700
From: Roman Gushchin <roman.gushchin@...ux.dev>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Shakeel Butt
<shakeel.butt@...ux.dev>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Michal Hocko <mhocko@...nel.org>, David Hildenbrand <david@...hat.com>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon
Johannes Weiner <hannes@...xchg.org> writes:
> On Fri, Jul 11, 2025 at 08:50:44AM -0700, Roman Gushchin wrote:
>> lru_note_cost() records relative cost of incurring io and cpu spent
>> on lru rotations, which is used to balance the pressure on file and
>> anon memory. The applied pressure is inversely proportional to the
>> recorded cost of reclaiming, but only within 2/3 of the range
>> (swappiness aside).
>>
>> This is useful when both anon and file memory is reclaimable, however
>> in many cases it's not the case: e.g. there might be no swap,
>> proactive reclaim can target anon memory specifically,
>> the memory pressure can come from cgroup v1's memsw limit, etc.
>> In all these cases recording the cost will only bias all following
>> reclaim, also potentially outside of the scope of the original memcg.
>>
>> So it's better to not record the cost if it comes from the initially
>> biased reclaim.
>>
>> lru_note_cost() is a relatively expensive function, which traverses
>> the memcg tree up to the root and takes the lruvec lock on each level.
>> Overall it's responsible for about 50% of cycles spent on lruvec lock,
>> which might be a non-trivial number overall under heavy memory
>> pressure. So optimizing out a large number of lru_note_cost() calls
>> is also beneficial from the performance perspective.
>
> Does it actually help? It's under elevated pressure, when lru locks
> are the most contended, that we also usually scan both lists.
It does, but it's mostly about the !swap and memsw-limit cases.
It's also mostly pronounced during high memory pressure,
when there are many rotations.
And it's pretty significant in our case: I'm actually trying to address
a production regression caused by commit 0538a82c39e9 ("mm: vmscan: make
rotations a secondary factor in balancing anon vs file"), which
added another lru_note_cost() call.
> The caveat with this patch is that, aside from the static noswap
> scenario, modes can switch back and forth abruptly or even overlap.
>
> So if you leave a pressure scenario and go back to cache trimming, you
> will no longer age the cost information anymore. The next spike could
> be starting out with potentially quite stale information.
>
> Or say proactive reclaim recently already targeted anon, and there
> were rotations and pageouts; that would be useful data for a reactive
> reclaimer doing work at around the same time, or shortly thereafter.
Agree, but at the same time it's possible to come up with a scenario
where it's not good.
A
/ \
B C memory.max=X
/ \
D E
Let's say we have a cgroup structure like this: if we apply a lot
of proactive anon pressure on E, then the pressure on D coming from
C's limit will be biased towards file without a good reason.
Or, as in my case, if a cgroup has memory.memsw.limit set and is
thrashing, does it make sense to bias the rest of the system
towards anon reclaim? The recorded cost can be really large.
>
> So for everything but the static noswap case, the patch makes me
> nervous. And I'm not sure it actually helps in the cases where it
> would matter the most.
I understand, but do you think it's acceptable with some additional
conditions, e.g. narrowing it down to only very high scanning
priorities? Or to the !sc->may_swap case?
In the end, we have the following code in get_scan_count(), so at
least on priority 0 we ignore all costs anyway.
	if (!sc->priority && swappiness) {
		scan_balance = SCAN_EQUAL;
		goto out;
	}
Wdyt?
>
> It might make more sense to look into the cost (ha) of the cost
> recording itself. Can we turn it into a vmstat item? That would make
> it lockless, would get rstat batching up the cgroup tree etc. This
> doesn't need to be 100% precise and race free after all.
Idk, maybe yes, but rstat flushing was a source of issues as well,
and now it's mostly ratelimited, so I'm concerned that because of that
we'd see sudden changes in reclaim behavior every 2 seconds.