Message-ID: <20240123164819.GB1745986@cmpxchg.org>
Date: Tue, 23 Jan 2024 11:48:19 -0500
From: Johannes Weiner <hannes@...xchg.org>
To: "T.J. Mercier" <tjmercier@...gle.com>
Cc: Michal Hocko <mhocko@...e.com>,
	Roman Gushchin <roman.gushchin@...ux.dev>,
	Shakeel Butt <shakeelb@...gle.com>,
	Muchun Song <muchun.song@...ux.dev>,
	Andrew Morton <akpm@...ux-foundation.org>, android-mm@...gle.com,
	yuzhao@...gle.com, yangyifei03@...ishou.com,
	cgroups@...r.kernel.org, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Revert "mm:vmscan: fix inaccurate reclaim during
 proactive reclaim"

The revert isn't a straightforward solution.

The patch you're reverting fixed conventional reclaim and broke
MGLRU. Your revert fixes MGLRU and breaks conventional reclaim.

On Tue, Jan 23, 2024 at 05:58:05AM -0800, T.J. Mercier wrote:
> They both are able to make progress. The main difference is that a
> single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> pages with MGLRU. Without MGLRU the memcg walk is not aborted
> immediately after nr_to_reclaim is reached, so a single call to
> try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> even when sc->nr_to_reclaim is 32. (I.e., MGLRU overreclaims less.)
> https://lore.kernel.org/lkml/20221201223923.873696-1-yuzhao@google.com/
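
To make the quoted difference concrete, the two walk shapes compare
roughly like this (heavily simplified, not verbatim mm/vmscan.c;
shrink_one_memcg() is a stand-in name, and priority/error handling is
omitted):

	/* Conventional reclaim: the memcg walk covers the whole
	 * subtree and isn't cut short the moment sc->nr_to_reclaim
	 * is met, so one call can reclaim far more than requested. */
	memcg = mem_cgroup_iter(target, NULL, NULL);
	do {
		shrink_one_memcg(memcg, sc);
	} while ((memcg = mem_cgroup_iter(target, memcg, NULL)));

	/* MGLRU: bail out as soon as the target is met, so one call
	 * reclaims only ~sc->nr_to_reclaim (SWAP_CLUSTER_MAX-ish). */
	memcg = mem_cgroup_iter(target, NULL, NULL);
	do {
		shrink_one_memcg(memcg, sc);
		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			break;
	} while ((memcg = mem_cgroup_iter(target, memcg, NULL)));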

Is that a feature or a bug?

 * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
 *    of their max_seq counters ensures the eventual fairness to all eligible
 *    memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().

If it bails out exactly after nr_to_reclaim, it'll overreclaim
less. But with steady reclaim in a complex subtree, it will always hit
the first cgroup returned by mem_cgroup_iter() and then bail. This
seems like a fairness issue.
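
A toy userspace model of that concern (not kernel code; it just bakes
in the behavior described above, where every cycle restarts the walk
at the same first cgroup and bails once the batch target is met):

	#include <stdio.h>

	#define NR_MEMCGS	3
	#define BATCH		32	/* SWAP_CLUSTER_MAX-ish */
	#define CYCLES		100

	int main(void)
	{
		long pages[NR_MEMCGS] = { 100000, 100000, 100000 };
		long reclaimed[NR_MEMCGS] = { 0 };

		for (int cycle = 0; cycle < CYCLES; cycle++) {
			long want = BATCH;

			/* fresh walk, always from memcg 0, with an
			 * early bail once the batch is satisfied */
			for (int i = 0; i < NR_MEMCGS && want > 0; i++) {
				long take = pages[i] < want ? pages[i] : want;

				pages[i] -= take;
				reclaimed[i] += take;
				want -= take;
			}
		}

		for (int i = 0; i < NR_MEMCGS; i++)
			printf("memcg %d: reclaimed %ld\n", i, reclaimed[i]);
		return 0;
	}

Over 100 cycles, all 3200 reclaimed pages come out of the first memcg;
its siblings see no pressure at all.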

We should figure out the right method for balancing fairness against
overreclaim, regardless of the reclaim implementation. Having two
different approaches, and reverting dependent changes back and forth
between them, doesn't make sense.

Using an LRU to rotate through memcgs over multiple reclaim cycles
seems like a good idea. Why is this specific to MGLRU? Shouldn't this
be a generic piece of memcg infrastructure?
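
Purely as a hypothetical shape (none of these names exist in the
tree), the generic piece could be small:

	/* Rotation order of the memcgs under one reclaim root. */
	struct memcg_reclaim_lru {
		struct list_head head;	/* least recently
					 * reclaimed first */
		spinlock_t lock;
	};

	/* Pick the least recently reclaimed memcg under the root... */
	struct mem_cgroup *memcg_lru_first(struct memcg_reclaim_lru *lru);

	/* ...and rotate it to the tail afterwards, so bailing early
	 * on one reclaim cycle doesn't starve the other memcgs. */
	void memcg_lru_rotate(struct memcg_reclaim_lru *lru,
			      struct mem_cgroup *memcg);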

Then there is the question of why there is an LRU for global reclaim,
but not for subtree reclaim. Reclaiming a container with multiple
subtrees would benefit from the fairness provided by a container-level
LRU order just as much; having fairness for root but not for subtrees
would produce different reclaim and pressure behavior, and could cause
regressions when moving a service from bare metal into a container.

Figuring out these differences and converging on a method for cgroup
fairness would be the better way of fixing this. Because of the
regression risk to the default reclaim implementation, I'm inclined to
NAK this revert.
