Date: Tue, 11 Jun 2024 08:51:02 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Takero Funaki <flintglass@...il.com>
Cc: Yosry Ahmed <yosryahmed@...gle.com>, Johannes Weiner <hannes@...xchg.org>, 
	Chengming Zhou <chengming.zhou@...ux.dev>, Jonathan Corbet <corbet@....net>, 
	Andrew Morton <akpm@...ux-foundation.org>, 
	Domenico Cerasuolo <cerasuolodomenico@...il.com>, linux-mm@...ck.org, linux-doc@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 2/3] mm: zswap: fix global shrinker error handling logic

On Tue, Jun 11, 2024 at 8:21 AM Takero Funaki <flintglass@...il.com> wrote:
>
>
> Since shrink_worker evicts only one page per tree walk when there is
> only one memcg using zswap, I believe this is the intended behavior.

I don't think this is the intended behavior :) It's a holdover from
the old zswap reclaiming behaviors.

1. In the past, we used to shrink one object per shrink worker call.
This is crazy.

2. We then moved the LRU from the allocator level to the zswap level, and
shrank one object at a time until the pool could accept new pages (i.e.
under the acceptance threshold).

3. When we separated the LRU into per-(memcg, node) lists, we kept the
shrink-one-at-a-time part, but did it round-robin style on each
(memcg, node) combination.

It's time to optimize this. 4th time's the charm!

> Even if we choose to break the loop more aggressively, it would only
> be postponing the problem because pool_limit_hit will trigger the
> worker again.
>
> I agree the existing approach is inefficient. It might be better to
> change the one-page-per-iteration part of the round-robin strategy.

We can play with a bigger batch.

1. The most straightforward idea is to just use a bigger constant (32? 64? 128?).

2. We can try to shrink each memcg until the pool can accept new pages
again, hoping that the round-robin selection maintains fairness in the
long run - but this can be a bad idea in the short run for the selected
memcg. At the very least, this should try to respect the protected area
of each lruvec. This might still conflict with the zswap shrinker,
though (since the protection is best-effort).

3. Proportional reclaim - a variant of what we're doing in
get_scan_count() for page reclaim?

scan = lruvec_size - lruvec_size * protection / (cgroup_size + 1);

protection is derived from memory.min or memory.low of the cgroup, and
cgroup_size is the memory usage of the cgroup. For lruvec_size, maybe we
can substitute the number of (reclaimable/unprotected?) zswap objects
in the (node, memcg) LRU?
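
To make the arithmetic concrete, here is a quick illustrative sketch
(userspace C, not kernel code) of what that calculation could look like
for zswap. The names lru_size, protection, cgroup_size and the
SHRINK_BATCH cap are just placeholders for this example, not existing
zswap symbols; the cap models idea 1 above so that a single walk stays
bounded.

/*
 * Illustrative sketch only: models the proportional scan-target
 * arithmetic from get_scan_count(), fed with zswap-flavoured inputs.
 */
#include <stdio.h>

#define SHRINK_BATCH 64		/* hypothetical per-walk cap (idea 1) */

static unsigned long scan_target(unsigned long lru_size,
				 unsigned long protection,
				 unsigned long cgroup_size)
{
	/* scan = lru_size - lru_size * protection / (cgroup_size + 1) */
	unsigned long scan = lru_size -
		lru_size * protection / (cgroup_size + 1);

	/* Clamp to a fixed batch so one tree walk stays bounded. */
	return scan > SHRINK_BATCH ? SHRINK_BATCH : scan;
}

int main(void)
{
	/* 1000 zswap objects on the LRU, half of the cgroup protected */
	printf("%lu\n", scan_target(1000, 512, 1024));	/* 501, capped to 64 */
	/* small LRU, mostly protected cgroup */
	printf("%lu\n", scan_target(100, 900, 1024));	/* 13 */
	return 0;
}

The second case shows the property we want: a heavily protected memcg
contributes only a small scan target instead of being shrunk one page
at a time forever (or shrunk-until-accept in one go).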
