linux-kernel - Re: [PATCH v1 0/3] mm: zswap: global shrinker fix and proactive shrink

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPpodddcGsK=0Xczfuk8usgZ47xeyf4ZjiofdT+ujiyz6V2pFQ@mail.gmail.com>
Date: Thu, 20 Jun 2024 10:03:17 +0900
From: Takero Funaki <flintglass@...il.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Nhat Pham <nphamcs@...il.com>, Johannes Weiner <hannes@...xchg.org>, 
	Chengming Zhou <chengming.zhou@...ux.dev>, Jonathan Corbet <corbet@....net>, 
	Andrew Morton <akpm@...ux-foundation.org>, 
	Domenico Cerasuolo <cerasuolodomenico@...il.com>, linux-mm@...ck.org, linux-doc@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 0/3] mm: zswap: global shrinker fix and proactive shrink

Hello,

Sorry for the late reply. I am currently investigating a
responsiveness issue I found while benchmarking with this series,
possibly related to concurrent zswap writeback and pageouts.

This series cannot be applied until the root cause is identified,
unfortunately. Thank you all for taking the time to review.

The responsiveness issue was confirmed with 6.10-rc2 with all 3
patches applied. Without patch 3, it still happens but is less likely.

When allocating much larger memory than zswap can buffer, and
writeback and rejection by pool_limit_hit happen simultaneously, the
system stops responding. I do not see this freeze when zswap is
disabled or when there is no pool_limit_hit. The proactive shrinking
itself seems to work as expected as long as the writeback and pageout
do not occur simultaneously.

I suspect this issue exists in current code but was not visible
without this series since the global shrinker did not writeback
considerable amount of pages.


2024年6月15日(土) 7:48 Nhat Pham <nphamcs@...il.com>:
>
> BTW, I'm curious. Have you experimented with increasing the pool size?
> That 20% number is plenty for our use cases, but maybe yours need a
> different cap?
>

Probably we can allocate a bit more zswap pool size. But that will
keep more old pages once the pool limit is hit. If we can ensure no
pool limit hits and zero writeback by allocating more memory, I will
try the same amount of zramswap.

> Also, have you experimented with the dynamic zswap shrinker? :) I'm
> actually curious how it works out in the small machine regime, with
> whatever workload you are running.
>

It seems the dynamic shrinker is trying to evict all pages. That does
not fit to my use case that prefer balanced swapin and swapout
performance


2024年6月15日(土) 9:20 Yosry Ahmed <yosryahmed@...gle.com>:
> >
> > 1.
> > The visible issue is that pageout/in operations from active processes
> > are slow when zswap is near its max pool size. This is particularly
> > significant on small memory systems, where total swap usage exceeds
> > what zswap can store. This means that old pages occupy most of the
> > zswap pool space, and recent pages use swap disk directly.
>
> This should be a transient state though, right? Once the shrinker
> kicks in it should writeback the old pages and make space for the hot
> ones. Which takes us to our next point.
>
> >
> > 2.
> > This issue is caused by zswap keeping the pool size near 100%. Since
> > the shrinker fails to shrink the pool to accept_thr_percent and zswap
> > rejects incoming pages, rejection occurs more frequently than it
> > should. The rejected pages are directly written to disk while zswap
> > protects old pages from eviction, leading to slow pageout/in
> > performance for recent pages on the swap disk.
>
> Why is the shrinker failing? IIUC the first two patches fixes two
> cases where the shrinker stumbles upon offline memcgs, or memcgs with
> no zswapped pages. Are these cases common enough in your use case that
> every single time the shrinker runs it hits MAX_RECLAIM_RETRIES before
> putting the zswap usage below accept_thr_percent?
>
> This would be surprising given that we should be restarting the
> shrinker with every swapout attempt until we can accept pages again.
>
> I guess one could construct a malicious case where there are some
> sticky offline memcgs, and all the memcgs that actually have zswap
> pages come after it in the iteration order.
>
> Could you shed more light about this? What does the setup look like?
> How many memcgs there are, how many of them use zswap, and how many
> offline memcgs are you observing?
>

Example from ubuntu 22.04 using zswap:
root@ctl:~# find /sys/fs/cgroup/ -wholename
\*service/memory.zswap.current | xargs grep . | wc
     31      31    2557
root@ctl:~# find /sys/fs/cgroup/ -wholename
\*service/memory.zswap.current | xargs grep ^0 | wc
     11      11     911

This indicates 11 out of 31 services have no pages in zswap. Without
patch 2, shrink_worker() aborts shrinking in the second tree walk,
before evicting about 40 pages from the services. The number varies,
but I think it is common to see a few memcg that has no zswap pages

> I am not saying we shouldn't fix these problems anyway, I am just
> trying to understand how we got into this situation to begin with.
>
> >
> > 3.
> > If the pool size were shrunk proactively, rejection by pool limit hits
> > would be less likely. New incoming pages could be accepted as the pool
> > gains some space in advance, while older pages are written back in the
> > background. zswap would then be filled with recent pages, as expected
> > in the LRU logic.
>
> I suspect if patches 1 and 2 fix your problem, the shrinker invoked
> from reclaim should be doing this sort of "proactive shrinking".
>
> I agree that the current hysteresis around accept_thr_percent is not
> good enough, but I am surprised you are hitting the pool limit if the
> shrinker is being run during reclaim.
>
> >
> > Patch 1 and 2 make the shrinker reduce the pool to accept_thr_percent.
> > Patch 3 makes zswap_store trigger the shrinker before reaching the max
> > pool size. With this series, zswap will prepare some space to reduce
> > the probability of problematic pool_limit_hit situation, thus reducing
> > slow reclaim and the page priority inversion against LRU.
> >
> > 4.
> > Once proactive shrinking reduces the pool size, pageouts complete
> > instantly as long as the space prepared by shrinking can store the
> > direct reclaim. If an admin sees a large pool_limit_hit, lowering
> > accept_threshold_percent will improve active process performance.
>
> I agree that proactive shrinking is preferable to waiting until we hit
> pool limit, then stop taking in pages until the acceptance threshold.
> I am just trying to understand whether such a proactive shrinking
> mechanism will be needed if the reclaim shrinker for zswap is being
> used, how the two would work together.

For my workload, the dynamic shrinker (reclaim shrinker) is disabled.
The proposed global shrinker and the existing dynamic shrinker are
both proactive, but their goals are different.

The global shrinker starts shrinking when the zswap pool exceeds
accept_thr_percent + 1%, then stops when it reaches
accept_thr_percent. Pages below accept_thr_percent are protected from
shrinking.

The dynamic shrinker starts shrinking based on memory pressure
regardless of the zswap pool size, and stops when the LRU size is
reduced to 1/4. Its goal is to wipe out all pages from zswap. It
prefers swapout performance only.

I think the current LRU logic decreases nr_zswap_protected too quickly
for my workload. In zswap_lru_add(), nr_zswap_protected is reduced to
between 1/4 and 1/8 of the LRU size. Although zswap_folio_swapin()
increments nr_zswap_protected when page-ins of evicted pages occur
later, this technically has no effect while reclaim is in progress.

While zswap_store() and zswap_lru_add() are called, the dynamic
shrinker is likely running due to the pressure. The dynamic shrinker
reduces the LRU size to 1/4, and then a few subsequent zswap_store()
calls reduce the protected count to 1/4 of the LRU size. The stored
pages will be reduced to zero through a few shrinker_scan calls.