[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240730033929.GB2866591@cmpxchg.org>
Date: Mon, 29 Jul 2024 23:39:29 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Nhat Pham <nphamcs@...il.com>, akpm@...ux-foundation.org,
shakeelb@...gle.com, linux-mm@...ck.org, kernel-team@...a.com,
linux-kernel@...r.kernel.org, flintglass@...il.com
Subject: Re: [PATCH 1/2] zswap: implement a second chance algorithm for
dynamic zswap shrinker
On Fri, Jul 26, 2024 at 02:58:14PM -0700, Yosry Ahmed wrote:
> On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@...il.com> wrote:
> >
> > Current zswap shrinker's heursitics to prevent overshrinking is brittle
> > and inaccurate, specifically in the way we decay the protection size
> > (i.e making pages in the zswap LRU eligible for reclaim).
>
> Thanks for working on this and experimenting with different
> heuristics. I was not a huge fan of these, so I am glad we are trying
> to replace them with something more intuitive.
>
> >
> > We currently decay protection aggressively in zswap_lru_add() calls.
> > This leads to the following unfortunate effect: when a new batch of
> > pages enter zswap, the protection size rapidly decays to below 25% of
> > the zswap LRU size, which is way too low.
> >
> > We have observed this effect in production, when experimenting with the
> > zswap shrinker: the rate of shrinking shoots up massively right after a
> > new batch of zswap stores. This is somewhat the opposite of what we want
> > originally - when new pages enter zswap, we want to protect both these
> > new pages AND the pages that are already protected in the zswap LRU.
> >
> > Replace existing heuristics with a second chance algorithm
> >
> > 1. When a new zswap entry is stored in the zswap pool, its reference bit
> > is set.
> > 2. When the zswap shrinker encounters a zswap entry with the reference
> > bit set, give it a second chance - only flips the reference bit and
> > rotate it in the LRU.
> > 3. If the shrinker encounters the entry again, this time with its
> > reference bit unset, then it can reclaim the entry.
>
> At the first look, this is similar to the reclaim algorithm. A
> fundamental difference here is that the reference bit is only set
> once, when the entry is created. It is different from the conventional
> second chance page reclaim/replacement algorithm.
>
> What this really does, is that it slows down writeback by enforcing
> that we need to iterate entries exactly twice before we write them
> back. This sounds a little arbitrary and not very intuitive to me.
This isn't different than other second chance algorithms. Those
usually set the reference bit again to buy the entry another round. In
our case, another reference causes a zswapin, which removes the entry
from the list - buying it another round. Entries will get reclaimed
once the scan rate catches up with the longest reuse distance.
The main goal, which was also the goal of the protection math, is to
slow down writebacks in proportion to new entries showing up. This
gives zswap a chance to solve memory pressure through compression. If
memory pressure persists, writeback should pick up.
If no new entries were to show up, then sure, this would be busy
work. In practice, new entries do show up at a varying rate. This all
happens in parallel to anon reclaim, after all. The key here is that
new entries will be interleaved with rotated entries, and they consume
scan work! This is what results in the proportional slowdown.
> Taking a step back, what we really want is to writeback zswap entries
> in order, and avoid writing back more entries than needed. I think the
> key here is "when needed", which is defined by how much memory
> pressure we have. The shrinker framework should already be taking this
> into account.
>
> Looking at do_shrink_slab(), in the case of zswap (seek = 2),
> total_scan should boil down to:
>
> total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
>
> , and this is bounded by zswap_shrinker_count() * 2.
>
> Ignoring nr_deferred, we start by scanning 1/2048th of
> zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 *
> zswap_shrinker_count() at zero priority (before OOMs). At
> NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
>
> Keep in mind that zswap_shrinker_count() does not return the number of
> all zswap entries, it subtracts the protected part (or recent swapins)
> and scales by the compression ratio. So this looks reasonable at first
> sight, perhaps we want to tune the seek to slow down writeback if we
> think it's too much, but that doesn't explain the scenario you are
> describing.
>
> Now let's factor in nr_deferred, which looks to me like it could be
> the culprit here. I am assuming the intention is that if we counted
> freeable slab objects before but didn't get to free them, we should do
> it the next time around. This feels like it assumes that the objects
> will remain there unless reclaimed by the shrinker. This does not
> apply for zswap, because the objects can be swapped in.
Hm.
_count() returns (objects - protected) * compression_rate, then the
shrinker does the >> priority dance. So to_scan is expected to be a
small portion of unprotected objects.
_scan() bails if to_scan > (objects - protected).
How often does this actually abort in practice?
> Also, in the beginning, before we encounter too many swapins, the
> protection will be very low, so zswap_shrinker_count() will return a
> relatively high value. Even if we don't scan and writeback this
> amount, we will keep carrying this value forward in next reclaim
> operations, even if the number of existing zswap entries have
> decreased due to swapins.
>
> Could this be the problem? The number of deferred objects to be
> scanned just keeps going forward as a high value, essentially
> rendering the heuristics in zswap_shrinker_count() useless?
>
> If we just need to slow down writeback by making sure we scan entries
> twice, could something similar be achieved just by tuning the seek
> without needing any heuristics to begin with?
Seek is a fixed coefficient for the scan rate.
We want to slow writeback when recent zswapouts dominate the zswap
pool (expanding or thrashing), and speed it up when recent entries
make up a small share of the pool (stagnating).
This is what the second chance accomplishes.
Powered by blists - more mailing lists