Message-ID: <CAKEwX=PrLaJU2py+nqkSObBx8kafdbNYn0GZVLPkSixDAEb1GA@mail.gmail.com>
Date: Tue, 24 Oct 2023 10:25:36 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: akpm@...ux-foundation.org, cerasuolodomenico@...il.com,
yosryahmed@...gle.com, sjenning@...hat.com, ddstreet@...e.org,
vitaly.wool@...sulko.com, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeelb@...gle.com,
muchun.song@...ux.dev, linux-mm@...ck.org, kernel-team@...a.com,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: [PATCH] zswap: export more zswap store failure stats
On Tue, Oct 24, 2023 at 9:09 AM Johannes Weiner <hannes@...xchg.org> wrote:
>
> On Mon, Oct 23, 2023 at 05:07:02PM -0700, Nhat Pham wrote:
> > Since:
> >
> > "42c06a0e8ebe mm: kill frontswap"
> >
> > we no longer have a counter to track the number of zswap store
> > failures. This makes it hard to investigate and monitor for zswap
> > issues.
> >
> > This patch adds a global and a per-cgroup zswap store failure counter,
> > as well as a dedicated debugfs counter for compression algorithm failures
> > (which can happen, e.g., when random data is passed to zswap).
> >
> > Signed-off-by: Nhat Pham <nphamcs@...il.com>
>
> I agree this is an issue.
>
> > ---
> >  include/linux/vm_event_item.h |  1 +
> >  mm/memcontrol.c               |  1 +
> >  mm/vmstat.c                   |  1 +
> >  mm/zswap.c                    | 18 ++++++++++++++----
> >  4 files changed, 17 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index 8abfa1240040..7b2b117b193d 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -145,6 +145,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  #ifdef CONFIG_ZSWAP
> >  		ZSWPIN,
> >  		ZSWPOUT,
> > +		ZSWPOUT_FAIL,
>
> Would the writeback stat be sufficient to determine this?
>
> Hear me out. We already have pswpout that shows when we're hitting
> disk swap. Right now we can't tell if this is because of a rejection
> or because of writeback. With a writeback counter we could.
Oh I see! It's a bit of an extra step, but I suppose (pswpout - writeback)
could give us the number of zswap store failures.
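Just to make that derivation concrete, a rough userspace sketch (this
assumes the upcoming writeback counter gets exported in /proc/vmstat
under a name like "zswpwb" - the name is my guess, nothing settled):

/*
 * Sketch only: estimate zswap store failures as pswpout minus zswap
 * writebacks, per the suggestion above. "zswpwb" is an assumed name
 * for the proposed writeback counter.
 */
#include <stdio.h>
#include <string.h>

static long long vmstat_read(const char *name)
{
	char key[64];
	long long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fscanf(f, "%63s %lld", key, &val) == 2) {
		if (!strcmp(key, name)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return -1;
}

int main(void)
{
	long long pswpout = vmstat_read("pswpout");
	long long zswpwb = vmstat_read("zswpwb");	/* assumed counter name */

	if (pswpout < 0 || zswpwb < 0)
		return 1;
	printf("estimated zswap store failures: %lld\n", pswpout - zswpwb);
	return 0;
}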
>
> And I think we want the writeback counter anyway going forward in
> order to monitor and understand the dynamic shrinker's performance.
Domenico and I were talking about this, and we both agree the writeback
counter is absolutely necessary - if anything, to make sure that the
shrinker is neither a) doing no work at all nor b) going overboard.
So it is coming as part of the shrinker series regardless of this patch.
I just didn't realize that it solves the issue we're having here too!
>
> Either way we go, one of the metrics needs to be derived from the
> other(s). But I think subtle and not so subtle shrinker issues are
> more concerning than outright configuration problems where zswap
> doesn't work at all. The latter is easier to catch before or during
> early deployment with simple functionality tests.
>
> Plus, rejections should be rare. They are now, and they should become
> even more rare or cease to exist going forward. Because every time
> they happen at scale, they represent problematic LRU inversions. We
> have patched, have pending patches, or discussed changes to reduce
> every single one of them:
>
> /* Store failed due to a reclaim failure after pool limit was reached */
> static u64 zswap_reject_reclaim_fail;
>
> With the shrinker this becomes less relevant. There was also the
> proposal to disable the bypass to swap and just keep the page.
The shrinker and that proposal sound like good ideas ;)
>
> /* Compressed page was too big for the allocator to (optimally) store */
> static u64 zswap_reject_compress_poor;
>
> You were working on eradicating this (with zsmalloc at least).
>
> /* Store failed because underlying allocator could not get memory */
> static u64 zswap_reject_alloc_fail;
> /* Store failed because the entry metadata could not be allocated (rare) */
> static u64 zswap_reject_kmemcache_fail;
>
> These shouldn't happen at all due to PF_MEMALLOC.
>
> IOW, the fail counter is expected to stay zero in healthy,
> well-configured systems. Rather than an MM event that needs counting,
> this strikes me as something that could be a WARN down the line...
>
Yup, I agree that it should (mostly) stay at 0. A non-zero value (especially
at a high ratio relative to the total number of zswap stores) is an indication
that something is wrong - a bug, a misconfiguration, or a very
ill-compressible workload (or, again, a bug in the compression algorithm).
A WARN might be good too, but if it's just an ill-compressible workload,
that might mean too many WARNs :)
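If we do eventually go the WARN route, something ratelimited could keep
the log sane in that case. Purely illustrative sketch (helper and names
made up by me, not from any posted patch):

#include <linux/ratelimit.h>

/*
 * Illustrative only: warn, but ratelimited, on rejection paths that are
 * expected to stay at zero on a healthy system, so that a pathological
 * workload cannot flood the log with backtraces.
 */
static inline void zswap_warn_reject(u64 *stat, const char *reason)
{
	(*stat)++;
	WARN_RATELIMIT(1, "zswap: unexpected store rejection (%s)\n", reason);
}

e.g. zswap_warn_reject(&zswap_reject_kmemcache_fail, "entry alloc failed");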
But we can always just monitor pswpout - writeback (both globally
and on a per-cgroup basis, I assume?).
> I agree with adding the debugfs counter though.
Then I'll send a new patch that focuses on the debugfs counter
(for compression failures).
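Roughly along these lines - just a sketch of the direction; the counter
name, the hook placement, and the surrounding error handling (comp_ret,
the goto label) are placeholders to be worked out in the actual patch:

/* Store failed because the compression algorithm itself returned an error */
static u64 zswap_reject_compress_fail;

	/* bumped on the compression error path in zswap_store() */
	if (comp_ret) {
		zswap_reject_compress_fail++;
		goto put_dstmem;	/* placeholder label */
	}

	/* and exposed next to the existing counters in zswap_debugfs_init() */
	debugfs_create_u64("reject_compress_fail", 0444,
			   zswap_debugfs_root, &zswap_reject_compress_fail);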
Thanks for the feedback, Johannes.