Message-ID: <CAKEwX=PrLaJU2py+nqkSObBx8kafdbNYn0GZVLPkSixDAEb1GA@mail.gmail.com>
Date: Tue, 24 Oct 2023 10:25:36 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: akpm@...ux-foundation.org, cerasuolodomenico@...il.com,
yosryahmed@...gle.com, sjenning@...hat.com, ddstreet@...e.org,
vitaly.wool@...sulko.com, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeelb@...gle.com,
muchun.song@...ux.dev, linux-mm@...ck.org, kernel-team@...a.com,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: [PATCH] zswap: export more zswap store failure stats
On Tue, Oct 24, 2023 at 9:09 AM Johannes Weiner <hannes@...xchg.org> wrote:
>
> On Mon, Oct 23, 2023 at 05:07:02PM -0700, Nhat Pham wrote:
> > Since:
> >
> > "42c06a0e8ebe mm: kill frontswap"
> >
> > we no longer have a counter to track the number of zswap store
> > failures. This makes it hard to investigate and monitor for zswap
> > issues.
> >
> > This patch adds a global and a per-cgroup zswap store failure counter,
> > as well as a dedicated debugfs counter for compression algorithm failures
> > (which can happen, e.g., when random data is passed to zswap).
> >
> > Signed-off-by: Nhat Pham <nphamcs@...il.com>
>
> I agree this is an issue.
>
> > ---
> >  include/linux/vm_event_item.h |  1 +
> >  mm/memcontrol.c               |  1 +
> >  mm/vmstat.c                   |  1 +
> >  mm/zswap.c                    | 18 ++++++++++++++----
> >  4 files changed, 17 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index 8abfa1240040..7b2b117b193d 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -145,6 +145,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  #ifdef CONFIG_ZSWAP
> >  		ZSWPIN,
> >  		ZSWPOUT,
> > +		ZSWPOUT_FAIL,
>
> Would the writeback stat be sufficient to determine this?
>
> Hear me out. We already have pswpout that shows when we're hitting
> disk swap. Right now we can't tell if this is because of a rejection
> or because of writeback. With a writeback counter we could.
Oh I see! It's a bit of an extra step, but I suppose (pswpout - writeback)
could give us the number of zswap store failures.
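Just to make that derivation concrete, a rough userspace sketch (this
assumes the upcoming writeback counter gets exported in /proc/vmstat
under a name like "zswpwb" - the name is my guess, nothing settled):

/*
 * Sketch only: estimate zswap store failures as pswpout minus zswap
 * writebacks, per the suggestion above. "zswpwb" is an assumed name
 * for the proposed writeback counter.
 */
#include <stdio.h>
#include <string.h>

static long long vmstat_read(const char *name)
{
	char key[64];
	long long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fscanf(f, "%63s %lld", key, &val) == 2) {
		if (!strcmp(key, name)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return -1;
}

int main(void)
{
	long long pswpout = vmstat_read("pswpout");
	long long zswpwb = vmstat_read("zswpwb");	/* assumed counter name */

	if (pswpout < 0 || zswpwb < 0)
		return 1;
	printf("estimated zswap store failures: %lld\n", pswpout - zswpwb);
	return 0;
}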
>
> And I think we want the writeback counter anyway going forward in
> order to monitor and understand the dynamic shrinker's performance.
Domenico and I were talking about this, and we both agree the writeback
counter is absolutely necessary - if anything, to make sure that the
shrinker is neither a) doing no work at all nor b) going overboard.
So it is coming as part of the shrinker series regardless of this patch.
I just didn't realize that it solves the issue we're having here too!
>
> Either way we go, one of the metrics needs to be derived from the
> other(s). But I think subtle and not so subtle shrinker issues are
> more concerning than outright configuration problems where zswap
> doesn't work at all. The latter is easier to catch before or during
> early deployment with simple functionality tests.
>
> Plus, rejections should be rare. They are now, and they should become
> even more rare or cease to exist going forward. Because every time
> they happen at scale, they represent problematic LRU inversions. We
> have patched, have pending patches, or discussed changes to reduce
> every single one of them:
>
> /* Store failed due to a reclaim failure after pool limit was reached */
> static u64 zswap_reject_reclaim_fail;
>
> With the shrinker this becomes less relevant. There was also the
> proposal to disable the bypass to swap and just keep the page.
The shrinker and that proposal sound like good ideas ;)
>
> /* Compressed page was too big for the allocator to (optimally) store */
> static u64 zswap_reject_compress_poor;
>
> You were working on eradicating this (with zsmalloc at least).
>
> /* Store failed because underlying allocator could not get memory */
> static u64 zswap_reject_alloc_fail;
> /* Store failed because the entry metadata could not be allocated (rare) */
> static u64 zswap_reject_kmemcache_fail;
>
> These shouldn't happen at all due to PF_MEMALLOC.
>
> IOW, the fail counter is expected to stay zero in healthy,
> well-configured systems. Rather than an MM event that needs counting,
> this strikes me as something that could be a WARN down the line...
>
Yup, I agree that it should (mostly) stay at 0. A non-zero value (especially
at a high ratio relative to the total number of zswap stores) is an indication
that something is wrong - a bug, a misconfiguration, or a very
ill-compressible workload (or, again, a bug in the compression algorithm).
A WARN might be good too, but if it's just an ill-compressible workload,
that might mean too many WARNs :)
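If we do eventually go the WARN route, something ratelimited could keep
the log sane in that case. Purely illustrative sketch (helper and names
made up by me, not from any posted patch):

#include <linux/ratelimit.h>

/*
 * Illustrative only: warn, but ratelimited, on rejection paths that are
 * expected to stay at zero on a healthy system, so that a pathological
 * workload cannot flood the log with backtraces.
 */
static inline void zswap_warn_reject(u64 *stat, const char *reason)
{
	(*stat)++;
	WARN_RATELIMIT(1, "zswap: unexpected store rejection (%s)\n", reason);
}

e.g. zswap_warn_reject(&zswap_reject_kmemcache_fail, "entry alloc failed");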
But we can always just monitor pswpout - writeback (both globally
and on a per-cgroup basis, I assume?).
> I agree with adding the debugfs counter though.
Then I'll send a new patch that focuses on the debugfs counter
(for compression failures).
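Roughly along these lines - just a sketch of the direction; the counter
name, the hook placement, and the surrounding error handling (comp_ret,
the goto label) are placeholders to be worked out in the actual patch:

/* Store failed because the compression algorithm itself returned an error */
static u64 zswap_reject_compress_fail;

	/* bumped on the compression error path in zswap_store() */
	if (comp_ret) {
		zswap_reject_compress_fail++;
		goto put_dstmem;	/* placeholder label */
	}

	/* and exposed next to the existing counters in zswap_debugfs_init() */
	debugfs_create_u64("reject_compress_fail", 0444,
			   zswap_debugfs_root, &zswap_reject_compress_fail);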
Thanks for the feedback, Johannes.