linux-kernel - Re: [PATCH] zswap: export more zswap store failure stats

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20231024160904.GA1971738@cmpxchg.org>
Date:   Tue, 24 Oct 2023 12:09:04 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Nhat Pham <nphamcs@...il.com>
Cc:     akpm@...ux-foundation.org, cerasuolodomenico@...il.com,
        yosryahmed@...gle.com, sjenning@...hat.com, ddstreet@...e.org,
        vitaly.wool@...sulko.com, mhocko@...nel.org,
        roman.gushchin@...ux.dev, shakeelb@...gle.com,
        muchun.song@...ux.dev, linux-mm@...ck.org, kernel-team@...a.com,
        linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: [PATCH] zswap: export more zswap store failure stats

On Mon, Oct 23, 2023 at 05:07:02PM -0700, Nhat Pham wrote:
> Since:
> 
> "42c06a0e8ebe mm: kill frontswap"
> 
> we no longer have a counter to tracks the number of zswap store
> failures. This makes it hard to investigate and monitor for zswap
> issues.
> 
> This patch adds a global and a per-cgroup zswap store failure counter,
> as well as a dedicated debugfs counter for compression algorithm failure
> (which can happen for e.g when random data are passed to zswap).
> 
> Signed-off-by: Nhat Pham <nphamcs@...il.com>

I agree this is an issue.

> ---
>  include/linux/vm_event_item.h |  1 +
>  mm/memcontrol.c               |  1 +
>  mm/vmstat.c                   |  1 +
>  mm/zswap.c                    | 18 ++++++++++++++----
>  4 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 8abfa1240040..7b2b117b193d 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -145,6 +145,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_ZSWAP
>  		ZSWPIN,
>  		ZSWPOUT,
> +		ZSWPOUT_FAIL,

Would the writeback stat be sufficient to determine this?

Hear me out. We already have pswpout that shows when we're hitting
disk swap. Right now we can't tell if this is because of a rejection
or because of writeback. With a writeback counter we could.

And I think we want the writeback counter anyway going forward in
order to monitor and understand the dynamic shrinker's performance.

Either way we go, one of the metrics needs to be derived from the
other(s). But I think subtle and not so subtle shrinker issues are
more concerning than outright configuration problems where zswap
doesn't work at all. The latter is easier to catch before or during
early deployment with simple functionality tests.

Plus, rejections should be rare. They are now, and they should become
even more rare or cease to exist going forward. Because every time
they happen at scale, they represent problematic LRU inversions.  We
have patched, have pending patches, or discussed changes to reduce
every single one of them:

	/* Store failed due to a reclaim failure after pool limit was reached */
	static u64 zswap_reject_reclaim_fail;

With the shrinker this becomes less relevant. There was also the
proposal to disable the bypass to swap and just keep the page.

	/* Compressed page was too big for the allocator to (optimally) store */
	static u64 zswap_reject_compress_poor;

You were working on eradicating this (with zsmalloc at least).

	/* Store failed because underlying allocator could not get memory */
	static u64 zswap_reject_alloc_fail;
	/* Store failed because the entry metadata could not be allocated (rare) */
	static u64 zswap_reject_kmemcache_fail;

These shouldn't happen at all due to PF_MEMALLOC.

IOW, the fail counter is expected to stay zero in healthy,
well-configured systems. Rather than an MM event that needs counting,
this strikes me as something that could be a WARN down the line...

I agree with adding the debugfs counter though.