linux-kernel - RE: [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression with 2M folios.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <SA3PR11MB81204A08DD5F8E5C58EB526AC9CC2@SA3PR11MB8120.namprd11.prod.outlook.com>
Date: Fri, 28 Feb 2025 10:00:13 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
	<hannes@...xchg.org>, "nphamcs@...il.com" <nphamcs@...il.com>,
	"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
	"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
	<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
	"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
	"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com"
	<clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
	"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com"
	<surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
	"Feghali, Wajdi K" <wajdi.k.feghali@...el.com>, "Gopal, Vinodh"
	<vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
	<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression
 with 2M folios.

Hi Yosry, Nhat,

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@...ux.dev>
> Sent: Thursday, February 6, 2025 11:15 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; nphamcs@...il.com; chengming.zhou@...ux.dev;
> usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> akpm@...ux-foundation.org; linux-crypto@...r.kernel.org;
> herbert@...dor.apana.org.au; davem@...emloft.net;
> clabbe@...libre.com; ardb@...nel.org; ebiggers@...gle.com;
> surenb@...gle.com; Accardi, Kristen C <kristen.c.accardi@...el.com>;
> Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v6 16/16] mm: zswap: Fix for zstd performance
> regression with 2M folios.
> 
> On Wed, Feb 05, 2025 at 11:21:02PM -0800, Kanchana P Sridhar wrote:
> > With the previous patch that enables support for batch compressions in
> > zswap_compress_folio(), a 6.2% throughput regression was seen with zstd
> and
> > 2M folios, using vm-scalability/usemem.
> >
> > For compressors that don't support batching, this was root-caused to the
> > following zswap_store_folio() structure:
> >
> >  Batched stores:
> >  ---------------
> >  - Allocate all entries,
> >  - Compress all entries,
> >  - Store all entries in xarray/LRU.
> >
> > Hence, the above structure is maintained only for batched stores, and the
> > following structure is implemented for sequential stores of large folio pages,
> > that fixes the zstd regression, while preserving common code paths for
> batched
> > and sequential stores of a folio:
> >
> >  Sequential stores:
> >  ------------------
> >  For each page in folio:
> >   - allocate an entry,
> >   - compress the page,
> >   - store the entry in xarray/LRU.
> >
> > This is submitted as a separate patch only for code review purposes. I will
> > squash this with the previous commit in subsequent versions of this
> > patch-series.
> 
> Could it be the cache locality?

It is possible, since there are more LLC-store-misses in v6, patch 15 (that has
the regression) as compared to patch v6 16 (that fixes the regression).

> 
> I wonder if we should do what Chengming initially suggested and batch
> everything at ZSWAP_MAX_BATCH_SIZE instead. Instead of
> zswap_compress_folio() operating on the entire folio, we can operate on
> batches of size ZSWAP_MAX_BATCH_SIZE, regardless of whether the
> underlying compressor supports batching.
> 
> If we do this, instead of:
> - Allocate all entries
> - Compress all entries
> - Store all entries
> 
> We can do:
>   - For each batch (8 entries)
>   	- Allocate all entries
> 	- Compress all entries
> 	- Store all entries
> 
> This should help unify the code, and I suspect it may also fix the zstd
> regression. We can also skip the entries array allocation and use one on
> the stack.

I tried the above, and additional approaches to unify the code, and all of
them cause a regression in zstd throughput and latency in the usemem 30
processes experiments:

1) Operate on batches of size ZSWAP_MAX_BATCH_SIZE, regardless of
     whether the underlying compressor supports batching. I tried this without
     and with upfront allocation of all entries.
2) Unify post-compression zpool malloc, map handle, memcpy computes
    done in zswap_compress() and zswap_batch_compress() by moving them
    to a procedure and calling this from both these functions. Tried without
    and with inlining this procedure.
3) With the v7 changes to make the acomp_ctx resources' lifetime be from
    pool creation to pool deletion, I wanted to make another functional change in
    acomp_ctx_get_cpu_lock(), as described in the code comments in v7.
    However, I found that making just this change in zswap_decompress() caused
    zstd regression:

     do { 
                   acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
     } while (!acomp_ctx);

		/*
		 * If the CPU onlining code successfully allocates acomp_ctx resources,
		 * it sets acomp_ctx->initialized to true. Until this happens, we have
		 * two options:
		 *
		 * 1. Return NULL and fail all stores on this CPU.
		 * 2. Retry, until onlining has finished allocating resources.
		 *
		 * In theory, option 1 could be more appropriate, because it
		 * allows the calling procedure to decide how it wants to handle
		 * reclaim racing with CPU hotplug. For instance, it might be Ok
		 * for compress to return an error for the backing swap device
		 * to store the folio. Decompress could wait until we get a
		 * valid and locked mutex after onlining has completed. For now,
		 * we go with option 2 because adding a do-while in
		 * zswap_decompress() adds latency for software compressors.
		*/ 

My main learning from these attempts to try alternate approaches to resolve
the zstd regression and to unify common computes into distinct procedures
that zswap_compress() and zswap_batch_compress() will each call, is this:
zstd is extremely sensitive to any and all compute overheads. The above
approaches yield approx.. these zstd latencies with 64K folios quite consistently
(I will summarize the mm-unstable data in v7):

Total throughput (KB/s)              6,909,838       
Average throughput (KB/s)        230,327
elapsed time (sec)                       94.58
sys time (sec)                               2,358.36

The v7 implementation of zswap_store_folios() consistently gives better latency:

Total throughput (KB/s)              6,939,253      
Average throughput (KB/s)        231,308
elapsed time (sec)                       88.64
sys time (sec)                               2,197.54

My main goal in v7 has been to not regress zstd by introducing compute
overheads, while deriving the benefits of batching with request chaining,
while trying to preserve the v6 structure of zswap_store_folio() that aims
to strike a balance between:

1) Common code for batching and sequential.
2) Making that common code "behave" sequentially so that zstd does not
    see a latency regression.

I would appreciate it if you can review the v7 implementation and let me
know if this would be acceptable.

Thanks,
Kanchana