Message-ID: <PH7PR11MB8121D578B8BAA95B07D6EF8CC98D2@PH7PR11MB8121.namprd11.prod.outlook.com>
Date: Fri, 2 May 2025 15:53:07 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Herbert Xu <herbert@...dor.apana.org.au>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
	<hannes@...xchg.org>, "yosry.ahmed@...ux.dev" <yosry.ahmed@...ux.dev>,
	"nphamcs@...il.com" <nphamcs@...il.com>, "chengming.zhou@...ux.dev"
	<chengming.zhou@...ux.dev>, "usamaarif642@...il.com"
	<usamaarif642@...il.com>, "ryan.roberts@....com" <ryan.roberts@....com>,
	"21cnbao@...il.com" <21cnbao@...il.com>, "ying.huang@...ux.alibaba.com"
	<ying.huang@...ux.alibaba.com>, "akpm@...ux-foundation.org"
	<akpm@...ux-foundation.org>, "linux-crypto@...r.kernel.org"
	<linux-crypto@...r.kernel.org>, "davem@...emloft.net" <davem@...emloft.net>,
	"clabbe@...libre.com" <clabbe@...libre.com>, "ardb@...nel.org"
	<ardb@...nel.org>, "ebiggers@...gle.com" <ebiggers@...gle.com>,
	"surenb@...gle.com" <surenb@...gle.com>, "Accardi, Kristen C"
	<kristen.c.accardi@...el.com>, "Feghali, Wajdi K"
	<wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>,
	"Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate
 batching support in acomp & drivers.


> -----Original Message-----
> From: Herbert Xu <herbert@...dor.apana.org.au>
> Sent: Wednesday, April 30, 2025 6:41 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; yosry.ahmed@...ux.dev; nphamcs@...il.com;
> chengming.zhou@...ux.dev; usamaarif642@...il.com;
> ryan.roberts@....com; 21cnbao@...il.com;
> ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org; linux-
> crypto@...r.kernel.org; davem@...emloft.net; clabbe@...libre.com;
> ardb@...nel.org; ebiggers@...gle.com; surenb@...gle.com; Accardi,
> Kristen C <kristen.c.accardi@...el.com>; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate
> batching support in acomp & drivers.
> 
> On Wed, Apr 30, 2025 at 01:52:56PM -0700, Kanchana P Sridhar wrote:
> >
> > @@ -127,6 +131,22 @@ struct acomp_req {
> >  struct crypto_acomp {
> >  	int (*compress)(struct acomp_req *req);
> >  	int (*decompress)(struct acomp_req *req);
> > +	unsigned int (*get_batch_size)(void);
> > +	bool (*batch_compress)(
> > +		struct acomp_req *reqs[],
> > +		struct page *pages[],
> > +		u8 *dsts[],
> > +		unsigned int dlens[],
> > +		int errors[],
> > +		int nr_reqs);
> > +	bool (*batch_decompress)(
> > +		struct acomp_req *reqs[],
> > +		u8 *srcs[],
> > +		struct page *pages[],
> > +		unsigned int slens[],
> > +		unsigned int dlens[],
> > +		int errors[],
> > +		int nr_reqs);
> 
> I shelved request chaining because allocating one request per page
> is actively harmful to performance.  So we should not add any
> interface that is based on one request per page.

Hi Herbert,

My cover letter presents data I gathered that indicates significant
performance improvements with the crypto_acomp_batch_compress()
interface, which allocates one request per page.

In addition, I would like to share the p50/p99 latency of just the calls
to crypto_acomp_compress() and crypto_acomp_batch_compress(),
gathered using the silesia.tar dataset (http://wanos.co/assets/silesia.tar)
and a simple madvise test that reads the dataset into memory, swaps
out all pages, and then swaps them back in.

This data is on Sapphire Rapids, with the core frequency fixed at
2500 MHz and 64K folios enabled.
The "count" column is the number of calls to the crypto_acomp API.
As expected, the deflate-iaa "count" values in v9 are much lower
because zswap_compress() in v9 uses compression batching, i.e., it
invokes crypto_acomp_batch_compress() with batches of 8 pages
while storing the 64K folios.

     -------------------------------------------------------------------------
     64K folios:    Normalized Per-Page Latency of crypto_acomp
                           calls in zswap_compress() (ns)
     ------------+------------------------------+----------------------------
                 |     mm-unstable-4-21-2025    |              v9
     ------------+------------------------------+----------------------------
                 |   count       p50       p99  |   count      p50       p99
     ------------+------------------------------+----------------------------
     deflate-iaa |  50,459     3,396     3,663  |   6,379      717       774
                 |                              |
     zstd        |  50,631    27,610    33,006  |  50,631   27,253    32,516
     ------------+------------------------------+----------------------------

The data shows no indication that sending one acomp request per
page harms performance, with either deflate-iaa or zstd. We see a
4.7X speedup with IAA using the crypto_acomp_batch_compress()
interface in question.
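Concretely, the v9 batching path in zswap_compress() boils down to something like the sketch below. The per-folio locals and the crypto_acomp_batch_size() wrapper name are illustrative (the patch adds a get_batch_size() op; the exact wrapper naming may differ from v9):

```c
/* Sketch: compress a 64K folio in batches of up to 8 pages. */
unsigned int batch = crypto_acomp_batch_size(acomp_ctx->acomp); /* 8 for IAA */
unsigned int index, i, nr;

for (index = 0; index < nr_pages; index += batch) {
	nr = min(batch, nr_pages - index);

	for (i = 0; i < nr; i++)
		pages[i] = folio_page(folio, index + i);

	/*
	 * One pre-allocated request per page in the batch; the driver
	 * submits all of them to the hardware in parallel and polls
	 * for completions, reporting per-page status in errors[].
	 */
	if (!crypto_acomp_batch_compress(acomp_ctx->reqs, pages, dsts,
					 dlens, errors, nr))
		goto store_failed;

	/* Store each compressed buffer in the zpool, update xarray, ... */
}
```

For non-batching algorithms like zstd, the same loop degenerates to one crypto_acomp_compress() call per page.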

> 
> My plan is to supply a whole folio through acomp_request_set_src_folio
> and mark it as a batch request with a data unit size of 4K, e.g.:
> 
> 	acomp_request_set_src_folio(req, folio, 0, len);
> 	acomp_request_set_data_unit(req, 4096);
> 
> Then the algorithm can dice it up in whatever way it sees fit.  For
> algorithms that don't support batching, the acompress API should dice
> it up and feed it to the algorithm piece-meal.
> 
> IOW the folio loop in zswap_store would be moved into the Crypto API.
> 
> This is contingent on one API change, bringing back NULL dst support
> to acompress.  This way zswap does not need to worry about allocating
> memory that might not even be needed (when pages compress well).
> 
> This won't look like the useless NULL dst we had before which simply
> pre-allocated memory rather than allocating them on demand.
> 
> What acompress should do is allocate one dst page at a time, once that
> is filled up, then allocate one more.  They should be chained up in an
> SG list.  Pages that do not compress can be marked as a zero-length
> entry in the SG list.
> 
> If the allocation fails at any point in time, simply stop the
> batching at that point and return the SG list of what has been
> compressed so far.  After processing the returned pages, zswap
> can then call acompress again with an offset into the folio to
> continue compression.
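For context, the flow proposed in the quoted text above would look roughly like the following on the zswap side. This is a sketch of a proposed interface, not existing code: acomp_request_set_data_unit(), the NULL-dst on-demand allocation, and the returned SG-list semantics are all as described in the quote, not in the current tree.

```c
/* Sketch of the proposed folio-based batching flow (proposed API). */
acomp_request_set_src_folio(req, folio, offset, len);
acomp_request_set_data_unit(req, 4096);
/* No dst provided: acompress allocates dst pages on demand (proposed). */

err = crypto_acomp_compress(req);

/*
 * On return, the dst is an SG list of on-demand-allocated pages, one
 * entry per 4K data unit; a zero-length entry marks a page that did
 * not compress.  If a dst allocation fails mid-folio, only the units
 * compressed so far are returned, and zswap would call again with an
 * updated offset into the folio to continue.
 */
```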

I am not sure this would be feasible, because zswap_store()
incrementally does other bookkeeping necessary for mm/swap
consistency as pages get compressed: allocating entries, storing
compressed buffers in the zpool, updating the xarray of swap
offsets stored in zswap, adding each entry to the zswap memcg LRU
list, etc.

As soon as an error is encountered in zswap_compress(),
zswap_store() has to clean up any prior zpool stores for the batch
and delete any swap offsets for the folio from the xarray.
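In outline, the unwind path is something like the following sketch (the locals and the per-entry free helper are illustrative stand-ins for the actual code in mm/zswap.c):

```c
/* Sketch: unwind after a mid-folio compression/store failure. */
store_failed:
	for (i = 0; i < nr_stored; i++) {
		/* Free the zpool object already stored for this page. */
		zpool_free(pool->zpool, entries[i]->handle);
		/* Remove the page's swap offset from the zswap xarray. */
		xa_erase(tree, offset + i);
	}
```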

In my opinion, handing the folio store loop over to crypto would
make this incremental maintenance of mm/swap consistency, with
each page compressed and stored, somewhat messy. However, I would
like to ask the zswap maintainers to weigh in with their insights
on the pros/cons of what you are proposing.

> 
> To prevent pathological cases of zero progress, zswap can provide
> one pre-allocated page to seed the process.  For iaa, it should
> just allocate as many pages as it needs for batching, and if that
> fails, simply fall back to no batching and do things one page at
> a time (or however many pages you manage to allocate).

I'm somewhat concerned about a) allocating memory and b) adding
computes in the zswap_store() path. It is not clear what the
motivating factor for doing so is, especially because the solution
and data presented in v9 indicate only performance upside with
the crypto_acomp_batch_compress() API and its implementation
in iaa_crypto, and modest performance gains with zstd using the
existing crypto_acomp_compress() API to compress one page at a
time in a large folio. Please let me know if I am missing something.

Thanks,
Kanchana

> 
> I'll whip up a quick POC and we can work on top of it.
> 
> Cheers,
> --
> Email: Herbert Xu <herbert@...dor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
