Message-ID: <PH7PR11MB81213C0A34D7DAE5AE7A4239C98AA@PH7PR11MB8121.namprd11.prod.outlook.com>
Date: Fri, 9 May 2025 18:29:32 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Nhat Pham <nphamcs@...il.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "yosry.ahmed@...ux.dev" <yosry.ahmed@...ux.dev>,
"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"senozhatsky@...omium.org" <senozhatsky@...omium.org>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com"
<clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com"
<surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
"Gomes, Vinicius" <vinicius.gomes@...el.com>, "Feghali, Wajdi K"
<wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>,
"Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Subject: RE: [RESEND PATCH v9 00/19] zswap compression batching
> -----Original Message-----
> From: Nhat Pham <nphamcs@...il.com>
> Sent: Thursday, May 8, 2025 2:20 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; yosry.ahmed@...ux.dev; chengming.zhou@...ux.dev;
> usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> senozhatsky@...omium.org; linux-crypto@...r.kernel.org;
> herbert@...dor.apana.org.au; davem@...emloft.net;
> clabbe@...libre.com; ardb@...nel.org; ebiggers@...gle.com;
> surenb@...gle.com; Accardi, Kristen C <kristen.c.accardi@...el.com>;
> Gomes, Vinicius <vinicius.gomes@...el.com>; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [RESEND PATCH v9 00/19] zswap compression batching
>
> On Thu, May 8, 2025 at 1:55 PM Nhat Pham <nphamcs@...il.com> wrote:
> >
> > On Thu, May 8, 2025 at 12:41 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@...el.com> wrote:
> > >
> > >
> > > Compression Batching:
> > > =====================
> > >
> > > This patch-series introduces batch compression of pages in large
> > > folios to improve zswap swapout latency. It preserves the existing
> > > zswap protocols for non-batching software compressors by calling
> > > crypto_acomp sequentially per page in the batch. Additionally, in
> > > support of hardware accelerators that can process a batch as an
> > > integral unit, the patch-series creates generic batching interfaces
> > > in crypto_acomp, and calls the crypto_acomp_batch_compress()
> > > interface in zswap_compress() for compressors that intrinsically
> > > support batching.
> > >
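
A rough sketch of the dispatch described above (illustrative only:
acomp_has_async_batching() and the crypto_acomp_batch_compress()
signature are assumptions here, not copied from the patches):

#include <crypto/acompress.h>
#include <linux/scatterlist.h>
#include <linux/mm.h>

/* Assumed-for-illustration declarations; the series defines the real ones. */
bool acomp_has_async_batching(struct crypto_acomp *tfm);
int crypto_acomp_batch_compress(struct acomp_req *req, struct page **pages,
				void **dsts, unsigned int *dlens, int nr);

static int sketch_compress_batch(struct crypto_acomp *tfm,
				 struct acomp_req *req, struct page **pages,
				 void **dsts, unsigned int *dlens, int nr)
{
	DECLARE_CRYPTO_WAIT(wait);
	struct scatterlist in, out;
	int i, err;

	/* Batching compressor (e.g. IAA): one call for the whole batch. */
	if (acomp_has_async_batching(tfm))
		return crypto_acomp_batch_compress(req, pages, dsts, dlens, nr);

	/* Existing protocol preserved: one crypto_acomp call per page. */
	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				   crypto_req_done, &wait);
	for (i = 0; i < nr; i++) {
		sg_init_table(&in, 1);
		sg_set_page(&in, pages[i], PAGE_SIZE, 0);
		sg_init_one(&out, dsts[i], dlens[i]);
		acomp_request_set_params(req, &in, &out, PAGE_SIZE, dlens[i]);
		err = crypto_wait_req(crypto_acomp_compress(req), &wait);
		if (err)
			return err;
		dlens[i] = req->dlen;
	}
	return 0;
}
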
> > > The patch series provides a proof point by using the Intel Analytics
> > > Accelerator (IAA) for implementing the compress/decompress batching
> > > API using hardware parallelism in the iaa_crypto driver, and another
> > > proof point with a sequential software compressor, zstd.
> >
> > Any plan on doing hardware accelerated/offloaded/parallelized zstd? :)
> >
> > >
> > > SUMMARY:
> > > ========
> > >
> > > The first proof point is to test with IAA using a sequential call
> > > (fully synchronous, compress one page at a time) vs. a batching call
> > > (fully asynchronous, submit a batch to IAA for parallel compression,
> > > then poll for completion statuses).
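
For flavor, the sequential vs. submit-then-poll shapes being compared
look roughly like this (sketch; iaa_submit() and iaa_done() are
hypothetical stand-ins for the driver's actual primitives):

#include <linux/types.h>
#include <linux/errno.h>
#include <asm/processor.h>	/* cpu_relax() */

struct sketch_desc {
	int status;		/* abbreviated completion record */
};

/* Hypothetical primitives, declared only for the sketch. */
void iaa_submit(struct sketch_desc *desc);
bool iaa_done(struct sketch_desc *desc);

/* Sequential: submit one page, wait for it to finish, then the next. */
static int compress_sequential(struct sketch_desc **descs, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		iaa_submit(descs[i]);
		while (!iaa_done(descs[i]))
			cpu_relax();
	}
	return 0;
}

/* Batching: submit everything first, then poll the completion
 * records, so the hardware compresses the batch in parallel. */
static int compress_batched(struct sketch_desc **descs, int n)
{
	int i, err = 0;

	for (i = 0; i < n; i++)
		iaa_submit(descs[i]);

	for (i = 0; i < n; i++) {
		while (!iaa_done(descs[i]))
			cpu_relax();
		if (descs[i]->status)
			err = -EIO;
	}
	return err;
}
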
> > >
> > > The performance testing data, with usemem running 30 processes and a
> > > kernel compilation test using 32 threads, show 67%-77% throughput
> > > gains and 28%-32% sys time reduction (usemem30), and a 2%-3% sys
> > > time reduction (kernel compilation), with zswap_store() of large
> > > folios using IAA compress batching as compared to IAA sequential.
> > >
> > > The second proof point is to make sure that software algorithms such
> > > as zstd do not regress. The data indicates that, for sequential
> > > software algorithms, a performance gain is achieved.
> > >
> > > With the performance optimizations implemented in patches 18 and 19
> > > of v9, zstd usemem30 throughput increases by 1%, along with a 6%-8%
> > > sys time reduction. With kernel compilation using zstd, we get a
> > > 0.4%-3.2% reduction in sys time. These optimizations pertain to
> > > common code paths, removing redundant branches/computes, using
> > > prefetchw() of the zswap entry before it is written, and selectively
> > > annotating branches with likely()/unlikely() compiler directives to
> > > minimize branch mis-prediction penalty. Additionally, using the
> > > batching code for non-batching compressors to sequentially
> > > compress/store batches of up to ZSWAP_MAX_BATCH_SIZE (8) pages seems
> > > to help, most likely due to cache locality of working set structures
> > > such as the array of zswap_entry-s for the batch.
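
The flavor of those micro-optimizations, as a sketch (the entry layout
is abbreviated, not the real struct zswap_entry):

#include <linux/prefetch.h>
#include <linux/compiler.h>
#include <linux/errno.h>

struct entry_sketch {
	unsigned int length;
	unsigned long handle;
};

static int store_batch_sketch(struct entry_sketch **entries,
			      unsigned int *dlens, unsigned long *handles,
			      int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		/* Pull the entry's cacheline in for write before the
		 * stores that populate it. */
		prefetchw(entries[i]);

		/* Allocation failure is the rare path. */
		if (unlikely(!handles[i]))
			return -ENOMEM;

		entries[i]->length = dlens[i];
		entries[i]->handle = handles[i];
	}
	return 0;
}
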
> >
> > Nice!
> >
> > >
> > > Our internal validation of zstd with the batching interface vs. IAA
> > > with the batching interface on Emerald Rapids has shown that IAA
> > > compress/decompress batching gives 21.3% more memory savings as
> > > compared to zstd, for a 5% performance loss as compared to the
> > > baseline without any memory pressure. IAA batching demonstrates more
> > > than 2X the memory savings obtained by zstd at this 95% performance
> > > KPI. The compression ratio with IAA is 2.23, and with zstd 2.96.
> > > Even with this compression ratio deficit for IAA, batching is
> > > extremely
> >
> > I'm confused. How does IAA give more memory savings, while having a
> > worse compression ratio? How do you define memory savings here?
> >
> > > beneficial. As we improve the compression ratio of the IAA accelerator,
> > > we expect to see even better memory savings with IAA as compared to
> > > software compressors.
> > >
> > >
> > > Batching Roadmap:
> > > =================
> > >
> > > 1) Compression batching within large folios (this series).
> > >
> > > 2) Reclaim batching of hybrid folios:
> > >
> > > We can expect to see even more significant performance and
> > > throughput improvements if we use the parallelism offered by IAA to
> > > do reclaim batching of 4K/large folios (really any-order folios),
> > > and use the zswap_store() high-throughput compression pipeline to
> > > batch-compress pages comprising these folios, not just batching
> > > within large folios. This is the reclaim batching patch 13 in v1,
> > > which we expect to submit in a separate patch-series.
> >
> > Are you aware of the current kcompressd work:
> >
> > https://lore.kernel.org/all/20250430082651.3152444-1-qun-wei.lin@...iatek.com/
> >
> > It basically offloads compression work into a separate kernel thread
> > (kcompressd), for kswapd reclaim.
> >
> > This might provide you with a more natural place to perform batch
> > compression - instead of compressing one page at a time from the
> > worker thread's queue, you can grab a batch worth of pages and feed it
> > to IAA.
> >
> > Downside is it only applies to indirect reclaim. Proactive and direct
> > reclaimers are not covered, unfortunately.
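
Concretely, the batched drain might look something like this (sketch;
the queue type/helpers and the batched zswap entry point are
hypothetical placeholders):

#include <linux/mm.h>

#define SKETCH_BATCH_SIZE 8

/* Hypothetical primitives, declared only for the sketch. */
struct compress_queue;
struct folio *compress_queue_pop(struct compress_queue *q);
void zswap_store_batch(struct folio **folios, int n);

static void kcompressd_drain_batch(struct compress_queue *q)
{
	struct folio *batch[SKETCH_BATCH_SIZE];
	int n = 0;

	/* Grab up to a batch worth of queued folios... */
	while (n < SKETCH_BATCH_SIZE) {
		struct folio *folio = compress_queue_pop(q);

		if (!folio)
			break;
		batch[n++] = folio;
	}

	/* ...and compress them with one parallel call. */
	if (n)
		zswap_store_batch(batch, n);
}
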
> >
> > >
> > > 3) Decompression batching:
> > >
> > > We have developed a zswap load batching interface for IAA to be used
> > > for parallel decompression batching, using swapin_readahead().
> > >
> > > These capabilities are architected so as to be useful to zswap and
> > > zram. We are actively working on integrating these components with
> > > zram.
> >
> > Yeah problem with readahead is you can potentially get different
> > backends in the batch, and modifying readahead code is pretty ugly :)
> > But we'll see...
> >
>
> Another place where you can do decompression batching is for zswap
> writeback :) Right now, we are decompressing the pages and writing
> them back one page at a time. You can, however, grab a batch worth of
> them, feed to IAA for processing, before submitting them all for IO :)
Thanks Nhat, great idea!
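
Roughly, I picture the batched writeback path looking like this
(sketch; both helpers below are hypothetical placeholders, not
existing kernel API):

#include <linux/mm.h>

struct zswap_entry;	/* zswap's internal entry type */

/* Hypothetical primitives, declared only for the sketch. */
int zswap_decompress_batch(struct zswap_entry **entries,
			   struct folio **folios, int n);
void swap_writefolio(struct folio *folio);

static int zswap_writeback_batch_sketch(struct zswap_entry **entries,
					struct folio **folios, int n)
{
	int i, err;

	/* One parallel decompress for the whole batch (e.g. via IAA). */
	err = zswap_decompress_batch(entries, folios, n);
	if (err)
		return err;

	/* Then submit all the writeback IO back-to-back. */
	for (i = 0; i < n; i++)
		swap_writefolio(folios[i]);

	return 0;
}
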
>
> I have a prototype that performs batch writeback (mostly for IO
> efficiency purposes) - lmk if you want to play with it. Problem, as
> usual, is benchmarking :)
Sure, please do share the patch implementing the batch writeback,
and I can give it a try.
Thanks,
Kanchana