Message-ID: <CAKEwX=MybjpmXVxM3QbfdQyXOv2xq87CZKzh1w2pdxucwSMttA@mail.gmail.com>
Date: Thu, 8 May 2025 14:20:04 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, hannes@...xchg.org,
yosry.ahmed@...ux.dev, chengming.zhou@...ux.dev, usamaarif642@...il.com,
ryan.roberts@....com, 21cnbao@...il.com, ying.huang@...ux.alibaba.com,
akpm@...ux-foundation.org, senozhatsky@...omium.org,
linux-crypto@...r.kernel.org, herbert@...dor.apana.org.au,
davem@...emloft.net, clabbe@...libre.com, ardb@...nel.org,
ebiggers@...gle.com, surenb@...gle.com, kristen.c.accardi@...el.com,
vinicius.gomes@...el.com, wajdi.k.feghali@...el.com, vinodh.gopal@...el.com
Subject: Re: [RESEND PATCH v9 00/19] zswap compression batching
On Thu, May 8, 2025 at 1:55 PM Nhat Pham <nphamcs@...il.com> wrote:
>
> On Thu, May 8, 2025 at 12:41 PM Kanchana P Sridhar
> <kanchana.p.sridhar@...el.com> wrote:
> >
> >
> > Compression Batching:
> > =====================
> >
> > This patch-series introduces batch compression of pages in large folios to
> > improve zswap swapout latency. It preserves the existing zswap protocols
> > for non-batching software compressors by calling crypto_acomp sequentially
> > per page in the batch. Additionally, in support of hardware accelerators
> > that can process a batch as an integral unit, the patch-series creates
> > generic batching interfaces in crypto_acomp, and calls the
> > crypto_acomp_batch_compress() interface in zswap_compress() for compressors
> > that intrinsically support batching.
> >
> > The patch series provides a proof point by using the Intel Analytics
> > Accelerator (IAA) for implementing the compress/decompress batching API
> > using hardware parallelism in the iaa_crypto driver and another proof point
> > with a sequential software compressor, zstd.
>
> Any plan on doing hardware accelerated/offloaded/parallelized zstd? :)
>
> >
> > SUMMARY:
> > ========
> >
> > The first proof point is to test with IAA using a sequential call (fully
> > synchronous, compress one page at a time) vs. a batching call (fully
> > asynchronous, submit a batch to IAA for parallel compression, then poll for
> > completion statuses).
> >
> > The performance testing data with usemem 30 processes and kernel
> > compilation test using 32 threads, show 67%-77% throughput gains and
> > 28%-32% sys time reduction (usemem30) and 2-3% sys time reduction
> > (kernel compilation) with zswap_store() large folios using IAA compress
> > batching as compared to IAA sequential.
> >
> > The second proof point is to make sure that software algorithms such as
> > zstd do not regress. The data indicates that for sequential software
> > algorithms a performance gain is achieved.
> >
> > With the performance optimizations implemented in patches 18 and 19 of
> > v9, zstd usemem30 throughput increases by 1%, along with a 6%-8% sys time
> > reduction. With kernel compilation using zstd, we get a 0.4%-3.2%
> > reduction in sys time. These optimizations pertain to common code
> > paths, removing redundant branches/computes, using prefetchw() of the
> > zswap entry before it is written, and selectively annotating branches
> > with likely()/unlikely() compiler directives to minimize branch
> > mis-prediction penalty. Additionally, using the batching code for
> > non-batching compressors to sequentially compress/store batches of up
> > to ZSWAP_MAX_BATCH_SIZE (8) pages seems to help, most likely due to
> > cache locality of working set structures such as the array of
> > zswap_entry-s for the batch.
>
> Nice!
>
> >
> > Our internal validation of zstd with the batching interface vs. IAA with
> > the batching interface on Emerald Rapids has shown that IAA
> > compress/decompress batching gives 21.3% more memory savings as compared
> > to zstd, for 5% performance loss as compared to the baseline without any
> > memory pressure. IAA batching demonstrates more than 2X the memory
> > savings obtained by zstd at this 95% performance KPI.
> > The compression ratio with IAA is 2.23, and with zstd 2.96. Even with
> > this compression ratio deficit for IAA, batching is extremely
>
> I'm confused. How does IAA give more memory savings, while having a
> worse compression ratio? How do you define memory savings here?
>
> > beneficial. As we improve the compression ratio of the IAA accelerator,
> > we expect to see even better memory savings with IAA as compared to
> > software compressors.
> >
> >
> > Batching Roadmap:
> > =================
> >
> > 1) Compression batching within large folios (this series).
> >
> > 2) Reclaim batching of hybrid folios:
> >
> > We can expect to see even more significant performance and throughput
> > improvements if we use the parallelism offered by IAA to do reclaim
> > batching of 4K/large folios (really any-order folios), and using the
> > zswap_store() high throughput compression pipeline to batch-compress
> > pages comprising these folios, not just batching within large
> > folios. This is the reclaim batching patch 13 in v1, which we expect
> > to submit in a separate patch-series.
>
> Are you aware of the current kcompressd work:
>
> https://lore.kernel.org/all/20250430082651.3152444-1-qun-wei.lin@mediatek.com/
>
> It basically offloads compression work into a separate kernel thread
> (kcompressd), for kswapd reclaim.
>
> This might provide you with a more natural place to perform batch
> compression - instead of compressing one page at a time from the
> worker thread's queue, you can grab a batch worth of pages and feed it
> to IAA.
>
> Downside is it only applies to indirect reclaim. Proactive and direct
> reclaimers are not covered, unfortunately.
>
> >
> > 3) Decompression batching:
> >
> > We have developed a zswap load batching interface for IAA to be used
> > for parallel decompression batching, using swapin_readahead().
> >
> > These capabilities are architected so as to be useful to zswap and
> > zram. We are actively working on integrating these components with zram.
>
> Yeah problem with readahead is you can potentially get different
> backends in the batch, and modifying readahead code is pretty ugly :)
> But we'll see...
>
Another place where you can do decompression batching is zswap
writeback :) Right now, we decompress the pages and write them back
one page at a time. You could instead grab a batch of them and feed
it to IAA for decompression before submitting them all for IO :)
I have a prototype that performs batch writeback (mostly for IO
efficiency) - lmk if you want to play with it. The problem, as
usual, is benchmarking :)
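
To make the shape of the idea concrete, here is a toy userspace model, not
the prototype and not kernel code. Everything in it is an assumption made
for illustration: zlib stands in for the compressor, decompress_batch()
stands in for a single batched submission to an accelerator such as IAA,
submit_io() is a hypothetical callback for the block layer, and the batch
size of 8 simply mirrors the ZSWAP_MAX_BATCH_SIZE value quoted above.

```python
import zlib

PAGE_SIZE = 4096
BATCH_SIZE = 8  # assumption: mirrors ZSWAP_MAX_BATCH_SIZE (8) from the cover letter


def decompress_batch(blobs):
    """Stand-in for one batched submission to a hardware decompressor.

    A real accelerator would process the batch in parallel; this toy
    version just loops, which is enough to show the control flow.
    """
    return [zlib.decompress(blob) for blob in blobs]


def writeback_batched(compressed_pages, submit_io, batch_size=BATCH_SIZE):
    """Decompress up to batch_size pages together, then issue one IO for them."""
    for start in range(0, len(compressed_pages), batch_size):
        batch = compressed_pages[start:start + batch_size]
        pages = decompress_batch(batch)  # one batched decompression
        submit_io(pages)                 # one IO submission per batch


def demo():
    """Write back 20 synthetic pages and record each IO submission."""
    pages = [bytes([i]) * PAGE_SIZE for i in range(20)]
    compressed = [zlib.compress(page) for page in pages]
    io_calls = []
    writeback_batched(compressed, io_calls.append)
    return pages, io_calls
```

With 20 pages and a batch size of 8, demo() records three IO submissions
(8, 8 and 4 pages) instead of 20 single-page ones. The model says nothing
about latency or throughput, which is exactly the benchmarking problem.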