Message-ID: <CAKEwX=PS7mjuhaazydkE2TOVa5DWQu9521FqH4aXi0yptZQaeA@mail.gmail.com>
Date: Fri, 30 Jan 2026 17:12:35 -0800
From: Nhat Pham <nphamcs@...il.com>
To: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, hannes@...xchg.org,
yosry.ahmed@...ux.dev, chengming.zhou@...ux.dev, usamaarif642@...il.com,
ryan.roberts@....com, 21cnbao@...il.com, ying.huang@...ux.alibaba.com,
akpm@...ux-foundation.org, senozhatsky@...omium.org, sj@...nel.org,
kasong@...cent.com, linux-crypto@...r.kernel.org, herbert@...dor.apana.org.au,
davem@...emloft.net, clabbe@...libre.com, ardb@...nel.org,
ebiggers@...gle.com, surenb@...gle.com, kristen.c.accardi@...el.com,
vinicius.gomes@...el.com, giovanni.cabiddu@...el.com,
wajdi.k.feghali@...el.com
Subject: Re: [PATCH v14 26/26] mm: zswap: Batched zswap_compress() for compress batching of large folios.

On Sat, Jan 24, 2026 at 7:36 PM Kanchana P Sridhar
<kanchana.p.sridhar@...el.com> wrote:
>
> We introduce a new batching implementation of zswap_compress() for
> compressors that do and do not support batching. This eliminates code
> duplication and facilitates code maintainability with the introduction
> of compress batching.
>
> The vectorized implementation of calling the earlier zswap_compress()
> sequentially, one page at a time in zswap_store_pages(), is replaced
> with this new version of zswap_compress() that accepts multiple pages to
> compress as a batch.
>
> If the compressor does not support batching, each page in the batch is
> compressed and stored sequentially. If the compressor supports batching,
> e.g., 'deflate-iaa', the Intel IAA hardware accelerator, the batch
> is compressed in parallel in hardware.
>
> If the batch is compressed without errors, the compressed buffers for
> the batch are stored in zsmalloc. In case of compression errors, the
> current behavior, which depends on whether the folio is enabled for
> zswap writeback, is preserved.
>
> The batched zswap_compress() incorporates Herbert's suggestion for
> SG lists to represent the batch's inputs/outputs to interface with the
> crypto API [1].
>
> Performance data:
> =================
> As suggested by Barry, this is the performance data gathered on Intel
> Sapphire Rapids with two workloads:
>
> 1) 30 usemem processes in a 150 GB memory-limited cgroup, each
>    allocating 10G, i.e., effectively running at 50% memory pressure.
> 2) kernel_compilation "defconfig", 32 threads, cgroup memory limit set
>    to 1.7 GiB (50% memory pressure, since baseline memory usage is 3.4
>    GiB): data averaged across 10 runs.
>
> To keep comparisons simple, all testing was done without the
> zswap shrinker.
>
> ============================================================================
> IAA                        mm-unstable-1-23-2026          v14
> ============================================================================
> zswap compressor                     deflate-iaa  deflate-iaa  IAA Batching
>                                                                    vs.
>                                                               IAA Sequential
> ============================================================================
> usemem30, 64K folios:
>
> Total throughput (KB/s)                6,226,967   10,551,714            69%
> Average throughput (KB/s)                207,565      351,723            69%
> elapsed time (sec)                         99.19        67.45           -32%
> sys time (sec)                          2,356.19     1,580.47           -33%
>
> usemem30, PMD folios:
>
> Total throughput (KB/s)                6,347,201   11,315,500            78%
> Average throughput (KB/s)                211,573      377,183            78%
> elapsed time (sec)                         88.14        63.37           -28%
> sys time (sec)                          2,025.53     1,455.23           -28%
>
> kernel_compilation, 64K folios:
>
> elapsed time (sec)                        100.10        98.74          -1.4%
> sys time (sec)                            308.72       301.23            -2%
>
> kernel_compilation, PMD folios:
>
> elapsed time (sec)                         95.29        93.44          -1.9%
> sys time (sec)                            346.21       344.48          -0.5%
> ============================================================================
>
> ============================================================================
> ZSTD                       mm-unstable-1-23-2026          v14
> ============================================================================
> zswap compressor                            zstd         zstd     v14 ZSTD
>                                                                  Improvement
> ============================================================================
> usemem30, 64K folios:
>
> Total throughput (KB/s)                6,032,326    6,047,448           0.3%
> Average throughput (KB/s)                201,077      201,581           0.3%
> elapsed time (sec)                         97.52        95.33          -2.2%
> sys time (sec)                          2,415.40     2,328.38            -4%
>
> usemem30, PMD folios:
>
> Total throughput (KB/s)                6,570,404    6,623,962           0.8%
> Average throughput (KB/s)                219,013      220,798           0.8%
> elapsed time (sec)                         89.17        88.25            -1%
> sys time (sec)                          2,126.69     2,043.08            -4%
>
> kernel_compilation, 64K folios:
>
> elapsed time (sec)                        100.89        99.98          -0.9%
> sys time (sec)                            417.49       414.62          -0.7%
>
> kernel_compilation, PMD folios:
>
> elapsed time (sec)                         98.26        97.38          -0.9%
> sys time (sec)                            487.14       473.16          -2.9%
> ============================================================================

The rest of the patch changelog (architectural and future
considerations) can stay in the cover letter. Let's not duplicate
information :)

Keep the patch changelog limited to only the changes in the patch
itself (unless we need some clarifications that are immediately
relevant).

I'll review the remainder of the patch later :)
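
For readers skimming the thread, the dispatch described in the quoted
changelog (one batched submission when the compressor supports it, a
page-by-page loop otherwise) can be sketched in user-space C roughly as
below. This is only an illustration under stated assumptions, not code
from the patch: struct sketch_compressor, sketch_compress_pages(),
SKETCH_MAX_BATCH and the mock back ends are all hypothetical names, and
the real zswap_compress() works with folios, SG lists and the crypto
API rather than plain pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_BATCH 8 /* hypothetical upper bound on pages per batch */

/* Hypothetical compressor descriptor; NOT a kernel or crypto API structure. */
struct sketch_compressor {
	bool supports_batching;
	/* Compress one page; returns the compressed length or a negative error. */
	int (*compress_one)(const void *page);
	/* Compress nr pages in one submission; returns 0 or a negative error. */
	int (*compress_batch)(const void *const pages[], size_t nr, int out_len[]);
};

/*
 * Compress nr pages with a single entry point. Returns 0 when every page
 * compressed, otherwise the first negative error encountered.
 */
int sketch_compress_pages(const struct sketch_compressor *c,
			  const void *const pages[], size_t nr, int out_len[])
{
	assert(nr > 0 && nr <= SKETCH_MAX_BATCH);

	/* Batching compressor: hand over the whole batch at once. */
	if (c->supports_batching)
		return c->compress_batch(pages, nr, out_len);

	/* Non-batching compressor: same entry point, sequential loop. */
	for (size_t i = 0; i < nr; i++) {
		int len = c->compress_one(pages[i]);

		if (len < 0)
			return len;
		out_len[i] = len;
	}
	return 0;
}

/* Mock back ends that only count how often they are invoked. */
static int mock_one_calls, mock_batch_calls;

static int mock_compress_one(const void *page)
{
	(void)page;
	mock_one_calls++;
	return 100; /* pretend every page compresses to 100 bytes */
}

static int mock_compress_batch(const void *const pages[], size_t nr,
			       int out_len[])
{
	(void)pages;
	mock_batch_calls++;
	for (size_t i = 0; i < nr; i++)
		out_len[i] = 100;
	return 0;
}
```

With the mocks above, a four-page batch results in four compress_one
calls for a non-batching compressor and a single compress_batch call
for a batching one, which is the code-path unification the changelog
describes. Error handling and the store step are deliberately left out.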