Message-ID: <dhw3loxu2myculgk5vhpsbe5nupzvtnkei7u4zknc5ce5c6w62@csablv6vl2e5>
Date: Wed, 4 Feb 2026 18:17:45 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, hannes@...xchg.org,
nphamcs@...il.com, chengming.zhou@...ux.dev, usamaarif642@...il.com,
ryan.roberts@....com, 21cnbao@...il.com, ying.huang@...ux.alibaba.com,
akpm@...ux-foundation.org, senozhatsky@...omium.org, sj@...nel.org, kasong@...cent.com,
linux-crypto@...r.kernel.org, herbert@...dor.apana.org.au, davem@...emloft.net,
clabbe@...libre.com, ardb@...nel.org, ebiggers@...gle.com, surenb@...gle.com,
kristen.c.accardi@...el.com, vinicius.gomes@...el.com, giovanni.cabiddu@...el.com,
wajdi.k.feghali@...el.com
Subject: Re: [PATCH v14 26/26] mm: zswap: Batched zswap_compress() for
compress batching of large folios.

On Sat, Jan 24, 2026 at 07:35:37PM -0800, Kanchana P Sridhar wrote:
> We introduce a new batching implementation of zswap_compress() that
> works for compressors both with and without batching support. This
> eliminates code duplication and keeps the code maintainable as compress
> batching is introduced.
>
> The earlier approach of calling zswap_compress() sequentially, one page
> at a time from zswap_store_pages(), is replaced with a new version of
> zswap_compress() that accepts multiple pages to compress as a batch.
>
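> To illustrate the shape of the change, here is a minimal sketch (the
> "before" loop mirrors the current one-page-at-a-time zswap_compress()
> call; the "after" signature and locals such as 'index' and 'entries'
> are illustrative assumptions, not the literal code in this patch):
>
>     /* Before (sketch): zswap_store_pages() compresses one page at a time. */
>     for (i = 0; i < nr_pages; i++) {
>         if (!zswap_compress(folio_page(folio, index + i), entries[i], pool))
>             goto store_pages_failed;
>     }
>
>     /* After (sketch): the whole batch is handed to zswap_compress() once. */
>     if (!zswap_compress(folio, index, nr_pages, entries, pool))
>         goto store_pages_failed;
>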
> If the compressor does not support batching, each page in the batch is
> compressed and stored sequentially. If the compressor supports batching,
> e.g. 'deflate-iaa' on the Intel IAA hardware accelerator, the batch is
> compressed in parallel in hardware.
>
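> Conceptually, the single zswap_compress() now dispatches on the
> compressor's batch size. The sketch below uses placeholder helper names
> and a 'batch_size' obtained from the acomp driver; it illustrates the
> control flow only and is not the actual code in this patch:
>
>     if (batch_size > 1) {
>         /* e.g. deflate-iaa: all pages of the batch compressed in parallel */
>         err = compress_batch_in_hw(folio, index, nr_pages, entries, acomp_ctx);
>     } else {
>         /* software compressors: compress and store one page at a time */
>         for (i = 0; i < nr_pages && !err; i++)
>             err = compress_one_page(folio_page(folio, index + i),
>                                     entries[i], acomp_ctx);
>     }
>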
> If the batch is compressed without errors, the compressed buffers for
> the batch are stored in zsmalloc. In case of compression errors, the
> current behavior, based on whether the folio is enabled for zswap
> writeback, is preserved.
>
> The batched zswap_compress() incorporates Herbert's suggestion to use
> SG lists to represent the batch's inputs/outputs when interfacing with
> the crypto API [1].
>
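> As a rough illustration of the SG list idea: one scatterlist entry is
> set up per source page of the batch, and one per destination buffer for
> the compressed outputs. The scatterlist calls below are the standard
> kernel API; the batch-size constant, the per-CPU 'acomp_ctx->buffers[]'
> array, and how the acomp request consumes the two lists are assumptions
> here, since this patch is what defines that interface:
>
>     struct scatterlist inputs[MAX_BATCH_SIZE], outputs[MAX_BATCH_SIZE];
>
>     sg_init_table(inputs, nr_pages);
>     sg_init_table(outputs, nr_pages);
>
>     for (i = 0; i < nr_pages; i++) {
>         /* one SG entry per source page in the batch */
>         sg_set_page(&inputs[i], folio_page(folio, index + i), PAGE_SIZE, 0);
>         /* one per-page destination buffer for the compressed output */
>         sg_set_buf(&outputs[i], acomp_ctx->buffers[i], PAGE_SIZE);
>     }
>
>     /* the batched acomp request is then given both SG lists */
>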
> Performance data:
> =================
> As suggested by Barry, this is the performance data gathered on Intel
> Sapphire Rapids with two workloads:
>
> 1) 30 usemem processes in a 150 GB memory limited cgroup, each
>    allocating 10 GB, i.e., effectively running at 50% memory pressure.
> 2) kernel_compilation "defconfig", 32 threads, cgroup memory limit set
>    to 1.7 GiB (50% memory pressure, since baseline memory usage is
>    3.4 GiB); data averaged across 10 runs.
>
> To keep comparisons simple, all testing was done without the
> zswap shrinker.
>
> =========================================================================
> IAA                 mm-unstable-1-23-2026              v14
> =========================================================================
> zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
>                                                       vs. IAA Sequential
> =========================================================================
> usemem30, 64K folios:
>
> Total throughput (KB/s)        6,226,967      10,551,714             69%
> Average throughput (KB/s)        207,565         351,723             69%
> elapsed time (sec)                 99.19           67.45            -32%
> sys time (sec)                  2,356.19        1,580.47            -33%
>
> usemem30, PMD folios:
>
> Total throughput (KB/s)        6,347,201      11,315,500             78%
> Average throughput (KB/s)        211,573         377,183             78%
> elapsed time (sec)                 88.14           63.37            -28%
> sys time (sec)                  2,025.53        1,455.23            -28%
>
> kernel_compilation, 64K folios:
>
> elapsed time (sec)                100.10           98.74           -1.4%
> sys time (sec)                    308.72          301.23             -2%
>
> kernel_compilation, PMD folios:
>
> elapsed time (sec)                 95.29           93.44           -1.9%
> sys time (sec)                    346.21          344.48           -0.5%
> =========================================================================
>
> =========================================================================
> ZSTD                mm-unstable-1-23-2026              v14
> =========================================================================
> zswap compressor                    zstd            zstd        v14 ZSTD
>                                                              Improvement
> =========================================================================
> usemem30, 64K folios:
>
> Total throughput (KB/s)        6,032,326       6,047,448            0.3%
> Average throughput (KB/s)        201,077         201,581            0.3%
> elapsed time (sec)                 97.52           95.33           -2.2%
> sys time (sec)                  2,415.40        2,328.38             -4%
>
> usemem30, PMD folios:
>
> Total throughput (KB/s)        6,570,404       6,623,962            0.8%
> Average throughput (KB/s)        219,013         220,798            0.8%
> elapsed time (sec)                 89.17           88.25             -1%
> sys time (sec)                  2,126.69        2,043.08             -4%
>
> kernel_compilation, 64K folios:
>
> elapsed time (sec)                100.89           99.98           -0.9%
> sys time (sec)                    417.49          414.62           -0.7%
>
> kernel_compilation, PMD folios:
>
> elapsed time (sec)                 98.26           97.38           -0.9%
> sys time (sec)                    487.14          473.16           -2.9%
> =========================================================================
>
> Architectural considerations for the zswap batching framework:
> ==============================================================
> We have designed the zswap batching framework to be
> hardware-agnostic. It has no dependencies on Intel-specific features and
> can be leveraged by any hardware accelerator or software-based
> compressor. In other words, the framework is vendor-neutral by design.
>
> Potential future clients of the batching framework:
> ===================================================
> This patch series demonstrates the performance benefits of compression
> batching when used in zswap_store() of large folios. Compression
> batching can also serve other use cases, such as compression in zram,
> batch compression of different folios during reclaim, kcompressd, file
> systems, etc. Decompression batching can likewise improve the
> efficiency of zswap writeback (thanks to Nhat for this idea), batched
> decompression in zram, etc.
>
> Experiments with kernel_compilation "allmodconfig" that combine zswap
> compress batching, folio reclaim batching, and writeback batching show
> that 0 pages are written back with deflate-iaa and zstd. For comparison,
> the baselines for these compressors see 200K-800K pages written to disk.
> Reclaim batching relieves memory pressure faster than reclaiming one
> folio at a time, and hence alleviates the need to scan slab memory for
> writeback.
>
> [1]: https://lore.kernel.org/all/aJ7Fk6RpNc815Ivd@gondor.apana.org.au/T/#m99aea2ce3d284e6c5a3253061d97b08c4752a798
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>

Herbert, could you please review this patch since most of it is using
new crypto APIs?
Thanks!