Message-Id: <20260125033537.334628-27-kanchana.p.sridhar@intel.com>
Date: Sat, 24 Jan 2026 19:35:37 -0800
From: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
To: linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	hannes@...xchg.org,
	yosry.ahmed@...ux.dev,
	nphamcs@...il.com,
	chengming.zhou@...ux.dev,
	usamaarif642@...il.com,
	ryan.roberts@....com,
	21cnbao@...il.com,
	ying.huang@...ux.alibaba.com,
	akpm@...ux-foundation.org,
	senozhatsky@...omium.org,
	sj@...nel.org,
	kasong@...cent.com,
	linux-crypto@...r.kernel.org,
	herbert@...dor.apana.org.au,
	davem@...emloft.net,
	clabbe@...libre.com,
	ardb@...nel.org,
	ebiggers@...gle.com,
	surenb@...gle.com,
	kristen.c.accardi@...el.com,
	vinicius.gomes@...el.com,
	giovanni.cabiddu@...el.com
Cc: wajdi.k.feghali@...el.com,
	kanchana.p.sridhar@...el.com
Subject: [PATCH v14 26/26] mm: zswap: Batched zswap_compress() for compress batching of large folios.

We introduce a new batching implementation of zswap_compress() that
handles both compressors that support batching and those that do not.
Having a single code path for both cases eliminates code duplication and
keeps the code maintainable as compress batching is introduced.

The earlier approach in zswap_store_pages() of calling the previous
zswap_compress() sequentially, one page at a time, is replaced with this
new version of zswap_compress() that accepts multiple pages and
compresses them as a batch.

If the compressor does not support batching, each page in the batch is
compressed and stored sequentially. If the compressor supports batching,
e.g., 'deflate-iaa' backed by the Intel IAA hardware accelerator, the
batch is compressed in parallel in hardware.

If the batch is compressed without errors, the compressed buffers for
the batch are stored in zsmalloc. In case of compression errors, the
current behavior, which depends on whether the folio is enabled for
zswap writeback, is preserved.

The batched zswap_compress() incorporates Herbert's suggestion to use
SG lists to represent the batch's inputs/outputs when interfacing with
the crypto API [1].
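
As a rough sketch of that interface, abbreviated from the new
zswap_compress() in the diff below (per-CPU context setup, dst length
initialization and error handling elided):

	/* Source: submit the batch of folio pages directly. */
	acomp_request_set_src_folio(acomp_ctx->req, folio,
				    (folio_start + batch_start) * PAGE_SIZE,
				    nr_batch_pages * PAGE_SIZE);

	/* Destination: per-CPU buffers pre-mapped into an SG table. */
	acomp_request_set_dst_sg(acomp_ctx->req, acomp_ctx->sg_table->sgl,
				 nr_batch_pages * PAGE_SIZE);

	err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
			      &acomp_ctx->wait);

	/* Each output sg->length now holds a compressed length or an errno. */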

Performance data:
=================
As suggested by Barry, this is the performance data gathered on Intel
Sapphire Rapids with two workloads:

1) 30 usemem processes in a cgroup with a 150 GB memory limit, each
   allocating 10 GB, i.e., effectively running at 50% memory pressure.
2) kernel_compilation "defconfig", 32 threads, cgroup memory limit set
   to 1.7 GiB (50% memory pressure, since baseline memory usage is 3.4
   GiB): data averaged across 10 runs.

To keep comparisons simple, all testing was done without the
zswap shrinker.

 =========================================================================
  IAA                 mm-unstable-1-23-2026             v14
 =========================================================================
    zswap compressor            deflate-iaa     deflate-iaa   IAA Batching
                                                                  vs.
                                                            IAA Sequential
 =========================================================================
 usemem30, 64K folios:

    Total throughput (KB/s)       6,226,967      10,551,714       69%
    Average throughput (KB/s)       207,565         351,723       69%
    elapsed time (sec)                99.19           67.45      -32%
    sys time (sec)                 2,356.19        1,580.47      -33%

 usemem30, PMD folios:

    Total throughput (KB/s)       6,347,201      11,315,500       78%
    Average throughput (KB/s)       211,573         377,183       78%
    elapsed time (sec)                88.14           63.37      -28%
    sys time (sec)                 2,025.53        1,455.23      -28%

 kernel_compilation, 64K folios:

    elapsed time (sec)               100.10           98.74     -1.4%
    sys time (sec)                   308.72          301.23       -2%

 kernel_compilation, PMD folios:

    elapsed time (sec)                95.29           93.44     -1.9%
    sys time (sec)                   346.21          344.48     -0.5%
 =========================================================================

 =========================================================================
  ZSTD                mm-unstable-1-23-2026             v14
 =========================================================================
    zswap compressor                   zstd            zstd     v14 ZSTD
                                                             Improvement
 =========================================================================
 usemem30, 64K folios:

    Total throughput (KB/s)       6,032,326       6,047,448      0.3%
    Average throughput (KB/s)       201,077         201,581      0.3%
    elapsed time (sec)                97.52           95.33     -2.2%
    sys time (sec)                 2,415.40        2,328.38       -4%

 usemem30, PMD folios:

    Total throughput (KB/s)       6,570,404       6,623,962      0.8%
    Average throughput (KB/s)       219,013         220,798      0.8%
    elapsed time (sec)                89.17           88.25       -1%
    sys time (sec)                 2,126.69        2,043.08       -4%

 kernel_compilation, 64K folios:

    elapsed time (sec)               100.89           99.98     -0.9%
    sys time (sec)                   417.49          414.62     -0.7%

 kernel_compilation, PMD folios:

    elapsed time (sec)                98.26           97.38     -0.9%
    sys time (sec)                   487.14          473.16     -2.9%
 =========================================================================

Architectural considerations for the zswap batching framework:
==============================================================
We have designed the zswap batching framework to be
hardware-agnostic. It has no dependencies on Intel-specific features and
can be leveraged by any hardware accelerator or software-based
compressor. In other words, the framework is open and inclusive by
design.

Potential future clients of the batching framework:
===================================================
This patch series demonstrates the performance benefits of compression
batching when used in zswap_store() of large folios. Compression
batching can be used for other use cases such as batching compression in
zram, batch compression of different folios during reclaim, kcompressd,
file systems, etc. Decompression batching can be used to improve the
efficiency of zswap writeback (thanks to Nhat for this idea), batching
decompressions in zram, etc.

Experiments with kernel_compilation "allmodconfig" that combine zswap
compress batching, folio reclaim batching, and writeback batching show
that 0 pages are written back with deflate-iaa and zstd. For comparison,
the baselines for these compressors see 200K-800K pages written to disk.
Reclaim batching relieves memory pressure faster than reclaiming one
folio at a time, and hence alleviates the need to scan slab memory for
writeback.

[1]: https://lore.kernel.org/all/aJ7Fk6RpNc815Ivd@gondor.apana.org.au/T/#m99aea2ce3d284e6c5a3253061d97b08c4752a798

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
---
 mm/zswap.c | 260 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 190 insertions(+), 70 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 6a22add63220..399112af2c54 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -145,6 +145,7 @@ struct crypto_acomp_ctx {
 	struct acomp_req *req;
 	struct crypto_wait wait;
 	u8 **buffers;
+	struct sg_table *sg_table;
 	struct mutex mutex;
 };
 
@@ -272,6 +273,11 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
 			kfree(acomp_ctx->buffers[i]);
 		kfree(acomp_ctx->buffers);
 	}
+
+	if (acomp_ctx->sg_table) {
+		sg_free_table(acomp_ctx->sg_table);
+		kfree(acomp_ctx->sg_table);
+	}
 }
 
 static struct zswap_pool *zswap_pool_create(char *compressor)
@@ -834,6 +840,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	int nid = cpu_to_node(cpu);
+	struct scatterlist *sg;
 	int ret = -ENOMEM;
 	u8 i;
 
@@ -880,6 +887,22 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 			goto fail;
 	}
 
+	acomp_ctx->sg_table = kmalloc(sizeof(*acomp_ctx->sg_table),
+					GFP_KERNEL);
+	if (!acomp_ctx->sg_table)
+		goto fail;
+
+	if (sg_alloc_table(acomp_ctx->sg_table, pool->compr_batch_size,
+			   GFP_KERNEL))
+		goto fail;
+
+	/*
+	 * Statically map the per-CPU destination buffers to the per-CPU
+	 * SG lists.
+	 */
+	for_each_sg(acomp_ctx->sg_table->sgl, sg, pool->compr_batch_size, i)
+		sg_set_buf(sg, acomp_ctx->buffers[i], PAGE_SIZE);
+
 	/*
 	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
@@ -900,84 +923,177 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	return ret;
 }
 
-static bool zswap_compress(struct page *page, struct zswap_entry *entry,
-			   struct zswap_pool *pool, bool wb_enabled)
+/*
+ * zswap_compress() batching implementation for sequential and batching
+ * compressors.
+ *
+ * Description:
+ * ============
+ *
+ * Compress multiple @nr_pages in @folio starting from the @folio_start index in
+ * batches of @nr_batch_pages.
+ *
+ * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE. zswap_store() makes
+ * sure of this by design and zswap_store_pages() warns if this is not true.
+ *
+ * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the compressor does not
+ * support batching.
+ *
+ * If @nr_batch_pages is 1, each page is processed sequentially.
+ *
+ * If @nr_batch_pages is > 1, compression batching is invoked within
+ * the algorithm's driver, except if @nr_pages is 1: if so, the driver can
+ * choose to call its sequential/non-batching compress routine.
+ *
+ * In both cases, if all compressions are successful, the compressed buffers
+ * are stored in zsmalloc.
+ *
+ * Design notes for batching compressors:
+ * ======================================
+ *
+ * Traversing SG lists when @nr_batch_pages is > 1 is expensive, and
+ * impacts batching performance if repeated:
+ *   - to map destination buffers to each SG list in @acomp_ctx->sg_table.
+ *   - to initialize each output @sg->length to PAGE_SIZE.
+ *
+ * Design choices made to optimize batching with SG lists:
+ *
+ * 1) The source folio pages in the batch are directly submitted to
+ *    crypto_acomp via acomp_request_set_src_folio().
+ *
+ * 2) The per-CPU @acomp_ctx->sg_table scatterlists are statically mapped
+ *    to the per-CPU dst @buffers at pool creation time.
+ *
+ * 3) zswap_compress() sets the output SG list length to PAGE_SIZE for
+ *    non-batching compressors. The batching compressor's driver should do this
+ *    as part of iterating through the dst SG lists for batch compression setup.
+ *
+ * Considerations for non-batching and batching compressors:
+ * =========================================================
+ *
+ * For each output SG list in @acomp_ctx->sg_table->sgl, the @sg->length
+ * should be set to either the page's compressed length (success), or its
+ * compression error value.
+ */
+static bool zswap_compress(struct folio *folio,
+			   long folio_start,
+			   u8 nr_pages,
+			   u8 nr_batch_pages,
+			   struct zswap_entry *entries[],
+			   struct zs_pool *zs_pool,
+			   struct crypto_acomp_ctx *acomp_ctx,
+			   int nid,
+			   bool wb_enabled)
 {
-	struct crypto_acomp_ctx *acomp_ctx;
-	struct scatterlist input, output;
-	int comp_ret = 0, alloc_ret = 0;
-	unsigned int dlen = PAGE_SIZE;
+	gfp_t gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
+	unsigned int slen = nr_batch_pages * PAGE_SIZE;
+	u8 batch_start, batch_iter, compr_batch_size_iter;
+	struct scatterlist *sg;
 	unsigned long handle;
-	gfp_t gfp;
-	u8 *dst;
-	bool mapped = false;
-
-	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
-	mutex_lock(&acomp_ctx->mutex);
-
-	dst = acomp_ctx->buffers[0];
-	sg_init_table(&input, 1);
-	sg_set_page(&input, page, PAGE_SIZE, 0);
-
-	sg_init_one(&output, dst, PAGE_SIZE);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+	int err, dlen;
+	void *dst;
 
 	/*
-	 * it maybe looks a little bit silly that we send an asynchronous request,
-	 * then wait for its completion synchronously. This makes the process look
-	 * synchronous in fact.
-	 * Theoretically, acomp supports users send multiple acomp requests in one
-	 * acomp instance, then get those requests done simultaneously. but in this
-	 * case, zswap actually does store and load page by page, there is no
-	 * existing method to send the second page before the first page is done
-	 * in one thread doing zswap.
-	 * but in different threads running on different cpu, we have different
-	 * acomp instance, so multiple threads can do (de)compression in parallel.
+	 * Locking the acomp_ctx mutex once per store batch results in better
+	 * performance as compared to locking per compress batch.
 	 */
-	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
+	mutex_lock(&acomp_ctx->mutex);
 
 	/*
-	 * If a page cannot be compressed into a size smaller than PAGE_SIZE,
-	 * save the content as is without a compression, to keep the LRU order
-	 * of writebacks.  If writeback is disabled, reject the page since it
-	 * only adds metadata overhead.  swap_writeout() will put the page back
-	 * to the active LRU list in the case.
+	 * Compress the @nr_pages in @folio starting at index @folio_start
+	 * in batches of @nr_batch_pages.
 	 */
-	if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
-		if (!wb_enabled) {
-			comp_ret = comp_ret ? comp_ret : -EINVAL;
-			goto unlock;
-		}
-		comp_ret = 0;
-		dlen = PAGE_SIZE;
-		dst = kmap_local_page(page);
-		mapped = true;
-	}
+	for (batch_start = 0; batch_start < nr_pages;
+	     batch_start += nr_batch_pages) {
+		/*
+		 * Send @nr_batch_pages to crypto_acomp for compression:
+		 *
+		 * These pages are in @folio's range of indices in the interval
+		 *     [@folio_start + @batch_start,
+		 *      @folio_start + @batch_start + @nr_batch_pages).
+		 *
+		 * @slen indicates the total source length bytes for @nr_batch_pages.
+		 *
+		 * The pool's compressor batch size is at least @nr_batch_pages,
+		 * hence the acomp_ctx has at least @nr_batch_pages dst @buffers.
+		 */
+		acomp_request_set_src_folio(acomp_ctx->req, folio,
+					    (folio_start + batch_start) * PAGE_SIZE,
+					    slen);
+
+		acomp_ctx->sg_table->sgl->length = slen;
+
+		acomp_request_set_dst_sg(acomp_ctx->req,
+					 acomp_ctx->sg_table->sgl,
+					 slen);
+
+		err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
+				      &acomp_ctx->wait);
+
+		/*
+		 * If a page cannot be compressed into a size smaller than
+		 * PAGE_SIZE, save the content as is without a compression, to
+		 * keep the LRU order of writebacks.  If writeback is disabled,
+		 * reject the page since it only adds metadata overhead.
+		 * swap_writeout() will put the page back to the active LRU list
+		 * in the case.
+		 *
+		 * It is assumed that any compressor that sets the output length
+		 * to 0 or a value >= PAGE_SIZE will also return a negative
+		 * error status in @err; i.e, will not return a successful
+		 * compression status in @err in this case.
+		 */
+		if (unlikely(err && !wb_enabled))
+			goto compress_error;
+
+		for_each_sg(acomp_ctx->sg_table->sgl, sg, nr_batch_pages,
+			    compr_batch_size_iter) {
+			batch_iter = batch_start + compr_batch_size_iter;
+			dst = acomp_ctx->buffers[compr_batch_size_iter];
+			dlen = sg->length;
+
+			if (dlen < 0) {
+				dlen = PAGE_SIZE;
+				dst = kmap_local_page(folio_page(folio,
+						      folio_start + batch_iter));
+			}
+
+			handle = zs_malloc(zs_pool, dlen, gfp, nid);
+
+			if (unlikely(IS_ERR_VALUE(handle))) {
+				if (PTR_ERR((void *)handle) == -ENOSPC)
+					zswap_reject_compress_poor++;
+				else
+					zswap_reject_alloc_fail++;
 
-	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
-	handle = zs_malloc(pool->zs_pool, dlen, gfp, page_to_nid(page));
-	if (IS_ERR_VALUE(handle)) {
-		alloc_ret = PTR_ERR((void *)handle);
-		goto unlock;
+				goto err_unlock;
+			}
+
+			zs_obj_write(zs_pool, handle, dst, dlen);
+			entries[batch_iter]->handle = handle;
+			entries[batch_iter]->length = dlen;
+			if (dst != acomp_ctx->buffers[compr_batch_size_iter])
+				kunmap_local(dst);
+		}
 	}
 
-	zs_obj_write(pool->zs_pool, handle, dst, dlen);
-	entry->handle = handle;
-	entry->length = dlen;
+	mutex_unlock(&acomp_ctx->mutex);
+	return true;
 
-unlock:
-	if (mapped)
-		kunmap_local(dst);
-	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
-		zswap_reject_compress_poor++;
-	else if (comp_ret)
-		zswap_reject_compress_fail++;
-	else if (alloc_ret)
-		zswap_reject_alloc_fail++;
+compress_error:
+	for_each_sg(acomp_ctx->sg_table->sgl, sg, nr_batch_pages,
+		    compr_batch_size_iter) {
+		if ((int)sg->length < 0) {
+			if ((int)sg->length == -ENOSPC)
+				zswap_reject_compress_poor++;
+			else
+				zswap_reject_compress_fail++;
+		}
+	}
 
+err_unlock:
 	mutex_unlock(&acomp_ctx->mutex);
-	return comp_ret == 0 && alloc_ret == 0;
+	return false;
 }
 
 static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
@@ -1499,12 +1615,16 @@ static bool zswap_store_pages(struct folio *folio,
 		INIT_LIST_HEAD(&entries[i]->lru);
 	}
 
-	for (i = 0; i < nr_pages; ++i) {
-		struct page *page = folio_page(folio, start + i);
-
-		if (!zswap_compress(page, entries[i], pool, wb_enabled))
-			goto store_pages_failed;
-	}
+	if (unlikely(!zswap_compress(folio,
+				     start,
+				     nr_pages,
+				     min(nr_pages, pool->compr_batch_size),
+				     entries,
+				     pool->zs_pool,
+				     acomp_ctx,
+				     nid,
+				     wb_enabled)))
+		goto store_pages_failed;
 
 	for (i = 0; i < nr_pages; ++i) {
 		struct zswap_entry *old, *entry = entries[i];
-- 
2.27.0

