[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <PH7PR11MB812143269E98B00ED4E5BE4DC93DA@PH7PR11MB8121.namprd11.prod.outlook.com>
Date: Fri, 22 Aug 2025 19:26:34 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Herbert Xu <herbert@...dor.apana.org.au>, Nhat Pham <nphamcs@...il.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "yosry.ahmed@...ux.dev" <yosry.ahmed@...ux.dev>,
"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"senozhatsky@...omium.org" <senozhatsky@...omium.org>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com"
<clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com"
<surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
"Gomes, Vinicius" <vinicius.gomes@...el.com>, "Feghali, Wajdi K"
<wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>,
"Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v11 00/24] zswap compression batching with optimized
iaa_crypto driver
> -----Original Message-----
> From: Herbert Xu <herbert@...dor.apana.org.au>
> Sent: Thursday, August 14, 2025 10:28 PM
> To: Nhat Pham <nphamcs@...il.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>; linux-
> kernel@...r.kernel.org; linux-mm@...ck.org; hannes@...xchg.org;
> yosry.ahmed@...ux.dev; chengming.zhou@...ux.dev;
> usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> senozhatsky@...omium.org; linux-crypto@...r.kernel.org;
> davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> <kristen.c.accardi@...el.com>; Gomes, Vinicius <vinicius.gomes@...el.com>;
> Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v11 00/24] zswap compression batching with optimized
> iaa_crypto driver
>
> On Fri, Aug 08, 2025 at 04:51:14PM -0700, Nhat Pham wrote:
> >
> > Can we get some comments from crypto tree maintainers as well? I feel
> > like this patch series is more crypto patch than zswap patch, at this
> > point.
> >
> > Can we land any zswap parts without the crypto API change? Grasping at
> > straws here, in case we can parallelize the reviewing and merging
> > process.
>
> My preference is for a unified interface that caters to both
> software compression as well as parallel hardware compression.
>
> The reason is that there is clear advantage in passing a large
> batch of pages to the Crypto API even for software compression,
> the least we could do is to pack the compressed result together
> and avoid the unnecessary copying of the compressed output that
> is currently done in zswap.
>
> However, since you guys are both happy with this patch-set,
> I'm not going stand in the way.
>
> But I do want some changes made to the proposed Crypto API interface
> so that it can be reused for IPComp.
>
> In particular, instead of passing an opaque pointer (kernel_data)
> to magically turn on batching, please add a new helper that enables
> batching.
>
> I don't think we need any extra fields in struct acomp_req apart
> from a new field called unit_size. This would be 4096 for zswap,
> it could be the MTU for IPsec.
>
> So add something like this and document that it must be called
> after acmop_request_set_callback (which should set unit_size to 0):
>
> static inline void acomp_request_set_unit_size(struct acomp_req *req,
> unsigned int du)
> {
> req->unit = du;
> }
>
> static inline void acomp_request_set_callback(struct acomp_req *req, ...)
> {
> ...
> + req->unit = 0;
> }
>
> For the source, nothing needs to be done because the folio could
> be passed in as is.
>
> For the destination, construct an SG list for them and pass that in.
> The rule should be that the SG list must contain a sufficient number
> of pages for the compression output based on the given unit size.
>
> For the output lengths, just set the lengths in the destination
> SG list after compression. If a page is incompressible (including
> an error), just set the length to a negative value (-ENOSPC could
> be used for incompressible input, as we already do). Even though
> struct scatterlist->length is unsigned, there should be no issue
> with storing a negative value there.
Hi Herbert, Nhat,
Thanks Herbert for these suggestions! I have implemented the new crypto API
and the SG list suggestion. While doing so, I was also able to consolidate the
new scatterlist based zswap_compress() implementation for software and hardware
(i.e. batching) compressors, within the constraints of not changing anything
below the crypto layer for software compressors.
I wanted to provide some additional details so that you can review the overall
approach and let me know if things look Ok. I will rebase the code to the latest
mm-unstable and start working on v12 in the meantime.
1) The zswap per-CPU acomp_ctx has two sg_tables added, one each for
inputs/outputs, with nents set to the pool->compr_batch_size (1 for software
compressors). This per-CPU data incurs additional memory overhead per-CPU,
however this is memory that will anyway be allocated on the stack in
zswap_compress(); and less memory overhead than the latter because we know
exactly how many sg_table scatterlists to allocate for the given pool
(assuming we don't kmalloc in zswap_compress()). I will make sure to quantify
the overhead in v12's commit logs.
2) I added new sg_alloc_table_node() and sg_kmalloc_node() to facilitate this.
3) I added the acomp_request_set_unit_size() helper for
batching; initialized the unit_size to 0 in
acomp_request_set_callback(). zswap_cpu_comp_prepare() will set the unit_size
to PAGE_SIZE after the call to acomp_request_set_callback().
4) Unified code in zswap_compress() for software and hardware compressors to use
the per-CPU SG lists. Some unavoidable changes for software path to use the
acomp_req->dlen instead of the SG list output length.
5) A trade-off I had to make in the iaa_crypto driver to adhere to the new SG
list based batching architecture:
Currently, all calls to dma_map_sg() in iaa_crypto_main.c use
sg_nents(req->src[dst]). This requires the sg_init_marker() is set correctly
based on the number of pages in the batch. This in turn requires
sg_unmark_end() to be called to clear the termination marker before
returning. All this adds latency to zswap_compress() (i.e. per batch compress
call) with the new approach and causes regression wrt v11.
To make the new approach functional and performant, I have changed all
the calls to dma_map_sg() to use nents of 1. This should not be a concern,
since it eliminates redundant computes to scan an SG list with only one
scatterlist for existing kernel users, i.e. zswap with the zswap_compress()
modifications described in (4). This will continue to hold true with the zram
IAA batching support I am developing. There are no kernel use cases for the
iaa_crypto driver that will break this assumption.
6) "For the source, nothing needs to be done because the folio could be passed
in as is.". As far as I know, this cannot be accomplished without
modifications to the crypto API for software compressors, because compressed
buffers need to be stored in the zswap/zram zs_pools at PAGE_SIZE
granularity.
I have validated the above v12 changes applied over the v11 patch-series,
on Sapphire Rapids:
1) usemem30: Both zstd and deflate-iaa have comparable performance to v11.
2) kernel_compilation test: Mostly better performance than baseline, but worse
than v11. Slightly worse sys time than baseline for zstd/PMD.
usemem30 with 64K folios:
=========================
zswap shrinker_enabled = N.
-----------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
-----------------------------------------------------------------------
zswap compressor deflate-iaa deflate-iaa deflate-iaa
-----------------------------------------------------------------------
Total throughput (KB/s) 7,153,359 10,856,388 10,714,236
Avg throughput (KB/s) 238,445 361,879 357,141
elapsed time (sec) 92.61 70.50 68.87
sys time (sec) 2,193.59 1,675.32 1,613.11
-----------------------------------------------------------------------
-----------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
-----------------------------------------------------------------------
zswap compressor zstd zstd zstd
-----------------------------------------------------------------------
Total throughput (KB/s) 6,866,411 6,874,244 6,922,818
Avg throughput (KB/s) 228,880 229,141 230,760
elapsed time (sec) 96.45 89.05 87.75
sys time (sec) 2,410.72 2,150.63 2,090.86
-----------------------------------------------------------------------
usemem30 with 2M folios:
========================
zswap shrinker_enabled = N.
-----------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
-----------------------------------------------------------------------
zswap compressor deflate-iaa deflate-iaa deflate-iaa
-----------------------------------------------------------------------
Total throughput (KB/s) 7,268,929 11,312,195 10,943,491
Avg throughput (KB/s) 242,297 377,073 364,783
elapsed time (sec) 80.40 68.73 69.19
sys time (sec) 1,856.54 1,599.25 1,618.08
-----------------------------------------------------------------------
-----------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
-----------------------------------------------------------------------
zswap compressor zstd zstd zstd
-----------------------------------------------------------------------
Total throughput (KB/s) 7,560,441 7,627,155 7,600,588
Avg throughput (KB/s) 252,014 254,238 253,352
elapsed time (sec) 88.89 83.22 87.55
sys time (sec) 2,132.05 1,952.98 2,079.26
-----------------------------------------------------------------------
kernel_compilation with 64K folios:
===================================
zswap shrinker_enabled = Y.
--------------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
--------------------------------------------------------------------------
zswap compressor deflate-iaa deflate-iaa deflate-iaa
--------------------------------------------------------------------------
real_sec 901.81 840.60 895.94
sys_sec 2,672.93 2,171.17 2,584.04
zswpout 34,700,692 24,076,095 37,725,671
zswap_written_back_pages 2,612,474 1,451,961 3,050,557
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
--------------------------------------------------------------------------
zswap compressor zstd zstd zstd
--------------------------------------------------------------------------
real_sec 882.67 837.21 872.98
sys_sec 3,573.31 2,593.94 3,301.67
zswpout 42,768,967 22,660,215 36,810,396
zswap_written_back_pages 2,109,739 727,919 1,475,480
--------------------------------------------------------------------------
kernel_compilation with PMD folios:
===================================
zswap shrinker_enabled = Y.
--------------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
--------------------------------------------------------------------------
zswap compressor deflate-iaa deflate-iaa deflate-iaa
--------------------------------------------------------------------------
real_sec 838.76 804.83 826.05
sys_sec 3,173.57 2,422.63 3,128.11
zswpout 59,544,198 38,093,995 60,072,119
zswap_written_back_pages 2,726,367 929,614 2,324,707
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mm-unstable-7-30-2025 v11 v12
--------------------------------------------------------------------------
zswap compressor zstd zstd zstd
--------------------------------------------------------------------------
real_sec 831.09 813.40 827.84
sys_sec 4,251.11 3,053.95 4,406.65
zswpout 59,452,638 35,832,407 63,459,471
zswap_written_back_pages 1,041,721 423,334 1,162,913
--------------------------------------------------------------------------
I am still in the process of verifying if modifying zswap_decompress() to use
the per-CPU SG lists improves kernel_compilation, but thought this would be a
good sync point to get your thoughts.
I would greatly appreciate your comments on the approach and trade-offs, and
guidance on how to proceed.
"v12" zswap.c diff wrt v11:
===========================
diff --git a/mm/zswap.c b/mm/zswap.c
index c30c1f325f57..58ad257e87e8 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -152,6 +152,8 @@ struct crypto_acomp_ctx {
struct acomp_req *req;
struct crypto_wait wait;
u8 **buffers;
+ struct sg_table *sg_inputs;
+ struct sg_table *sg_outputs;
struct mutex mutex;
bool is_sleepable;
};
@@ -282,6 +284,16 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
kfree(acomp_ctx->buffers[i]);
kfree(acomp_ctx->buffers);
}
+
+ if (acomp_ctx->sg_inputs) {
+ sg_free_table(acomp_ctx->sg_inputs);
+ acomp_ctx->sg_inputs = NULL;
+ }
+
+ if (acomp_ctx->sg_outputs) {
+ sg_free_table(acomp_ctx->sg_outputs);
+ acomp_ctx->sg_outputs = NULL;
+ }
}
static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
@@ -922,6 +934,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
{
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+ int cpu_node = cpu_to_node(cpu);
int ret = -ENOMEM;
u8 i;
@@ -936,7 +949,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
return 0;
- acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+ acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_node);
if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
pr_err("could not alloc crypto acomp %s : %ld\n",
pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
@@ -960,13 +973,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
crypto_acomp_batch_size(acomp_ctx->acomp));
acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8 *),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_node);
if (!acomp_ctx->buffers)
goto fail;
for (i = 0; i < pool->compr_batch_size; ++i) {
acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_node);
if (!acomp_ctx->buffers[i])
goto fail;
}
@@ -981,6 +994,26 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
crypto_req_done, &acomp_ctx->wait);
+ acomp_request_set_unit_size(acomp_ctx->req, PAGE_SIZE);
+
+ acomp_ctx->sg_inputs = kmalloc_node(sizeof(*acomp_ctx->sg_inputs),
+ GFP_KERNEL, cpu_node);
+ if (!acomp_ctx->sg_inputs)
+ goto fail;
+
+ if (sg_alloc_table_node(&acomp_ctx->sg_inputs, pool->compr_batch_size,
+ GFP_KERNEL, cpu_node))
+ goto fail;
+
+ acomp_ctx->sg_outputs = kmalloc_node(sizeof(*acomp_ctx->sg_outputs),
+ GFP_KERNEL, cpu_node);
+ if (!acomp_ctx->sg_outputs)
+ goto fail;
+
+ if (sg_alloc_table_node(&acomp_ctx->sg_outputs, pool->compr_batch_size,
+ GFP_KERNEL, cpu_node))
+ goto fail;
+
mutex_init(&acomp_ctx->mutex);
return 0;
@@ -1027,17 +1060,14 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
struct zswap_entry *entries[], struct zswap_pool *pool,
int node_id)
{
+ unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
+ unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
struct crypto_acomp_ctx *acomp_ctx;
- struct scatterlist input, output;
struct zpool *zpool = pool->zpool;
-
- unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
- int errors[ZSWAP_MAX_BATCH_SIZE];
-
- unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
- unsigned int i, j;
- int err;
+ struct scatterlist *sg;
+ unsigned int i, j, k;
gfp_t gfp;
+ int err;
gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
@@ -1045,59 +1075,58 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
mutex_lock(&acomp_ctx->mutex);
+ prefetchw(acomp_ctx->sg_inputs->sgl);
+ prefetchw(acomp_ctx->sg_outputs->sgl);
+
/*
* Note:
* [i] refers to the incoming batch space and is used to
- * index into the folio pages, @entries and @errors.
+ * index into the folio pages and @entries.
+ *
+ * [k] refers to the @acomp_ctx space, as determined by
+ * @pool->compr_batch_size, and is used to index into
+ * @acomp_ctx->buffers and @dlens.
*/
for (i = 0; i < nr_pages; i += nr_comps) {
- if (nr_comps == 1) {
- sg_init_table(&input, 1);
- sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE, 0);
+ for_each_sg(acomp_ctx->sg_inputs->sgl, sg, nr_comps, k)
+ sg_set_page(sg, folio_page(folio, start + k + i), PAGE_SIZE, 0);
- /*
- * We need PAGE_SIZE * 2 here since there maybe over-compression case,
- * and hardware-accelerators may won't check the dst buffer size, so
- * giving the dst buffer with enough length to avoid buffer overflow.
- */
- sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
- acomp_request_set_params(acomp_ctx->req, &input,
- &output, PAGE_SIZE, PAGE_SIZE);
-
- errors[i] = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
- &acomp_ctx->wait);
- if (unlikely(errors[i]))
- goto compress_error;
-
- dlens[i] = acomp_ctx->req->dlen;
- } else {
- struct page *pages[ZSWAP_MAX_BATCH_SIZE];
- unsigned int k;
-
- for (k = 0; k < nr_pages; ++k)
- pages[k] = folio_page(folio, start + k);
-
- struct swap_batch_comp_data batch_comp_data = {
- .pages = pages,
- .dsts = acomp_ctx->buffers,
- .dlens = dlens,
- .errors = errors,
- .nr_comps = nr_pages,
- };
-
- acomp_ctx->req->kernel_data = &batch_comp_data;
-
- if (unlikely(crypto_acomp_compress(acomp_ctx->req)))
- goto compress_error;
+ /*
+ * We need PAGE_SIZE * 2 here since there maybe over-compression case,
+ * and hardware-accelerators may won't check the dst buffer size, so
+ * giving the dst buffer with enough length to avoid buffer overflow.
+ */
+ for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
+ sg_set_buf(sg, acomp_ctx->buffers[k], PAGE_SIZE * 2);
+
+ acomp_request_set_params(acomp_ctx->req,
+ acomp_ctx->sg_inputs->sgl,
+ acomp_ctx->sg_outputs->sgl,
+ nr_comps * PAGE_SIZE,
+ nr_comps * PAGE_SIZE);
+
+ err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
+ &acomp_ctx->wait);
+
+ if (unlikely(err)) {
+ if (nr_comps == 1)
+ dlens[0] = err;
+ goto compress_error;
}
+ if (nr_comps == 1)
+ dlens[0] = acomp_ctx->req->dlen;
+ else
+ for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
+ dlens[k] = sg->length;
+
/*
* All @nr_comps pages were successfully compressed.
* Store the pages in zpool.
*
* Note:
* [j] refers to the incoming batch space and is used to
- * index into the folio pages, @entries, @dlens and @errors.
+ * index into the folio pages and @entries.
* [k] refers to the @acomp_ctx space, as determined by
* @pool->compr_batch_size, and is used to index into
* @acomp_ctx->buffers.
@@ -1113,7 +1142,7 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
* non-batching software compressors.
*/
prefetchw(entries[j]);
- err = zpool_malloc(zpool, dlens[j], gfp, &handle, node_id);
+ err = zpool_malloc(zpool, dlens[k], gfp, &handle, node_id);
if (unlikely(err)) {
if (err == -ENOSPC)
@@ -1124,9 +1153,9 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
goto err_unlock;
}
- zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], dlens[j]);
+ zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], dlens[k]);
entries[j]->handle = handle;
- entries[j]->length = dlens[j];
+ entries[j]->length = dlens[k];
}
} /* finished compress and store nr_pages. */
@@ -1134,9 +1163,9 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
return true;
compress_error:
- for (j = i; j < i + nr_comps; ++j) {
- if (errors[j]) {
- if (errors[j] == -ENOSPC)
+ for (k = 0; k < nr_comps; ++k) {
+ if (dlens[k] < 0) {
+ if (dlens[k] == -ENOSPC)
zswap_reject_compress_poor++;
else
zswap_reject_compress_fail++;
Thanks,
Kanchana
>
> Cheers,
> --
> Email: Herbert Xu <herbert@...dor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Powered by blists - more mailing lists