Message-ID: <CAKEwX=ONeMBRwr+4mJt76+zWZ4dXL+LCEAMELYeT6Nx-hej2-g@mail.gmail.com>
Date: Tue, 3 Feb 2026 16:30:48 -0800
From: Nhat Pham <nphamcs@...il.com>
To: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, hannes@...xchg.org,
yosry.ahmed@...ux.dev, chengming.zhou@...ux.dev, usamaarif642@...il.com,
ryan.roberts@....com, 21cnbao@...il.com, ying.huang@...ux.alibaba.com,
akpm@...ux-foundation.org, senozhatsky@...omium.org, sj@...nel.org,
kasong@...cent.com, linux-crypto@...r.kernel.org, herbert@...dor.apana.org.au,
davem@...emloft.net, clabbe@...libre.com, ardb@...nel.org,
ebiggers@...gle.com, surenb@...gle.com, kristen.c.accardi@...el.com,
vinicius.gomes@...el.com, giovanni.cabiddu@...el.com,
wajdi.k.feghali@...el.com
Subject: Re: [PATCH v14 26/26] mm: zswap: Batched zswap_compress() for
compress batching of large folios.

On Sat, Jan 24, 2026 at 7:36 PM Kanchana P Sridhar
<kanchana.p.sridhar@...el.com> wrote:
>
> We introduce a new batching implementation of zswap_compress() for
> compressors that do and do not support batching. This eliminates code
> duplication and facilitates code maintainability with the introduction
> of compress batching.
>
> The earlier approach of calling zswap_compress() sequentially, one page
> at a time from zswap_store_pages(), is replaced with this new version of
> zswap_compress() that accepts multiple pages to compress as a batch.
>
> If the compressor does not support batching, each page in the batch is
> compressed and stored sequentially. If the compressor supports batching,
> e.g., 'deflate-iaa' backed by the Intel IAA hardware accelerator, the
> batch is compressed in parallel in hardware.
>
> If the batch is compressed without errors, the compressed buffers for
> the batch are stored in zsmalloc. In case of compression errors, the
> current behavior, which depends on whether zswap writeback is enabled
> for the folio, is preserved.
>
> The batched zswap_compress() incorporates Herbert's suggestion to use
> SG lists to represent the batch's inputs/outputs when interfacing with
> the crypto API [1].
>
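IIUC, the per-batch interaction with the crypto API in the new
zswap_compress() boils down to roughly the following (my condensed
paraphrase of the hunk further down; error handling, the sg->length
setup for non-batching compressors, and the zsmalloc stores are
elided):

	for (batch_start = 0; batch_start < nr_pages;
	     batch_start += nr_batch_pages) {
		/* Source pages are submitted directly from the folio. */
		acomp_request_set_src_folio(acomp_ctx->req, folio,
					    (folio_start + batch_start) * PAGE_SIZE,
					    nr_batch_pages * PAGE_SIZE);

		/*
		 * The dst SG list was statically mapped to the per-CPU
		 * buffers at pool creation time.
		 */
		acomp_request_set_dst_sg(acomp_ctx->req,
					 acomp_ctx->sg_table->sgl,
					 nr_batch_pages * PAGE_SIZE);

		err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
				      &acomp_ctx->wait);

		/*
		 * Each output sg->length now holds either a compressed
		 * size or a negative error value.
		 */
	}
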
> Performance data:
> =================
> As suggested by Barry, this is the performance data gathered on Intel
> Sapphire Rapids with two workloads:
>
> 1) 30 usemem processes in a 150 GB memory-limited cgroup, each
>    allocating 10G, i.e., effectively running at 50% memory pressure.
> 2) kernel_compilation "defconfig", 32 threads, cgroup memory limit set
>    to 1.7 GiB (50% memory pressure, since baseline memory usage is 3.4
>    GiB); data averaged across 10 runs.
>
> To keep comparisons simple, all testing was done without the
> zswap shrinker.
>
> =========================================================================
>  IAA                       mm-unstable-1-23-2026      v14
> =========================================================================
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                                   vs.
>                                                             IAA Sequential
> =========================================================================
>  usemem30, 64K folios:
>
>  Total throughput (KB/s)      6,226,967        10,551,714        69%
>  Average throughput (KB/s)      207,565           351,723        69%
>  elapsed time (sec)               99.19             67.45       -32%
>  sys time (sec)                2,356.19          1,580.47       -33%
>
>  usemem30, PMD folios:
>
>  Total throughput (KB/s)      6,347,201        11,315,500        78%
>  Average throughput (KB/s)      211,573           377,183        78%
>  elapsed time (sec)               88.14             63.37       -28%
>  sys time (sec)                2,025.53          1,455.23       -28%
>
>  kernel_compilation, 64K folios:
>
>  elapsed time (sec)              100.10             98.74      -1.4%
>  sys time (sec)                  308.72            301.23        -2%
>
>  kernel_compilation, PMD folios:
>
>  elapsed time (sec)               95.29             93.44      -1.9%
>  sys time (sec)                  346.21            344.48      -0.5%
> =========================================================================
>
> =========================================================================
>  ZSTD                      mm-unstable-1-23-2026      v14
> =========================================================================
>  zswap compressor                 zstd              zstd        v14 ZSTD
>                                                                Improvement
> =========================================================================
>  usemem30, 64K folios:
>
>  Total throughput (KB/s)      6,032,326         6,047,448       0.3%
>  Average throughput (KB/s)      201,077           201,581       0.3%
>  elapsed time (sec)               97.52             95.33      -2.2%
>  sys time (sec)                2,415.40          2,328.38        -4%
>
>  usemem30, PMD folios:
>
>  Total throughput (KB/s)      6,570,404         6,623,962       0.8%
>  Average throughput (KB/s)      219,013           220,798       0.8%
>  elapsed time (sec)               89.17             88.25        -1%
>  sys time (sec)                2,126.69          2,043.08        -4%
>
>  kernel_compilation, 64K folios:
>
>  elapsed time (sec)              100.89             99.98      -0.9%
>  sys time (sec)                  417.49            414.62      -0.7%
>
>  kernel_compilation, PMD folios:
>
>  elapsed time (sec)               98.26             97.38      -0.9%
>  sys time (sec)                  487.14            473.16      -2.9%
> =========================================================================
>
> Architectural considerations for the zswap batching framework:
> ==============================================================
> We have designed the zswap batching framework to be
> hardware-agnostic. It has no dependencies on Intel-specific features and
> can be leveraged by any hardware accelerator or software-based
> compressor. In other words, the framework is open and inclusive by
> design.
>
> Potential future clients of the batching framework:
> ===================================================
> This patch series demonstrates the performance benefits of compression
> batching when used in zswap_store() of large folios. Compression
> batching can be used for other use cases such as batching compression in
> zram, batch compression of different folios during reclaim, kcompressd,
> file systems, etc. Decompression batching can be used to improve
> efficiency of zswap writeback (Thanks Nhat for this idea), batching
> decompressions in zram, etc.
>
> Experiments with kernel_compilation "allmodconfig" that combine zswap
> compress batching, folio reclaim batching, and writeback batching show
> that 0 pages are written back with deflate-iaa and zstd. For comparison,
> the baselines for these compressors see 200K-800K pages written to disk.
> Reclaim batching relieves memory pressure faster than reclaiming one
> folio at a time, and hence alleviates the need to scan slab memory for
> writeback.
>
> [1]: https://lore.kernel.org/all/aJ7Fk6RpNc815Ivd@gondor.apana.org.au/T/#m99aea2ce3d284e6c5a3253061d97b08c4752a798
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
> ---
> mm/zswap.c | 260 ++++++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 190 insertions(+), 70 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6a22add63220..399112af2c54 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -145,6 +145,7 @@ struct crypto_acomp_ctx {
> struct acomp_req *req;
> struct crypto_wait wait;
> u8 **buffers;
> + struct sg_table *sg_table;
> struct mutex mutex;
> };
>
> @@ -272,6 +273,11 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
> kfree(acomp_ctx->buffers[i]);
> kfree(acomp_ctx->buffers);
> }
> +
> + if (acomp_ctx->sg_table) {
> + sg_free_table(acomp_ctx->sg_table);
> + kfree(acomp_ctx->sg_table);
> + }
> }
>
> static struct zswap_pool *zswap_pool_create(char *compressor)
> @@ -834,6 +840,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> int nid = cpu_to_node(cpu);
> + struct scatterlist *sg;
> int ret = -ENOMEM;
> u8 i;
>
> @@ -880,6 +887,22 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> goto fail;
> }
>
> + acomp_ctx->sg_table = kmalloc(sizeof(*acomp_ctx->sg_table),
> + GFP_KERNEL);
> + if (!acomp_ctx->sg_table)
> + goto fail;
> +
> + if (sg_alloc_table(acomp_ctx->sg_table, pool->compr_batch_size,
> + GFP_KERNEL))
> + goto fail;
> +
> + /*
> + * Statically map the per-CPU destination buffers to the per-CPU
> + * SG lists.
> + */
> + for_each_sg(acomp_ctx->sg_table->sgl, sg, pool->compr_batch_size, i)
> + sg_set_buf(sg, acomp_ctx->buffers[i], PAGE_SIZE);
> +
> /*
> * if the backend of acomp is async zip, crypto_req_done() will wakeup
> * crypto_wait_req(); if the backend of acomp is scomp, the callback
> @@ -900,84 +923,177 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> return ret;
> }
>
> -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> - struct zswap_pool *pool, bool wb_enabled)
> +/*
> + * zswap_compress() batching implementation for sequential and batching
> + * compressors.
> + *
> + * Description:
> + * ============
> + *
> + * Compress @nr_pages pages of @folio, starting from the @folio_start index,
> + * in batches of @nr_batch_pages.
> + *
> + * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE. zswap_store() makes
> + * sure of this by design and zswap_store_pages() warns if this is not true.
> + *
> + * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the compressor does not
> + * support batching.
> + *
> + * If @nr_batch_pages is 1, each page is processed sequentially.
> + *
> + * If @nr_batch_pages is > 1, compression batching is invoked within
> + * the algorithm's driver, except if @nr_pages is 1: if so, the driver can
> + * choose to call its sequential/non-batching compress routine.
Hmm, I'm a bit confused by this documentation. Why is there extra
explanation about nr_batch_pages > 1 and nr_pages == 1? That cannot
happen, no?

nr_batch_pages is already determined by the time we enter
zswap_compress() (the computation is done at its callsite, and it
already takes nr_pages into account, since it is the min of nr_pages
and the compressor batch size).

I find this batching (for store), then sub-batching (for compression),
confusing, even if I understand it's there to maintain/improve
performance for the software compressors... It makes the indices in
zswap_compress() very convoluted (see the sketch of the index math
below).

Yosry and Johannes - any thoughts on this?
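This is my paraphrase of the loop structure (variable names taken from
the patch), with hypothetical sizes - a 16-page store batch compressed
as two 8-page sub-batches:

	for (batch_start = 0; batch_start < nr_pages;	/* 0, then 8 */
	     batch_start += nr_batch_pages) {
		for_each_sg(acomp_ctx->sg_table->sgl, sg, nr_batch_pages,
			    compr_batch_size_iter) {	/* 0 .. 7 */
			/* index into this sub-batch's dst buffers */
			dst = acomp_ctx->buffers[compr_batch_size_iter];
			/* index into the store batch's entries[] */
			batch_iter = batch_start + compr_batch_size_iter;
			/* index into the folio */
			page = folio_page(folio, folio_start + batch_iter);
			...
		}
	}

That's three different index spaces (sub-batch, store batch, folio) to
keep in one's head while reading the function.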
> + *
> + * In both cases, if all compressions are successful, the compressed buffers
> + * are stored in zsmalloc.
> + *
> + * Design notes for batching compressors:
> + * ======================================
> + *
> + * Traversing SG lists when @nr_batch_pages is > 1 is expensive, and
> + * impacts batching performance if repeated:
> + * - to map destination buffers to each SG list in @acomp_ctx->sg_table.
> + * - to initialize each output @sg->length to PAGE_SIZE.
> + *
> + * Design choices made to optimize batching with SG lists:
> + *
> + * 1) The source folio pages in the batch are directly submitted to
> + * crypto_acomp via acomp_request_set_src_folio().
> + *
> + * 2) The per-CPU @acomp_ctx->sg_table scatterlists are statically mapped
> + * to the per-CPU dst @buffers at pool creation time.
> + *
> + * 3) zswap_compress() sets the output SG list length to PAGE_SIZE for
> + * non-batching compressors. The batching compressor's driver should do this
> + * as part of iterating through the dst SG lists for batch compression setup.
> + *
> + * Considerations for non-batching and batching compressors:
> + * =========================================================
> + *
> + * For each output SG list in @acomp_ctx->req->sg_table->sgl, the @sg->length
> + * should be set to either the page's compressed length (success), or its
> + * compression error value.
> + */
> +static bool zswap_compress(struct folio *folio,
> + long folio_start,
> + u8 nr_pages,
> + u8 nr_batch_pages,
> + struct zswap_entry *entries[],
> + struct zs_pool *zs_pool,
> + struct crypto_acomp_ctx *acomp_ctx,
> + int nid,
> + bool wb_enabled)
> {
> - struct crypto_acomp_ctx *acomp_ctx;
> - struct scatterlist input, output;
> - int comp_ret = 0, alloc_ret = 0;
> - unsigned int dlen = PAGE_SIZE;
> + gfp_t gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> + unsigned int slen = nr_batch_pages * PAGE_SIZE;
> + u8 batch_start, batch_iter, compr_batch_size_iter;
> + struct scatterlist *sg;
> unsigned long handle;
> - gfp_t gfp;
> - u8 *dst;
> - bool mapped = false;
> -
> - acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> - mutex_lock(&acomp_ctx->mutex);
> -
> - dst = acomp_ctx->buffers[0];
> - sg_init_table(&input, 1);
> - sg_set_page(&input, page, PAGE_SIZE, 0);
> -
> - sg_init_one(&output, dst, PAGE_SIZE);
> - acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> + int err, dlen;
> + void *dst;
>
> /*
> - * it maybe looks a little bit silly that we send an asynchronous request,
> - * then wait for its completion synchronously. This makes the process look
> - * synchronous in fact.
> - * Theoretically, acomp supports users send multiple acomp requests in one
> - * acomp instance, then get those requests done simultaneously. but in this
> - * case, zswap actually does store and load page by page, there is no
> - * existing method to send the second page before the first page is done
> - * in one thread doing zswap.
> - * but in different threads running on different cpu, we have different
> - * acomp instance, so multiple threads can do (de)compression in parallel.
> + * Locking the acomp_ctx mutex once per store batch results in better
> + * performance as compared to locking per compress batch.
> */
> - comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> - dlen = acomp_ctx->req->dlen;
> + mutex_lock(&acomp_ctx->mutex);
>
> /*
> - * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> - * save the content as is without a compression, to keep the LRU order
> - * of writebacks. If writeback is disabled, reject the page since it
> - * only adds metadata overhead. swap_writeout() will put the page back
> - * to the active LRU list in the case.
> + * Compress the @nr_pages in @folio starting at index @folio_start
> + * in batches of @nr_batch_pages.
> */
> - if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
> - if (!wb_enabled) {
> - comp_ret = comp_ret ? comp_ret : -EINVAL;
> - goto unlock;
> - }
> - comp_ret = 0;
> - dlen = PAGE_SIZE;
> - dst = kmap_local_page(page);
> - mapped = true;
> - }
> + for (batch_start = 0; batch_start < nr_pages;
> + batch_start += nr_batch_pages) {
> + /*
> + * Send @nr_batch_pages to crypto_acomp for compression:
> + *
> + * These pages are in @folio's range of indices in the interval
> + * [@folio_start + @batch_start,
> + * @folio_start + @batch_start + @nr_batch_pages).
> + *
> + * @slen indicates the total source length bytes for @nr_batch_pages.
> + *
> + * The pool's compressor batch size is at least @nr_batch_pages,
> + * hence the acomp_ctx has at least @nr_batch_pages dst @buffers.
> + */
> + acomp_request_set_src_folio(acomp_ctx->req, folio,
> + (folio_start + batch_start) * PAGE_SIZE,
> + slen);
> +
> + acomp_ctx->sg_table->sgl->length = slen;
> +
> + acomp_request_set_dst_sg(acomp_ctx->req,
> + acomp_ctx->sg_table->sgl,
> + slen);
> +
> + err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> + &acomp_ctx->wait);
> +
> + /*
> + * If a page cannot be compressed into a size smaller than
> + * PAGE_SIZE, save the content as is without a compression, to
> + * keep the LRU order of writebacks. If writeback is disabled,
> + * reject the page since it only adds metadata overhead.
> + * swap_writeout() will put the page back to the active LRU list
> + * in the case.
> + *
> + * It is assumed that any compressor that sets the output length
> + * to 0 or a value >= PAGE_SIZE will also return a negative
> + * error status in @err; i.e, will not return a successful
> + * compression status in @err in this case.
> + */
> + if (unlikely(err && !wb_enabled))
> + goto compress_error;
> +
> + for_each_sg(acomp_ctx->sg_table->sgl, sg, nr_batch_pages,
> + compr_batch_size_iter) {
> + batch_iter = batch_start + compr_batch_size_iter;
> + dst = acomp_ctx->buffers[compr_batch_size_iter];
> + dlen = sg->length;
> +
> + if (dlen < 0) {
> + dlen = PAGE_SIZE;
> + dst = kmap_local_page(folio_page(folio,
> + folio_start + batch_iter));
> + }
> +
> + handle = zs_malloc(zs_pool, dlen, gfp, nid);
> +
> + if (unlikely(IS_ERR_VALUE(handle))) {
> + if (PTR_ERR((void *)handle) == -ENOSPC)
> + zswap_reject_compress_poor++;
> + else
> + zswap_reject_alloc_fail++;
>
> - gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> - handle = zs_malloc(pool->zs_pool, dlen, gfp, page_to_nid(page));
> - if (IS_ERR_VALUE(handle)) {
> - alloc_ret = PTR_ERR((void *)handle);
> - goto unlock;
> + goto err_unlock;
> + }
> +
> + zs_obj_write(zs_pool, handle, dst, dlen);
> + entries[batch_iter]->handle = handle;
> + entries[batch_iter]->length = dlen;
> + if (dst != acomp_ctx->buffers[compr_batch_size_iter])
> + kunmap_local(dst);
> + }
> }
>
> - zs_obj_write(pool->zs_pool, handle, dst, dlen);
> - entry->handle = handle;
> - entry->length = dlen;
> + mutex_unlock(&acomp_ctx->mutex);
> + return true;
>
> -unlock:
> - if (mapped)
> - kunmap_local(dst);
> - if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> - zswap_reject_compress_poor++;
> - else if (comp_ret)
> - zswap_reject_compress_fail++;
> - else if (alloc_ret)
> - zswap_reject_alloc_fail++;
> +compress_error:
> + for_each_sg(acomp_ctx->sg_table->sgl, sg, nr_batch_pages,
> + compr_batch_size_iter) {
> + if ((int)sg->length < 0) {
> + if ((int)sg->length == -ENOSPC)
> + zswap_reject_compress_poor++;
> + else
> + zswap_reject_compress_fail++;
> + }
> + }
>
> +err_unlock:
> mutex_unlock(&acomp_ctx->mutex);
> - return comp_ret == 0 && alloc_ret == 0;
> + return false;
> }
>
> static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> @@ -1499,12 +1615,16 @@ static bool zswap_store_pages(struct folio *folio,
> INIT_LIST_HEAD(&entries[i]->lru);
> }
>
> - for (i = 0; i < nr_pages; ++i) {
> - struct page *page = folio_page(folio, start + i);
> -
> - if (!zswap_compress(page, entries[i], pool, wb_enabled))
> - goto store_pages_failed;
> - }
> + if (unlikely(!zswap_compress(folio,
> + start,
> + nr_pages,
> + min(nr_pages, pool->compr_batch_size),
Hmm, this is a bit confusing. There seem to be multiple kinds of
"batch size". Am I understanding this correctly:

zswap_store(folio)
  -> zswap_store_pages() - handle a batch of nr_pages from start to
     end (exclusive)
    -> zswap_compress() - compress in sub-batches of
       min(compr_batch_size, nr_pages) pages

where:
* compr_batch_size is the batch size prescribed by the compressor (1
  for zstd, potentially more for IAA).
* nr_pages is the "store batch size", which can be more than 1, even
  for zstd (to take advantage of cache locality in zswap_store_pages()).
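Spelling that out with hypothetical numbers for an 8-page store batch:

	/*
	 * nr_batch_pages = min(nr_pages, pool->compr_batch_size)
	 *
	 * zstd:         compr_batch_size == 1
	 *               -> zswap_compress(..., nr_pages = 8, nr_batch_pages = 1, ...)
	 *               -> 8 sequential compress calls under one mutex hold
	 *
	 * deflate-iaa:  compr_batch_size == 8 (say)
	 *               -> zswap_compress(..., nr_pages = 8, nr_batch_pages = 8, ...)
	 *               -> 1 batched compress call
	 */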
> + entries,
> + pool->zs_pool,
> + acomp_ctx,
> + nid,
> + wb_enabled)))
> + goto store_pages_failed;
>
> for (i = 0; i < nr_pages; ++i) {
> struct zswap_entry *old, *entry = entries[i];
> --
> 2.27.0
>
The rest looks OK to me, but 80% of this patch is using the new crypto
API, so I'll wait for Herbert's Ack on the first half of the patch
series :)