Message-Id: <20251104091235.8793-1-kanchana.p.sridhar@intel.com>
Date: Tue,  4 Nov 2025 01:12:13 -0800
From: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
To: linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	hannes@...xchg.org,
	yosry.ahmed@...ux.dev,
	nphamcs@...il.com,
	chengming.zhou@...ux.dev,
	usamaarif642@...il.com,
	ryan.roberts@....com,
	21cnbao@...il.com,
	ying.huang@...ux.alibaba.com,
	akpm@...ux-foundation.org,
	senozhatsky@...omium.org,
	sj@...nel.org,
	kasong@...cent.com,
	linux-crypto@...r.kernel.org,
	herbert@...dor.apana.org.au,
	davem@...emloft.net,
	clabbe@...libre.com,
	ardb@...nel.org,
	ebiggers@...gle.com,
	surenb@...gle.com,
	kristen.c.accardi@...el.com,
	vinicius.gomes@...el.com
Cc: wajdi.k.feghali@...el.com,
	vinodh.gopal@...el.com,
	kanchana.p.sridhar@...el.com
Subject: [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver

v13: zswap compression batching with optimized iaa_crypto driver
================================================================
This updated patch-series further generalizes the batching implementation of
zswap_compress() for non-batching and batching compressors. It also ensures that
the bulk allocation of zswap entries preserves the current behavior of adding
each entry to the LRU list of the page's NUMA node (nid).

Based on Herbert's suggestions, the batching interfaces from zswap to crypto,
from crypto to iaa_crypto, and the batching implementation within iaa_crypto now
use the folio directly as the source (pages are retrieved with sg_page_iter) and
SG lists as the destination. A unit_size has been added to struct acomp_req, and
kernel users such as zswap use the new acomp_request_set_unit_size() API to set
the unit size used when breaking down the request's src/dst scatterlists. zswap
sets the unit size to PAGE_SIZE.
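
As a minimal sketch (not the actual zswap code; req, acomp_ctx and folio stand
for zswap's per-CPU request/context and the folio being stored, and the per-unit
dst-length convention shown here is an assumption), a kernel user would describe
such a batch roughly as follows, using the new acomp_request_set_unit_size()
API from patch 12 plus existing scatterlist helpers:

    struct scatterlist src_sg;
    int err;

    /* One SG entry covering the whole (physically contiguous) folio. */
    sg_init_table(&src_sg, 1);
    sg_set_folio(&src_sg, folio, folio_size(folio), 0);

    /* dst is the per-CPU chain of output SG lists described later in this
     * cover letter (acomp_ctx->sg_outputs). */
    acomp_request_set_params(req, &src_sg, acomp_ctx->sg_outputs->sgl,
                             folio_size(folio), PAGE_SIZE);

    /* New API from patch 12: split src/dst into PAGE_SIZE units. */
    acomp_request_set_unit_size(req, PAGE_SIZE);

    err = crypto_wait_req(crypto_acomp_compress(req), &acomp_ctx->wait);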

Following Andrew's suggestion, the next two sections emphasize generality and
alignment with current kernel efforts.

Architectural considerations for the zswap batching framework:
==============================================================
We have designed the zswap batching framework to be hardware-agnostic. It has no
dependencies on Intel-specific features and can be leveraged by any hardware
accelerator or software-based compressor. In other words, the framework is open
and inclusive by design.

Other ongoing work that can use batching:
=========================================
This patch-series demonstrates the performance benefits of compress
batching when used in zswap_store() of large folios. shrink_folio_list()
"reclaim batching" of any-order folios is the next major work that will
use this zswap compress batching framework: our kernel_compilation
testing with writeback and the zswap shrinker indicates that 10X fewer
pages get written back when we reclaim 32 folios as a batch, as compared
to one folio at a time, with both deflate-iaa and zstd. We expect to
submit a patch-series with this data and the resulting performance
improvements shortly. Reclaim batching relieves memory pressure faster
than reclaiming one folio at a time, and hence reduces the need to scan
slab memory for writeback.

Many thanks to Nhat for suggesting ideas on using batching with the
ongoing kcompressd work, as well as on beneficially using decompression
batching & block IO batching to improve zswap writeback efficiency.

Experiments with kernel compilation benchmark (allmod config) that
combine zswap compress batching, reclaim batching, swapin_readahead()
decompression batching of prefetched pages, and writeback batching show
that 0 pages are written back to disk with deflate-iaa and zstd. For
comparison, the baselines for these compressors see 200K-800K pages
written to disk.

To summarize, these are future clients of the batching framework:

   - shrink_folio_list() reclaim batching of multiple folios:
       Implemented, will submit patch-series.
   - zswap writeback with decompress batching:
       Implemented, will submit patch-series.
   - zram:
       Implemented, will submit patch-series.
   - kcompressd:
       Not yet implemented.
   - file systems:
       Not yet implemented.
   - swapin_readahead() decompression batching of prefetched pages:
       Implemented, will submit patch-series.


iaa_crypto Driver Rearchitecting and Optimizations:
===================================================

The most significant highlight of v13 is a new, lightweight and highly
optimized iaa_crypto driver, resulting directly in the latency and
throughput improvements noted later in this cover letter.

 1) Better stability and more functional versatility, to support zswap
    with better performance on different Intel platforms.

    a) Patches 0002, 0005 and 0011 together resolve a race condition in
       mainline v6.15, reported from internal validation, when IAA
       wqs/devices are disabled while workloads are using IAA.

    b) Patch 0002 introduces a new architecture for mapping cores to
       IAAs based on packages instead of NUMA nodes, and generalizes
       how WQs are used: as package-level shared resources for all
       same-package cores (the default for compress WQs), or dedicated to
       mapped cores (the default for decompress WQs). Further, users can
       configure multiple WQs and specify how many of those are for
       compress jobs only vs. decompress jobs only. sysfs iaa_crypto
       driver parameters can be used to change the default settings for
       performance tuning.

    c) idxd descriptor allocation moved from blocking to non-blocking
       with retry limits and mitigations if limits are exceeded.

    d) Code cleanup for readability and clearer code flow.

    e) Fixes IAA re-registration errors, present in mainline v6.15, seen
       upon disabling/enabling IAA wqs and devices.

    f) Addition of a layer that encapsulates iaa_crypto's core functionality to
       rely only on idxd, dma and scatterlists to provide clean interfaces to
       crypto_acomp.

    g) New Dynamic compression mode for Granite Rapids that yields a
       better compression ratio; it is selected by echoing
       'deflate-iaa-dynamic' as the zswap compressor.

    h) New crypto_acomp API crypto_acomp_batch_size() that returns the
       driver's max batch size if the driver has registered a batch_size
       greater than 1; or 1 if there is no driver-specific definition of
       batch_size.

       Accordingly, iaa_crypto sets the acomp_alg batch_size to its internal
       IAA_CRYPTO_MAX_BATCH_SIZE for fixed and dynamic modes.

 2) Performance optimizations (please refer to the latency data per
    optimization in the commit logs):

    a) Distributing [de]compress jobs in a round-robin manner to the
       available IAAs on the package.

    b) Replacing the compute-intensive iaa_wq_get()/iaa_wq_put() with a
       percpu_ref in struct iaa_wq, thereby eliminating spinlock
       acquisition in the fast path, while using a combination of the
       iaa_crypto_enabled atomic and spinlocks in the slow path to
       ensure the compress/decompress code sees a consistent state of the
       wq tables (a sketch of this pattern follows this list).
       
    c) Directly call movdir64b for non-irq use cases, i.e., the most
       common usage. This avoids the overhead of irq-specific computes in
       idxd_submit_desc() and reduces latency.

    d) Batching of compressions/decompressions using async submit-poll
       mechanism to derive the benefits of hardware parallelism.

    e) Batching compressors now manage their own "requests" abstraction,
       removing this driver-specific aspect from kernel users such as
       zswap. iaa_crypto maintains per-CPU "struct iaa_req **reqs" to
       submit multiple jobs to the hardware accelerator to run in
       parallel.

    f) Modifies the iaa_crypto batching API and its implementation to expect a
       src SG list that contains the batch's pages, and a dst SG list that has
       multiple scatterlists for the batch's output buffers.

    g) Submit the two largest data buffers first for decompression
       batching, so that the longest running jobs get a head start,
       reducing latency for the batch.

 3)  Compress/decompress batching is implemented using SG lists as the
     batching interface.
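
For reference, the sketch below shows the percpu_ref fast-path/slow-path
pattern referred to in 2b above (hypothetical helper and member names, not the
driver's actual code; percpu_ref itself is an existing kernel facility):

    /* Fast path: lockless per-CPU reference grab around each job. */
    static bool iaa_wq_get_ref(struct iaa_wq *iaa_wq)
    {
            return percpu_ref_tryget_live(&iaa_wq->ref);
    }

    static void iaa_wq_put_ref(struct iaa_wq *iaa_wq)
    {
            percpu_ref_put(&iaa_wq->ref);
    }

    /* Slow path (wq disable): kill the ref so that new tryget_live calls
     * fail and the release callback runs once in-flight users drop their
     * refs; spinlocks plus the iaa_crypto_enabled atomic keep the wq
     * tables consistent, per 2b above. */
    static void iaa_wq_disable_refs(struct iaa_wq *iaa_wq)
    {
            percpu_ref_kill(&iaa_wq->ref);
    }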


Main Changes in Zswap Compression Batching:
===========================================

 Note to zswap maintainers:
 --------------------------
 Patches 19 and 20 can be reviewed and improved/merged independently
 of this series, since they are zswap-centric. These two patches help
 batching, but unlike patches 21-22, they do not require the
 crypto_acomp_batch_size() API added by the iaa_crypto commits in this
 series.
 
 1) v13 preserves the v11 simplification of the pool's acomp_ctx resource
    creation/deletion, namely that the resources last from pool creation
    to pool deletion and persist through CPU hot[un]plug operations.
    Further, zswap no longer needs to create multiple "struct acomp_req"
    in the per-CPU acomp_ctx; it only needs to manage multiple
    "u8 **buffers".

 2) We store the compressor's batch-size (@pool->compr_batch_size) directly in
    struct zswap_pool for quick retrieval in the zswap_store() fast path.

 3) Optimizations to avoid regressions in software compressors with
    the introduction of the new unified zswap_compress() framework that
    implements compression batching for all compressors. These optimizations
    help recover the performance of non-batching compressors:

    a) kmem_cache_alloc_bulk(), kmem_cache_free_bulk() to allocate/free a
       batch of zswap_entry-s. These kmem_cache APIs let the allocator
       optimize multiple allocations under its internal locks (see the
       sketch after this list).

    b) The page's nid is stored in a new nid field added to zswap_entry, so
       that zswap_lru_add()/zswap_lru_del() add/delete the entry from the
       LRU list of the page's nid. This preserves the current behavior with
       respect to the shrinker.

    c) Write to the zswap_entry right after it is allocated, without
       modifying the publishing order. This avoids different code blocks
       in zswap_store_pages() having to bring the zswap_entries into the
       cache for writing, which could evict other working-set structures
       and impact performance.

    d) ZSWAP_MAX_BATCH_SIZE is used as the batch-size for software
       compressors, since this gives the best performance with zstd.

    e) Minimize branches in zswap_compress().

 4) During pool creation, these key additions are allocated as part of the
    per-CPU acomp_ctx so as to recover performance with the new, generalized SG
    lists based zswap_compress() batching interface:

    a) An sg_table "acomp_ctx->sg_outputs" is allocated to contain the
       compressor's batch-size number of SG lists that will contain the
       destination buffers/lengths after batch compression.

    b) The per-CPU destination buffers are mapped to the per-CPU SG lists: this
       needs to be done only once, and optimizes performance.

 5) A unified zswap_compress() API is added to compress multiple pages. Thanks
    to Nhat, Yosry and Johannes for their helpful suggestions to accomplish
    this.

 6) Finally, zswap_compress() has been re-written to incorporate Herbert's
    suggestions to use source folios and output SG lists for batching. The new
    zswap_compress() code has been made as generic to software and batching
    compressors as possible, so that it is easy to read and maintain. The
    recent changes related to PAGE_SIZE dst buffers, zsmalloc and incompressible
    pages have been incorporated into the batched zswap_compress() as well. To
    resolve regressions with zstd, I took the liberty of not explicitly checking
    for dlen == 0 and dlen > PAGE_SIZE (as in the mainline); instead,
    expecting that a negative err value will be returned by the software
    compressor in such cases.
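
 To make 3(a)/3(b) above concrete, here is a minimal sketch of bulk-allocating
 a batch of zswap entries and recording each page's nid (hypothetical helper
 name and simplified error handling, not the series' exact code;
 kmem_cache_alloc_bulk()/kmem_cache_free_bulk() are existing kernel APIs):

     /* Hypothetical helper: allocate nr entries for one batch. */
     static int zswap_alloc_entries(struct zswap_entry **entries,
                                    unsigned int nr, int nid)
     {
             unsigned int i;

             /*
              * One allocator call for the whole batch;
              * kmem_cache_alloc_bulk() returns nr on success, 0 on failure.
              */
             if (kmem_cache_alloc_bulk(zswap_entry_cache, GFP_KERNEL, nr,
                                       (void **)entries) != nr)
                     return -ENOMEM;

             for (i = 0; i < nr; i++)
                     entries[i]->nid = nid;  /* page's node, for LRU add/del */

             return 0;
     }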


Compression Batching:
=====================

This patch-series introduces batch compression of pages in large folios to
improve zswap swapout latency. It preserves the existing zswap protocols
for non-batching software compressors by calling crypto_acomp sequentially
per page in the batch. Additionally, in support of hardware accelerators
that can process a batch as an integral unit, the patch-series allows
zswap to call crypto_acomp without API changes for compressors that
intrinsically support batching. The zswap_compress() code has very minimal
special-casing for software vs. batching compressors.

The patch series provides a proof point by using the Intel Analytics
Accelerator (IAA) for implementing the compress/decompress batching API
using hardware parallelism in the iaa_crypto driver and another proof point
with a sequential software compressor, zstd.

SUMMARY:
========

  The first proof point is to test with IAA using a sequential call (fully
  synchronous, compress one page at a time) vs. a batching call (fully
  asynchronous, submit a batch to IAA for parallel compression, then poll for
  completion statuses).
  
    The performance testing data with 30 usemem processes/64K folios
    shows 62% throughput gains and 28% elapsed/sys time reductions with
    deflate-iaa; and 5% sys time reduction with zstd for a small
    throughput increase. For PMD folios, a 67% throughput gain and 23%
    elapsed/sys time reduction is seen.

    Kernel compilation test with 64K folios using 32 threads and the
    zswap shrinker_enabled set to "N", demonstrates similar
    improvements: zswap_store() large folios using IAA compress batching
    improves the workload performance by 3.5% and reduces sys time by
    6% as compared to IAA sequential. For zstd, compress batching
    improves workload performance by 3.4% and reduces sys time by
    1.8% as compared to sequentially calling zswap_compress() per page
    in a folio.

    The main takeaway from usemem, a workload that is mostly compression
    dominated (very few swapins), is that the larger the number of batches,
    such as with larger folios, the greater the benefit from amortizing the
    cost of batching, as shown by the PMD usemem data. This aligns well with
    the future direction for batching.

  The second proof point is to make sure that software algorithms such as
  zstd do not regress. The data indicates that for sequential software
  algorithms a performance gain is achieved. 
  
    With the performance optimizations implemented in patches 21-22 of v13:
    
    *  zstd usemem metrics with 64K folios are within range of variation
       with a slight sys time improvement. zstd usemem30 workload performance
       with PMD folios improves by 6% and sys time reduces by 8%, for comparable
       throughput as the baseline.

    *  With kernel compilation, I used zstd without the zswap shrinker to enable
       more direct comparisons with the changes in this series. A subsequent
       patch series, which I expect to submit in collaboration with Nhat, will
       enable the zswap shrinker to quantify the benefits of decompression
       batching during writeback. With this series' compression batching within
       large folios, we get a 6%/1.8% reduction in sys time and a 3.5%/3.4%
       improvement in workload performance with 64K folios for deflate-iaa/zstd,
       respectively.

    These optimizations pertain to ensuring common code paths and removing
    redundant branches/computes. Additionally, using the batching code for
    non-batching compressors to sequentially compress/store batches of up
    to ZSWAP_MAX_BATCH_SIZE pages seems to help, most likely due to
    cache locality of working set structures such as the array of
    zswap_entry-s for the batch.
  
    Our internal validation of zstd with the batching interface vs. IAA with
    the batching interface on Emerald Rapids has shown that IAA
    compress/decompress batching gives 21.3% more memory savings as compared
    to zstd, for 5% performance loss as compared to the baseline without any
    memory pressure. IAA batching demonstrates more than 2X the memory
    savings obtained by zstd at this 95% performance KPI.
    The compression ratio with IAA is 2.23, and with zstd 2.96. Even with
    this compression ratio deficit for IAA, batching is extremely
    beneficial. As we improve the compression ratio of the IAA accelerator,
    we expect to see even better memory savings with IAA as compared to
    software compressors.
    

  Batching Roadmap:
  =================

  1) Compression batching within large folios (this series).
  
  2) zswap writeback decompression batching:

     This is being co-developed with Nhat Pham, and shows promising
     results. We plan to submit an RFC shortly.

  3) Reclaim batching of hybrid folios:
  
     We can expect to see even more significant performance and throughput
     improvements if we use the parallelism offered by IAA to do reclaim
     batching of 4K/large folios (really, any-order folios), using the
     zswap_store() high-throughput compression pipeline to batch-compress
     pages comprising these folios, not just batching within large
     folios. This is the reclaim batching patch 13 in v1, which we expect
     to submit in a separate patch-series. As mentioned earlier, reclaim
     batching reduces the # of writeback pages by 10X for zstd and
     deflate-iaa.

  4) swapin_readahead() decompression batching:

     We have developed a zswap load batching interface to be used
     for parallel decompression batching, using swapin_readahead().
  
  These capabilities are architected so as to be useful to zswap and
  zram. We have integrated these components with zram and expect to submit an
  RFC soon.

 
  v13 Performance Summary:
  ========================

  This is a performance testing summary of results with usemem30
  (30 usemem processes running in a cgroup limited at 150G, each trying to
   allocate 10G).

  usemem30 with 64K folios:
  =========================
  
     zswap shrinker_enabled = N.
  
     -----------------------------------------------------------------------
                     mm-unstable-10-24-2025             v13
     -----------------------------------------------------------------------
     zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
                                                                 vs.
                                                             IAA Sequential
     -----------------------------------------------------------------------
     Total throughput (KB/s)     6,118,675       9,901,216       62%
     Average throughput (KB/s)     203,955         330,040       62%     
     elapsed time (sec)              98.94           70.90      -28%      
     sys time (sec)               2,379.29        1,686.18      -29%      
     -----------------------------------------------------------------------
    
     -----------------------------------------------------------------------
                     mm-unstable-10-24-2025             v13    
     -----------------------------------------------------------------------
     zswap compressor                 zstd            zstd   v13 zstd    
                                                             improvement  
     -----------------------------------------------------------------------
     Total throughput (KB/s)     5,983,561       6,003,851      0.3%
     Average throughput (KB/s)     199,452         200,128      0.3% 
     elapsed time (sec)             100.93           96.62     -4.3%
     sys time (sec)               2,532.49        2,395.83       -5%         
     -----------------------------------------------------------------------

  usemem30 with 2M folios:
  ========================
  
     -----------------------------------------------------------------------
                     mm-unstable-10-24-2025             v13
     -----------------------------------------------------------------------
     zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
                                                                 vs.
                                                             IAA Sequential
     -----------------------------------------------------------------------
     Total throughput (KB/s)     6,309,635      10,558,225       67%
     Average throughput (KB/s)     210,321         351,940       67% 
     elapsed time (sec)              88.70           67.84      -24%
     sys time (sec)               2,059.83        1,581.07      -23%
     -----------------------------------------------------------------------
  
     -----------------------------------------------------------------------
                     mm-unstable-10-24-2025             v13      
     -----------------------------------------------------------------------
     zswap compressor                 zstd            zstd   v13 zstd           
                                                             improvement
     -----------------------------------------------------------------------
     Total throughput (KB/s)     6,562,687       6,567,946      0.1%
     Average throughput (KB/s)     218,756         218,931      0.1% 
     elapsed time (sec)              94.69           88.79       -6%
     sys time (sec)               2,253.97        2,083.43       -8%
     -----------------------------------------------------------------------


  This is a performance testing summary of results with
  kernel_compilation test (allmod config, 32 cores, cgroup limited to 2G).

  zswap shrinker_enabled = N.
  
  kernel_compilation with 64K folios:
  ===================================

     --------------------------------------------------------------------------
               mm-unstable-10-24-2025             v13
     --------------------------------------------------------------------------
     zswap compressor    deflate-iaa     deflate-iaa    IAA Batching
                                                             vs.
                                                        IAA Sequential
     --------------------------------------------------------------------------
     real_sec                 836.64          806.94      -3.5%
     sys_sec                3,897.57        3,661.83        -6%
     --------------------------------------------------------------------------

     --------------------------------------------------------------------------
               mm-unstable-10-24-2025             v13
     --------------------------------------------------------------------------
     zswap compressor           zstd            zstd    Improvement
     --------------------------------------------------------------------------
     real_sec                 880.62          850.41      -3.4%
     sys_sec                5,171.90        5,076.51      -1.8%
     --------------------------------------------------------------------------


  kernel_compilation with PMD folios:
  ===================================

     --------------------------------------------------------------------------
               mm-unstable-10-24-2025             v13
     --------------------------------------------------------------------------
     zswap compressor    deflate-iaa     deflate-iaa    IAA Batching
                                                             vs.
                                                        IAA Sequential
     --------------------------------------------------------------------------
     real_sec                 818.48          779.67      -4.7%
     sys_sec                4,226.52        4,245.18       0.4%
     --------------------------------------------------------------------------
 
     --------------------------------------------------------------------------
              mm-unstable-10-24-2025             v13
     --------------------------------------------------------------------------
     zswap compressor          zstd             zstd    Improvement
     --------------------------------------------------------------------------
     real_sec                888.45           849.54      -4.4%
     sys_sec               5,866.72         5,847.17      -0.3%
     --------------------------------------------------------------------------



The patch-series is organized as follows:
=========================================

 1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
    patches are tagged with "crypto:" in the subject:

    Patch 1) Reorganizes the iaa_crypto driver code into logically related
             sections and avoids forward declarations, in order to facilitate
             subsequent iaa_crypto patches. This patch makes no
             functional changes.

    Patch 2) Makes an infrastructure change in the iaa_crypto driver
             to map IAA devices/work-queues to cores based on packages
             instead of NUMA nodes. This doesn't impact performance on
             the Sapphire Rapids system used for performance
             testing. However, this change fixes functional problems we
             found on Granite Rapids during internal validation, where the
             number of NUMA nodes is greater than the number of packages,
             which was resulting in over-utilization of some IAA devices
             and non-usage of other IAA devices as per the current NUMA
             based mapping infrastructure.

             This patch also develops a new architecture that
             generalizes how IAA device WQs are used. It enables
             designating IAA device WQs as either compress-only or
             decompress-only or generic. Once IAA device WQ types are
             thus defined, it also allows the configuration of whether
             device WQs will be shared by all cores on the package, or
             used only by "mapped cores" obtained by a simple allocation
             of available IAAs to cores on the package.

             As a result of the overhaul of wq_table definition,
             allocation and rebalancing, this patch eliminates
             duplication of device WQs in per-CPU wq_tables, thereby
             saving 140MiB on a 384-core, dual-socket Granite Rapids
             server with 8 IAAs.

             Regardless of how the user has configured the WQs' usage,
             the next WQ to use is obtained through a direct look-up in
             per-CPU "cpu_comp_wqs" and "cpu_decomp_wqs" structures so
             as to minimize latency in the critical path driver compress
             and decompress routines.

    Patch 3) Code cleanup, consistency of function parameters.

    Patch 4) Makes a change to iaa_crypto driver's descriptor allocation,
             from blocking to non-blocking with retries/timeouts and
             mitigations in case of timeouts during compress/decompress
             ops. This prevents tasks getting blocked indefinitely, which
             was observed when testing 30 cores running workloads, with
             only 1 IAA enabled on Sapphire Rapids (out of 4). These
             timeouts, and the associated mitigations, are typically
             encountered only in configurations with 1 IAA device shared
             by 30+ cores.

    Patch 5) Optimize iaa_wq refcounts using a percpu_ref instead of
             spinlocks and "int refcount".

    Patch 6) Code simplification and restructuring for understandability
             in core iaa_compress() and iaa_decompress() routines.

    Patch 7) Refactor hardware descriptor setup to their own procedures
             to reduce code clutter.

    Patch 8) Simplify and optimize job submission for the most commonly used
             non-irq async mode by directly calling movdir64b.

    Patch 9) Deprecate exporting symbols for adding IAA compression
             modes.

    Patch 10) All dma_map_sg() calls will pass in 1 for the nents instead of
              sg_nents(), for these main reasons: performance; no existing
              iaa_crypto use cases allow multiple SG lists to be mapped for
              dma at once; and it facilitates the new SG-list-based batching
              interface through crypto.

    Patch 11) Move iaa_crypto core functionality to a layer that relies only on
              the idxd driver, dma, and scatterlists. Implement clean interfaces
              to crypto_acomp.

    Patch 12) Define a unit_size in struct acomp_req to enable batching, and
              provides acomp_request_set_unit_size() for use by kernel
              modules. zswap_cpu_comp_prepare() calls this API to set the
              unit_size for zswap as PAGE_SIZE.

    Patch 13) Implement asynchronous descriptor submit and polling mechanisms,
              enablers for batching. Develop IAA batching of compressions and
              decompressions to derive the benefits of hardware parallelism.

    Patch 14) Enables the "async" mode, sets it as the default.

    Patch 15) Disables verify_compress by default.

    Patch 16) Decompress batching optimization: Find the two largest
              buffers in the batch and submit them first.
             
    Patch 17) Add a new Dynamic compression mode that can be used on
              Granite Rapids.

    Patch 18) Add a batch_size data member to struct acomp_alg and
              a crypto_acomp_batch_size() API that returns the compressor's
              batch-size, if it has defined one; 1 otherwise.

 2) zswap modifications to enable compress batching in zswap_store()
    of large folios (including pmd-mappable folios):

    Patch 19) Simplifies the zswap_pool's per-CPU acomp_ctx resource
              management and lifetime to be from pool creation to pool
              deletion.

    Patch 20) Uses IS_ERR_OR_NULL() in zswap_cpu_comp_prepare() to check for
              valid acomp/req, thereby making it consistent with the resource
              de-allocation code.

    Patch 21) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently set
              as 8U) to denote the maximum number of acomp_ctx batching
              resources to allocate, thus limiting the amount of extra
              memory used for batching. Further, the "struct
              crypto_acomp_ctx" is modified to contain multiple buffers.
              A new "u8 compr_batch_size" member is added to "struct
              zswap_pool" to track the number of dst buffers associated
              with the compressor (more than 1 if the compressor supports
              batching).

              Modifies zswap_store() to store the folio in batches of
              pool->compr_batch_size (batching compressors) or
              ZSWAP_MAX_BATCH_SIZE (sequential compressors) by calling a new
              zswap_store_pages() that takes a range of indices in the folio
              to be stored (see the sketch after this patch list).

              zswap_store_pages() bulk-allocates zswap entries for the batch,
              calls zswap_compress() for each page in this range, and stores
              the entries in the xarray/LRU.

    Patch 22) Introduces a new unified batching implementation of
              zswap_compress() for compressors that do and do not support
              batching. This eliminates code duplication and facilitates
              code maintainability with the introduction of compress
              batching. Further, there are many optimizations to this common
              code that result in workload throughput and performance
              improvements with software compressors and hardware accelerators
              such as IAA.

              zstd performance is better or on par with mm-unstable. We
              see impressive throughput/performance improvements with
              IAA and workload performance/sys time improvement with zstd
              batching vs. no-batching.
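
As a conceptual sketch of the patch 21/22 store path referenced above
(hypothetical helper and signatures, simplified error handling; not the actual
patch code), zswap_store() walks the folio in batches whose size is derived
from crypto_acomp_batch_size() at pool creation time:

    /* Hypothetical helper illustrating the batched store loop. */
    static bool zswap_store_folio_batched(struct folio *folio,
                                          struct zswap_pool *pool)
    {
            /* pool->compr_batch_size comes from crypto_acomp_batch_size()
             * at pool creation; software compressors report 1 and are
             * instead stored in ZSWAP_MAX_BATCH_SIZE chunks. */
            unsigned int batch_size = pool->compr_batch_size > 1 ?
                                      pool->compr_batch_size :
                                      ZSWAP_MAX_BATCH_SIZE;
            long nr_pages = folio_nr_pages(folio);
            long start;

            for (start = 0; start < nr_pages; start += batch_size) {
                    unsigned int nr = min_t(long, batch_size,
                                            nr_pages - start);

                    /* Hypothetical signature: compress and publish pages
                     * [start, start + nr) of the folio to the xarray/LRU. */
                    if (!zswap_store_pages(folio, start, nr, pool))
                            return false;
            }

            return true;
    }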


With v13 of this patch series, the IAA compress batching feature will be
enabled seamlessly on Intel platforms that have IAA by selecting
'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async'
sync_mode driver attribute (the default).


System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 10-24-2025,
commit 813c0fa931ce, without and with this patch-series. Data was
gathered on an Intel Sapphire Rapids (SPR) server: dual-socket, 56 cores
per socket, 4 IAA devices per socket (each IAA has 128 WQ entries in
total), 503 GiB RAM and a 525G SSD disk partition for swap. Core
frequency was fixed at 2500MHz.

Other kernel configuration parameters:

    zswap compressor  : zstd, deflate-iaa
    zswap allocator   : zsmalloc
    vm.page-cluster   : 0

IAA "compression verification" is disabled and IAA is run in the async
mode (the defaults with this series).

I ran experiments with these workloads:

1) usemem 30 processes with zswap shrinker_enabled=N. Two sets of
   experiments, one with 64K folios, another with PMD folios.

2) Kernel compilation allmodconfig with 2G max memory, 32 threads, with
   zswap shrinker_enabled=N to test batching performance impact in
   isolation. Two sets of experiments, one with 64K folios, another with PMD
   folios.

IAA configuration is done by a CLI: script is included at the end of the
cover letter.


Performance testing (usemem30):
===============================
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and
sleeping for 10 sec before exiting:

 usemem --init-time -w -O -b 1 -s 10 -n 30 10g
 echo 0 > /sys/module/zswap/parameters/shrinker_enabled

 IAA WQ Configuration (script is included at the end of the cover
 letter):

   ./enable_iaa.sh -d 4 -q 1
   
 This enables all 4 IAAs on the socket, and configures 1 WQ per IAA
 device, each containing 128 entries. The driver distributes compress
 jobs from each core to wqX.0 of all same-package IAAs in a
 round-robin manner (a schematic sketch follows below). Decompress jobs
 are sent to the wqX.0 of the mapped IAA device.

 Since usemem has significantly more swapouts than swapins, this
 configuration is optimal.
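
 Schematically, the per-CPU round-robin distribution of compress jobs amounts
 to the following (hypothetical structure and function names, not the driver's
 actual code; struct idxd_wq is the existing idxd work-queue type):

     struct comp_wq_table {               /* hypothetical per-CPU table */
             struct idxd_wq **wqs;        /* compress WQs of same-package IAAs */
             unsigned int n_wqs;
             unsigned int cur;            /* round-robin cursor */
     };

     static struct idxd_wq *next_comp_wq(struct comp_wq_table *t)
     {
             struct idxd_wq *wq = t->wqs[t->cur];

             if (++t->cur == t->n_wqs)
                     t->cur = 0;
             return wq;
     }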

 64K folios: usemem30: deflate-iaa:
 ==================================

 -------------------------------------------------------------------------------
                    mm-unstable-10-24-2025             v13
 -------------------------------------------------------------------------------
 zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
                                                                 vs.
                                                             IAA Sequential
 -------------------------------------------------------------------------------
 Total throughput (KB/s)        6,118,675       9,901,216         62%
 Avg throughput (KB/s)            203,955         330,040         62%       
 elapsed time (sec)                 98.94           70.90        -28%      
 sys time (sec)                  2,379.29        1,686.18        -29%      
                                                                         
 -------------------------------------------------------------------------------
 memcg_high                     1,263,467       1,404,068                
 memcg_swap_fail                    1,728           1,377                
 64kB_swpout_fallback               1,728           1,377                
 zswpout                       58,174,008      64,508,622                
 zswpin                                43             138                
 pswpout                                0               0                
 pswpin                                 0               0                
 ZSWPOUT-64kB                   3,634,162       4,030,643                
 SWPOUT-64kB                            0               0
 pgmajfault                         2,398           2,488
 zswap_reject_compress_fail             0               0
 zswap_reject_reclaim_fail              0               0
 IAA incompressible pages               0               0
 -------------------------------------------------------------------------------


 2M folios: usemem30: deflate-iaa:
 =================================

 -------------------------------------------------------------------------------
                    mm-unstable-10-24-2025             v13
 -------------------------------------------------------------------------------
 zswap compressor             deflate-iaa     deflate-iaa     IAA Batching
                                                                  vs.
                                                              IAA Sequential
 -------------------------------------------------------------------------------
 Total throughput (KB/s)        6,309,635      10,558,225        67%
 Avg throughput (KB/s)            210,321         351,940        67%
 elapsed time (sec)                 88.70           67.84       -24%     
 sys time (sec)                  2,059.83        1,581.07       -23%     
                                                               
 -------------------------------------------------------------------------------
 memcg_high                       116,246         125,218     
 memcg_swap_fail                       41             177     
 thp_swpout_fallback                   41             177     
 zswpout                       59,880,021      64,509,854     
 zswpin                                69             425     
 pswpout                                0               0     
 pswpin                                 0               0     
 ZSWPOUT-2048kB                   116,912         125,822     
 thp_swpout                             0               0     
 pgmajfault                         2,408           4,026
 zswap_reject_compress_fail             0               0
 zswap_reject_reclaim_fail              0               0
 IAA incompressible pages               0               0
 -------------------------------------------------------------------------------


 64K folios: usemem30: zstd:
 ===========================

 -------------------------------------------------------------------------------
                    mm-unstable-10-24-2025             v13        
 -------------------------------------------------------------------------------
 zswap compressor                    zstd            zstd        v13 zstd    
                                                                 improvement  
 -------------------------------------------------------------------------------
 Total throughput (KB/s)        5,983,561       6,003,851         0.3%
 Avg throughput (KB/s)            199,452         200,128         0.3%       
 elapsed time (sec)                100.93           96.62        -4.3%
 sys time (sec)                  2,532.49        2,395.83          -5%
                                                          
 -------------------------------------------------------------------------------
 memcg_high                     1,122,198       1,113,384 
 memcg_swap_fail                      192              55 
 64kB_swpout_fallback                 192              55 
 zswpout                       48,766,907      48,799,863
 zswpin                                89              68 
 pswpout                                0               0 
 pswpin                                 0               0 
 ZSWPOUT-64kB                   3,047,702       3,049,908 
 SWPOUT-64kB                            0               0 
 pgmajfault                         2,428           2,390 
 zswap_reject_compress_fail             0               0 
 zswap_reject_reclaim_fail              0               0 
 -------------------------------------------------------------------------------


 2M folios: usemem30: zstd:
 ==========================

 -------------------------------------------------------------------------------
                    mm-unstable-10-24-2025             v13      
 -------------------------------------------------------------------------------
 zswap compressor                    zstd            zstd        v13 zstd           
                                                                 improvement
 -------------------------------------------------------------------------------
 Total throughput (KB/s)        6,562,687       6,567,946         0.1%
 Avg throughput (KB/s)            218,756         218,931         0.1%    
 elapsed time (sec)                 94.69           88.79          -6%
 sys time (sec)                  2,253.97        2,083.43          -8%
                                                          
 --------------------------------------------------------------------------------
 memcg_high                        92,709          92,686 
 memcg_swap_fail                       33             226 
 thp_swpout_fallback                   33             226 
 zswpout                       47,851,601      47,847,171
 zswpin                                65             441 
 pswpout                                0               0 
 pswpin                                 0               0 
 ZSWPOUT-2048kB                    93,427          93,238 
 thp_swpout                             0               0 
 pgmajfault                         2,382           2,767 
 zswap_reject_compress_fail             0               0 
 zswap_reject_reclaim_fail              0               0 
 -------------------------------------------------------------------------------


Performance testing (Kernel compilation, allmodconfig):
=======================================================

The experiments with the kernel compilation test use 32 threads and build
"allmodconfig", which takes ~14 minutes and has considerable
swapout/swapin activity. The cgroup's memory.max is set to 2G. zswap
writeback is not enabled, so as to isolate the performance impact of large
folio batch compression alone.

 echo 0 > /sys/module/zswap/parameters/shrinker_enabled

 IAA WQ Configuration (script is at the end of the cover letter):

   ./enable_iaa.sh -d 4 -q 2
   
 This enables all 4 IAAs on the socket, and configures 2 WQs per IAA,
 each containing 64 entries. The driver sends decompress jobs to wqX.0 of
 the mapped IAA device, and distributes compress jobs to wqX.1 of all
 same-package IAAs in a round-robin manner.

 64K folios: Kernel compilation/allmodconfig: deflate-iaa:
 =========================================================

 -------------------------------------------------------------------------------
                    mm-unstable-10-24-2025             v13
 -------------------------------------------------------------------------------
 zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
                                                                 vs.
                                                             IAA Sequential
 -------------------------------------------------------------------------------
 real_sec                          836.64          806.94       -3.5%
 user_sec                       15,702.26       15,695.13
 sys_sec                         3,897.57        3,661.83         -6%
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB            1,872,500       1,873,144
 -------------------------------------------------------------------------------
 memcg_high                             0               0        
 memcg_swap_fail                        0               0        
 64kB_swpout_fallback                   0               0        
 zswpout                       94,890,390      93,332,527        
 zswpin                        28,305,656      28,111,525        
 pswpout                                0               0        
 pswpin                                 0               0        
 ZSWPOUT-64kB                   3,088,473       3,018,341        
 SWPOUT-64kB                            0               0
 pgmajfault                    29,958,141      29,776,102
 zswap_reject_compress_fail             0               0
 zswap_reject_reclaim_fail              0               0
 IAA incompressible pages             684             442 
 -------------------------------------------------------------------------------


 2M folios: Kernel compilation/allmodconfig: deflate-iaa:
 ========================================================

 -------------------------------------------------------------------------------
                    mm-unstable-10-24-2025             v13
 -------------------------------------------------------------------------------
 zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
                                                                 vs.
                                                             IAA Sequential
 -------------------------------------------------------------------------------
 real_sec                          818.48          779.67         -4.7%
 user_sec                       15,798.78       15,807.93  
 sys_sec                         4,226.52        4,245.18          0.4%
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB            1,871,096       1,871,100  
 -------------------------------------------------------------------------------
 memcg_high                             0               0     
 memcg_swap_fail                        0               0     
 thp_swpout_fallback                    0               0     
 zswpout                      105,675,621     109,930,550          
 zswpin                        36,537,688      38,205,575     
 pswpout                                0               0     
 pswpin                                 0               0     
 ZSWPOUT-2048kB                    15,600          15,800     
 thp_swpout                             0               0     
 pgmajfault                    37,843,091      39,540,387  
 zswap_reject_compress_fail             0               0  
 zswap_reject_reclaim_fail              0               0  
 IAA incompressible pages             188             349  
 -------------------------------------------------------------------------------


With the iaa_crypto driver changes for non-blocking descriptor allocations,
no timeouts-with-mitigations were seen in compress/decompress jobs in any
of the above experiments.


 64K folios: Kernel compilation/allmodconfig: zstd:
 ==================================================

 -------------------------------------------------------------------------------
                    mm-unstable-10-24-2025             v13
 -------------------------------------------------------------------------------
 zswap compressor                    zstd            zstd    Improvement
 -------------------------------------------------------------------------------
 real_sec                          880.62          850.41        -3.4%
 user_sec                       15,717.23       15,683.17 
 sys_sec                         5,171.90        5,076.51        -1.8%
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB            1,871,276       1,874,744 
 -------------------------------------------------------------------------------
 memcg_high                             0               0         
 memcg_swap_fail                        0               0         
 64kB_swpout_fallback                   0               0         
 zswpout                       76,599,637      76,472,392         
 zswpin                        21,833,178      22,538,969         
 pswpout                                0               0         
 pswpin                                 0               0         
 ZSWPOUT-64kB                   2,462,404       2,446,549         
 SWPOUT-64kB                            0               0 
 pgmajfault                    23,027,211      23,830,391 
 zswap_reject_compress_fail             0               0 
 zswap_reject_reclaim_fail              0               0 
 -------------------------------------------------------------------------------


 2M folios: Kernel compilation/allmodconfig: zstd:
 =================================================

 -------------------------------------------------------------------------------
                    mm-unstable-10-24-2025             v13
 -------------------------------------------------------------------------------
 zswap compressor                    zstd            zstd    Improvement
 -------------------------------------------------------------------------------
 real_sec                          888.45          849.54       -4.4%
 user_sec                       15,841.87       15,828.10
 sys_sec                         5,866.72        5,847.17       -0.3%
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB            1,871,096       1,872,892
 -------------------------------------------------------------------------------
 memcg_high                             0               0   
 memcg_swap_fail                        0               0   
 thp_swpout_fallback                    0               0   
 zswpout                       89,891,328      90,847,761        
 zswpin                        29,249,656      29,999,617   
 pswpout                                0               0   
 pswpin                                 0               0   
 ZSWPOUT-2048kB                    12,198          12,481   
 thp_swpout                             0               0   
 pgmajfault                    30,077,425      30,915,945
 zswap_reject_compress_fail             0               0
 zswap_reject_reclaim_fail              0               0
 -------------------------------------------------------------------------------



Changes since v12:
==================
1)  Rebased to mm-unstable as of 10-24-2025, commit 813c0fa931ce.
2)  Added "int nid" to zswap_entry to store the page's nid, to preserve zswap
    LRU list/shrinker behavior with bulk allocation, as suggested by Nhat and
    Yosry. No change in memory footprint of struct zswap_entry.
3)  Added a WARN_ON() if kmem_cache_alloc_bulk() returns 0 or a number that's
    different than nr_entries, as suggested by Yosry.
4)  Confirmed that kmem_cache_free_bulk() works for both bulk and non-bulk
    allocated entries, to follow up on Yosry's comment.
5)  Moved the call to cpuhp_state_remove_instance() to zswap_pool_destroy(), as
    suggested by Yosry.
6)  Variable names changed to "nid" and "wb_enabled", per Yosry's suggestion.
7)  Concise comments in zswap.c, and summarized commit logs, as suggested by
    Yosry.
8)  Minimized branches in zswap_compress().
9)  Stopped allocating extra memory in acomp_req->__ctx[] to statically store
    the addresses of the SG lists' lengths, as suggested by Herbert.
10) Deleted the iaa_comp API and export symbols, as suggested by Herbert.
11) Deleted @batch_size in struct crypto_acomp. Instead, the value is returned
    from struct acomp_alg directly, as suggested by Herbert.
12) Addressed checkpatch.pl warnings and coding style suggestions in the
    iaa_crypto patches, provided by Vinicius Gomes in internal code
    reviews. Thanks Vinicius!


Changes since v11:
==================
1) Rebased to mm-unstable as of 9-18-2025, commit 1f98191f08b4.
2) Incorporated Herbert's suggestions on submitting the folio as the source and 
   SG lists for the destination to create the compress batching interface from
   zswap to crypto.
3) As per Herbert's suggestion, added a new unit_size member to struct
   acomp_req, along with a acomp_request_set_unit_size() API for kernel modules
   to set the unit size to use while breaking down the request's src/dst
   scatterlists.
4) Implemented iaa_crypto batching using the new SG lists based architecture and
   crypto interfaces.
5) To make the SG lists based approach functional and performant for IAA, I have
   changed all the calls to dma_map_sg() to use nents of 1. This should not be a
   concern, since it eliminates redundant computes to scan an SG list with only
   one scatterlist for existing kernel users, i.e. zswap with the
   zswap_compress() modifications in this series. This will continue to hold
   true with the zram IAA batching support I am developing. There are no kernel
   use cases for the iaa_crypto driver that will break this assumption.
6) Addressed Herbert's comment about batch_size being a statically defined data
   member in struct acomp_alg and struct crypto_acomp.
7) Addressed Nhat's comment about VM_WARN_ON_ONCE(nr_pages >
   ZSWAP_MAX_BATCH_SIZE) in zswap_store_pages().
8) Nhat's comment about deleting struct swap_batch_decomp_data is automatically
   addressed by the SG lists based rewrite of the crypto batching interface.
9) Addressed Barry's comment about renaming pool->batch_size to
   pool->store_batch_size.
10) Incorporated Barry's suggestion to merge patches that introduce data members
    to structures and/or API and their usage.
11) Added performance data to patch 0023's commit log, as suggested by Barry.

Changes since v10:
==================
1) Rebased to mm-unstable as of 7-30-2025, commit 01da54f10fdd.
2) Added change logging in patch 0024 on there being no Intel-specific
   dependencies in the batching framework, as suggested by
   Andrew Morton. Thanks Andrew!
3) Added change logging in patch 0024 on other ongoing work that can use
   batching, as per Andrew's suggestion. Thanks Andrew!
4) Added the IAA configuration script in the cover letter, as suggested
   by Nhat Pham. Thanks Nhat!
5) As suggested by Nhat, dropped patch 0020 from v10, that moves CPU
   hotplug procedures to pool functions.
6) Gathered kernel_compilation 'allmod' config performance data with
   writeback and zswap shrinker_enabled=Y.
7) Changed the pool->batch_size for software compressors to be
   ZSWAP_MAX_BATCH_SIZE since this gives better performance with the zswap
   shrinker enabled.
8) Was unable to replicate in v11 the issue seen in v10 with higher
   memcg_swap_fail than in the baseline, with usemem30/zstd.

Changes since v9:
=================
1) Rebased to mm-unstable as of 6-24-2025, commit 23b9c0472ea3.
2) iaa_crypto rearchitecting, mainline race condition fix, performance
   optimizations, code cleanup.
3) Addressed Herbert's comments in v9 patch 10, that an array based
   crypto_acomp interface is not acceptable.
4) Optimized the implementation of the batching zswap_compress() and
   zswap_store_pages() added in v9, to recover performance when
   integrated with the changes in commit 56e5a103a721 ("zsmalloc: prefer
   the the original page's node for compressed data").

Changes since v8:
=================
1) Rebased to mm-unstable as of 4-21-2025, commit 2c01d9f3c611.
2) Backported commits for reverting request chaining, since these are
   in cryptodev-2.6 but not yet in mm-unstable: without these backports,
   deflate-iaa is non-functional in mm-unstable:
   commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining")
   commit 5976fe19e240 ("Revert "crypto: testmgr - Add multibuffer acomp
                         testing"")
   Backported this hotfix as well:
   commit 002ba346e3d7 ("crypto: scomp - Fix off-by-one bug when
   calculating last page").
3) crypto_acomp_[de]compress() restored to non-request chained
   implementations since request chaining has been removed from acomp in
   commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining").
4) New IAA WQ architecture to denote WQ type and whether or not a WQ
   should be shared among all package cores, or only to the "mapped"
   ones from an even cores-to-IAA distribution scheme.
5) Compress/decompress batching are implemented in iaa_crypto using new
   crypto_acomp_batch_compress()/crypto_acomp_batch_decompress() API.
6) Defines a "void *data" in struct acomp_req, based on Herbert advising
   against using req->base.data in the driver. This is needed for async
   submit-poll to work.
7) In zswap.c, moved the CPU hotplug callbacks to reside in "pool
   functions", per Yosry's suggestion to move procedures in a distinct
   patch before refactoring patches.
8) A new "u8 nr_reqs" member is added to "struct zswap_pool" to track
   the number of requests/buffers associated with the per-cpu acomp_ctx,
   as per Yosry's suggestion.
9) Simplifications to the acomp_ctx resources allocation, deletion,
   locking, and for these to exist from pool creation to pool deletion,
   based on v8 code review discussions with Yosry.
10) Use IS_ERR_OR_NULL() consistently in zswap_cpu_comp_prepare() and
    acomp_ctx_dealloc(), as per Yosry's v8 comment.
11) zswap_store_folio() is deleted, and instead, the loop over
    zswap_store_pages() is moved inline in zswap_store(), per Yosry's
    suggestion.
12) Better structure in zswap_compress(): a unified procedure that
    compresses/stores a batch of pages for both non-batching and batching
    compressors. Renamed from zswap_batch_compress() to zswap_compress().
    Thanks Yosry for these suggestions.


Changes since v7:
=================
1) Rebased to mm-unstable as of 3-3-2025, commit 5f089a9aa987.
2) Changed the acomp_ctx->nr_reqs to be u8 since ZSWAP_MAX_BATCH_SIZE is
   defined as 8U, for saving memory in this per-cpu structure.
3) Fixed a typo in code comments in acomp_ctx_get_cpu_lock():
   acomp_ctx->initialized to acomp_ctx->__online.
4) Incorporated suggestions from Yosry, Chengming, Nhat and Johannes,
   thanks to all!
   a) zswap_batch_compress() replaces zswap_compress(). Thanks Yosry
      for this suggestion!
   b) Process the folio in sub-batches of ZSWAP_MAX_BATCH_SIZE, regardless
      of whether or not the compressor supports batching. This gets rid of
      the kmalloc(entries), and allows us to allocate an array of
      ZSWAP_MAX_BATCH_SIZE entries on the stack. This is implemented in
      zswap_store_pages().
   c) Use of a common structure and code paths for compressing a folio in
      batches, either as a request chain (in parallel in IAA hardware) or
      sequentially. There is no code duplication, since zswap_compress()
      has been replaced with zswap_batch_compress(), which improves
      maintainability.
5) A key difference between compressors that support batching and those
   that do not is that, for the latter, the acomp_ctx mutex is
   locked/unlocked per ZSWAP_MAX_BATCH_SIZE batch, so that decompressions
   to handle page-faults can make progress. This fixes the zstd kernel
   compilation regression seen in v7. For compressors that support
   batching, e.g. IAA, the mutex is locked/released once for storing the
   folio.
6) Used likely/unlikely compiler directives and prefetchw to restore
   performance with the common code paths.

Changes since v6:
=================
1) Rebased to mm-unstable as of 2-27-2025, commit d58172d128ac.

2) Deleted crypto_acomp_batch_compress() and
   crypto_acomp_batch_decompress() interfaces, as per Herbert's
   suggestion. Batching is instead enabled by chaining the requests. For
   non-batching compressors, there is no request chaining involved. Both
   batching and non-batching compressions are accomplished by zswap by
   calling:

   crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);

3) iaa_crypto implementation of batch compressions/decompressions using
   request chaining, as per Herbert's suggestions.
4) Simplification of the acomp_ctx resource allocation/deletion with
   respect to CPU hot[un]plug, to address Yosry's suggestions to explore the
   mutex options in zswap_cpu_comp_prepare(). Yosry, please let me know if
   the per-cpu memory cost of this proposed change is acceptable (IAA:
   64.8KB, Software compressors: 8.2KB). On the positive side, I believe
   restarting reclaim on a CPU after it has been through an offline-online
   transition will be much faster, because the acomp_ctx resources are not
   deleted when the CPU gets offlined.
5) Use of lockdep assertions rather than comments for internal locking
   rules, as per Yosry's suggestion.
6) No specific references to IAA in zswap.c, as suggested by Yosry.
7) Explored various solutions other than the v6 zswap_store_folio()
   implementation, to fix the zstd regression seen in v5, to attempt to
   unify common code paths, and to allocate smaller arrays for the zswap
   entries on the stack. All these options were found to cause usemem30
   latency regression with zstd. The v6 version of zswap_store_folio() is
   the only implementation that does not cause zstd regression, confirmed
   by 10 consecutive runs, each giving quite consistent latency
   numbers. Hence, the v6 implementation is carried forward to v7, with
   changes for branching for batching vs. sequential compression API
   calls.


Changes since v5:
=================
1) Rebased to mm-unstable as of 2-1-2025, commit 7de6fd8ab650.

Several improvements, regression fixes and bug fixes, based on Yosry's
v5 comments (Thanks Yosry!):

2) Fix for zstd performance regression in v5.
3) Performance debug and fix for marginal improvements with IAA batching
   vs. sequential.
4) Performance testing data compares IAA with and without batching, instead
   of IAA batching against zstd.
5) Commit logs/zswap comments not mentioning crypto_acomp implementation
   details.
6) Delete the pr_info_once() when batching resources are allocated in
   zswap_cpu_comp_prepare().
7) Use kcalloc_node() for the multiple acomp_ctx buffers/reqs in
   zswap_cpu_comp_prepare().
8) Simplify and consolidate error handling cleanup code in
   zswap_cpu_comp_prepare().
9) Introduce zswap_compress_folio() in a separate patch.
10) Bug fix in zswap_store_folio(): an xa_store() failure could cause all
    compressed objects and entries to be freed, followed by a UAF when
    zswap_store() tries to free the entries that were already added to the
    xarray prior to the failure.
11) Deleted compressed_bytes/bytes. zswap_store_folio() also incorporates
    the recent fixes in commit bf5eaaaf7941 ("mm/zswap: fix inconsistency
    when zswap_store_page() fails") by Hyeonggon Yoo.

iaa_crypto improvements/fixes/changes:

12) Enables asynchronous mode and makes it the default. Since commit
    4ebd9a5ca478 ("crypto: iaa - Fix IAA disabling that occurs when
    sync_mode is set to 'async'"), async mode was effectively synchronous.
    We now have true async support.
13) Change idxd descriptor allocations from blocking to non-blocking with
    timeouts, and add mitigations for compress/decompress ops that fail to
    obtain a descriptor. This fixes "task blocked" errors seen in
    configurations where 30+ cores run workloads under high memory
    pressure and send comps/decomps to 1 IAA device.
14) Fixes a bug with unprotected access of "deflate_generic_tfm" in
    deflate_generic_decompress(), which can cause data corruption and a
    kernel crash in zswap_decompress().
15) zswap uses crypto_acomp_batch_compress() with async polling instead of
    request chaining for slightly better latency. However, the request
    chaining framework itself is unchanged, preserved from v5.


Changes since v4:
=================
1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6.
2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert!
3) Implemented IAA compress batching using request chaining.
4) zswap_store() batching simplifications suggested by Chengming, Yosry and
   Nhat, thanks to all!
   - New zswap_compress_folio() that is called by zswap_store().
   - Move the loop over folio's pages out of zswap_store() and into a
     zswap_store_folio() that stores all pages.
   - Allocate all zswap entries for the folio upfront.
   - Added zswap_batch_compress().
   - Branch to call zswap_compress() or zswap_batch_compress() inside
     zswap_compress_folio().
   - All iterations over pages are kept at the same function level.
   - No helpers other than the newly added zswap_store_folio() and
     zswap_compress_folio().


Changes since v3:
=================
1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
2) Major re-write of iaa_crypto driver's mapping of IAA devices to cores,
   based on packages instead of NUMA nodes.
3) Added acomp_has_async_batching() API to crypto acomp, that allows
   zswap/zram to query if a crypto_acomp has registered batch_compress and
   batch_decompress interfaces.
4) Clear the poll bits on the acomp_reqs passed to
   iaa_comp_a[de]compress_batch() so that a module like zswap can be
   confident about the acomp_reqs[0] not having the poll bit set before
   calling the fully synchronous API crypto_acomp_[de]compress().
   Herbert, I would appreciate it if you could review changes 2-4 in
   patches 1-8 of v4. I did not want to introduce too many iaa_crypto
   changes in v4, given that patch 7 is already making a major change. I
   plan to work on incorporating the request chaining using the ahash
   interface in v5 (I need to understand the basic crypto ahash better).
   Thanks Herbert!
5) Incorporated Johannes' suggestion to not have a sysctl to enable
   compress batching.
6) Incorporated Yosry's suggestion to allocate batching resources in the
   cpu hotplug onlining code, since there is no longer a sysctl to control
   batching. Thanks Yosry!
7) Incorporated Johannes' suggestions related to making the overall
   sequence of events in zswap_store() and zswap_batch_store() as similar
   as possible for readability and control flow, better naming of
   procedures, avoiding forward declarations, not inlining error-path
   procedures, deleting zswap internal details from zswap.h, etc. Thanks
   Johannes, really appreciate the direction!
   I have tried to explain the minimal future-proofing in terms of the
   zswap_batch_store() signature and the definition of "struct
   zswap_batch_store_sub_batch" in the comments for this struct. I hope the
   new code explains the control flow a bit better.


Changes since v2:
=================
1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
   returned by kmalloc_node() for acomp_ctx->buffers and for
   acomp_ctx->reqs.
3) Fixed a bug in zswap_pool_can_batch() so that it returns true if
   pool->can_batch_comp equals BATCH_COMP_ENABLED, and if the per-cpu
   acomp_batch_ctx has batching resources allocated on this cpu. Also,
   changed from per_cpu_ptr() to raw_cpu_ptr().
4) Incorporated the zswap_store_propagate_errors() compilation warning fix
   suggested by Dan Carpenter. Thanks Dan!
5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
   zswap.h with SWAP_CRYPTO_BATCH_SIZE.

Changes since v1:
=================
1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
   async/poll mode, and to encapsulate the polling functionality in the
   iaa_crypto driver. Thanks Herbert!
3) Incorporated Herbert's and Yosry's suggestions to implement the batching
   API in iaa_crypto and to make its use seamless from zswap's
   perspective. Thanks Herbert and Yosry!
4) Incorporated Yosry's suggestion to make it more convenient for the user
   to enable compress batching, while minimizing the memory footprint
   cost. Thanks Yosry!
5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
   reclaim batching patch from this series, since it requires a broader
   discussion.


IAA configuration script "enable_iaa.sh":
=========================================

 Acknowledgements: Binuraj Ravindran and Rakib Al-Fahad.

 Usage:
 ------

   ./enable_iaa.sh -d <num_IAAs> -q <num_WQs_per_IAA>
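
 For example, a possible invocation (the device and WQ counts below are just
 placeholders; pick values that match your platform):

   ./enable_iaa.sh -d 2 -q 2 -m async

 This would configure 2 IAA devices per socket, each with 2 dedicated
 kernel-mode WQs, and set the iaa_crypto driver's sync_mode to 'async' (the
 script's default).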


 #---------------------------------<cut here>----------------------------------
 #!/usr/bin/env bash
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright (c) 2025, Intel Corporation
 # Description: Configure IAA devices
 
 VERIFY_COMPRESS_PATH="/sys/bus/dsa/drivers/crypto/verify_compress"
 
 iax_dev_id="0cfe"
 num_iaa=$(lspci -d:${iax_dev_id} | wc -l)
 sockets=$(lscpu | grep Socket | awk '{print $2}')
 echo "Found ${num_iaa} instances in ${sockets} sockets(s)"
 
 #  The same number of devices will be configured in each socket, if there
 #  is more than one socket.
 #  Normalize with respect to the number of sockets.
 device_num_per_socket=$(( num_iaa/sockets ))
 num_iaa_per_socket=$(( num_iaa / sockets ))
 
 iaa_wqs=2
 verbose=0
 iaa_engines=8
 mode="dedicated"
 wq_type="kernel"
 iaa_crypto_mode="async"
 verify_compress=0
 
 
 # Function to handle errors
 handle_error() {
     echo "Error: $1"
     exit 1
 }
 
 # Process arguments
 
 while getopts "d:hm:q:vD" opt; do
   case $opt in
     d)
       device_num_per_socket=$OPTARG
       ;;
     m)
       iaa_crypto_mode=$OPTARG
       ;;
     q)
       iaa_wqs=$OPTARG
       ;;
     D)
       verbose=1
       ;;
     v)
       verify_compress=1
       ;;
     h)
       echo "Usage: $0 [-d <device_count>][-q <wq_per_device>][-v]"
       echo "       -d - number of devices"
       echo "       -q - number of WQs per device"
       echo "       -v - verbose mode"
       echo "       -h - help"
       exit
       ;;
     \?)
       echo "Invalid option: -$OPTARG" >&2
       exit
       ;;
   esac
 done
 
 LOG="configure_iaa.log"
 
 # Update wq_size based on number of wqs
 wq_size=$(( 128 / iaa_wqs ))
 
 # Take care of the enumeration, if DSA is enabled.
 dsa=`lspci | grep -c 0b25`
 # set first,step counters to correctly enumerate iax devices based on
 # whether running on guest or host with or without dsa
 first=0
 step=1
 [[ $dsa -gt 0 && -d /sys/bus/dsa/devices/dsa0 ]] && first=1 && step=2
 echo "first index: ${first}, step: ${step}"
 
 
 #
 # Switch to software compressors and disable IAAs to have a clean start
 #
 COMPRESSOR=/sys/module/zswap/parameters/compressor
 last_comp=`cat ${COMPRESSOR}`
 echo lzo > ${COMPRESSOR}
 
 echo "Disable IAA devices before configuring"
 
 for ((i = ${first}; i < ${step} * ${num_iaa}; i += ${step})); do
     for ((j = 0; j < ${iaa_wqs}; j += 1)); do
         cmd="accel-config disable-wq iax${i}/wq${i}.${j} >& /dev/null"
         [[ $verbose == 1 ]] && echo $cmd; eval $cmd
     done
     cmd="accel-config disable-device iax${i} >& /dev/null"
     [[ $verbose == 1 ]] && echo $cmd; eval $cmd
 done
 
 rmmod iaa_crypto
 modprobe iaa_crypto
 
 # apply crypto parameters
 echo $verify_compress > ${VERIFY_COMPRESS_PATH} || handle_error "did not change verify_compress"
 # Note: This is a temporary solution during the kernel transition.
 if [ -f /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa ];then
     echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa || handle_error "did not set g_comp_wqs_per_iaa"
 elif [ -f /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa ];then
     echo 1 > /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa || handle_error "did not set g_wqs_per_iaa"
 fi
 if [ -f /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq ];then
     echo 1 > /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq || handle_error "did not set g_consec_descs_per_gwq"
 fi
 echo ${iaa_crypto_mode} > /sys/bus/dsa/drivers/crypto/sync_mode || handle_error "could not set sync_mode"
 
 
 
 echo "Configuring ${device_num_per_socket} device(s) out of $num_iaa_per_socket per socket"
 if [ "${device_num_per_socket}" -le "${num_iaa_per_socket}" ]; then
     echo "Configuring all devices"
     start=${first}
     end=$(( ${step} * ${device_num_per_socket} ))
 else
    echo "ERROR: Not enough devices"
    exit
 fi
 
 
 #
 # enable all iax devices and wqs
 #
 for (( socket = 0; socket < ${sockets}; socket += 1 )); do
 for ((i = ${start}; i < ${end}; i += ${step})); do
 
     echo "Configuring iaa$i on socket ${socket}"
 
     for ((j = 0; j < ${iaa_engines}; j += 1)); do
         cmd="accel-config config-engine iax${i}/engine${i}.${j} --group-id=0"
         [[ $verbose == 1 ]] && echo $cmd; eval $cmd
     done
 
     # Config  WQs
     for ((j = 0; j < ${iaa_wqs}; j += 1)); do
         # Config WQ: group 0, size=${wq_size}, priority=10, mode=${mode}, type=${wq_type}, name=iaa_crypto${i}${j}, driver=crypto
         cmd="accel-config config-wq iax${i}/wq${i}.${j} -g 0 -s ${wq_size} -p 10 -m ${mode} -y ${wq_type} -n iaa_crypto${i}${j} -d crypto"
         [[ $verbose == 1 ]] && echo $cmd; eval $cmd
      done
 
     # Enable Device and WQs
     cmd="accel-config enable-device iax${i}"
     [[ $verbose == 1 ]] && echo $cmd; eval $cmd
 
     for ((j = 0; j < ${iaa_wqs}; j += 1)); do
         cmd="accel-config enable-wq iax${i}/wq${i}.${j}"
         [[ $verbose == 1 ]] && echo $cmd; eval $cmd
      done
 
 done
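     # Advance start/end to the next socket's range of device indices.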
     start=$(( start + ${step} * ${num_iaa_per_socket} ))
     end=$(( start + (${step} * ${device_num_per_socket}) ))
 done
 
 # Restore the last compressor
 echo "$last_comp" > ${COMPRESSOR}
 
 # Check if the configuration is correct
 echo "Configured IAA devices:"
 accel-config list | grep iax
 
 #---------------------------------<cut here>----------------------------------
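
 After the WQs are enabled, zswap can be switched over to the IAA compressor.
 As an illustration (not part of the script above, and assuming the
 "deflate-iaa" algorithm registered successfully when iaa_crypto was loaded):

   # Select deflate-iaa as the zswap compressor and confirm the setting.
   echo deflate-iaa > /sys/module/zswap/parameters/compressor
   cat /sys/module/zswap/parameters/compressor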


I would greatly appreciate code review comments for the iaa_crypto driver
and mm patches included in this series!

Thanks,
Kanchana



Kanchana P Sridhar (22):
  crypto: iaa - Reorganize the iaa_crypto driver code.
  crypto: iaa - New architecture for IAA device WQ comp/decomp usage &
    core mapping.
  crypto: iaa - Simplify, consistency of function parameters, minor
    stats bug fix.
  crypto: iaa - Descriptor allocation timeouts with mitigations.
  crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting.
  crypto: iaa - Simplify the code flow in iaa_compress() and
    iaa_decompress().
  crypto: iaa - Refactor hardware descriptor setup into separate
    procedures.
  crypto: iaa - Simplified, efficient job submissions for non-irq mode.
  crypto: iaa - Deprecate exporting add/remove IAA compression modes.
  crypto: iaa - Expect a single scatterlist for a [de]compress request's
    src/dst.
  crypto: iaa - Rearchitect iaa_crypto to have clean interfaces with
    crypto_acomp
  crypto: acomp - Define a unit_size in struct acomp_req to enable
    batching.
  crypto: iaa - IAA Batching for parallel compressions/decompressions.
  crypto: iaa - Enable async mode and make it the default.
  crypto: iaa - Disable iaa_verify_compress by default.
  crypto: iaa - Submit the two largest source buffers first in
    decompress batching.
  crypto: iaa - Add deflate-iaa-dynamic compression mode.
  crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's
    batch-size.
  mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to
    deletion.
  mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx
    resources.
  mm: zswap: zswap_store() will process a large folio in batches.
  mm: zswap: Batched zswap_compress() with compress batching of large
    folios.

 .../driver-api/crypto/iaa/iaa-crypto.rst      |  168 +-
 crypto/acompress.c                            |   14 +
 crypto/testmgr.c                              |   10 +
 crypto/testmgr.h                              |   74 +
 drivers/crypto/intel/iaa/Makefile             |    4 +-
 drivers/crypto/intel/iaa/iaa_crypto.h         |   87 +-
 .../intel/iaa/iaa_crypto_comp_dynamic.c       |   22 +
 drivers/crypto/intel/iaa/iaa_crypto_main.c    | 2836 ++++++++++++-----
 drivers/crypto/intel/iaa/iaa_crypto_stats.c   |    8 +
 drivers/crypto/intel/iaa/iaa_crypto_stats.h   |    2 +
 include/crypto/acompress.h                    |   48 +
 include/crypto/internal/acompress.h           |    3 +
 mm/zswap.c                                    |  700 ++--
 13 files changed, 2905 insertions(+), 1071 deletions(-)
 create mode 100644 drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c

-- 
2.27.0

