Message-ID: <PH7PR11MB8121C5F339E8BF5330496D4BC9832@PH7PR11MB8121.namprd11.prod.outlook.com>
Date: Wed, 30 Apr 2025 21:05:50 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
	<hannes@...xchg.org>, "nphamcs@...il.com" <nphamcs@...il.com>,
	"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
	"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
	<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
	"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
	"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
	"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com"
	<clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
	"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com"
	<surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
	"Feghali, Wajdi K" <wajdi.k.feghali@...el.com>, "Gopal, Vinodh"
	<vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
	<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v8 14/14] mm: zswap: Compress batching with request
 chaining in zswap_store() of large folios.


> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@...ux.dev>
> Sent: Thursday, March 6, 2025 1:17 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; nphamcs@...il.com; chengming.zhou@...ux.dev;
> usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org; linux-
> crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> <kristen.c.accardi@...el.com>; Feghali, Wajdi K <wajdi.k.feghali@...el.com>;
> Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v8 14/14] mm: zswap: Compress batching with request
> chaining in zswap_store() of large folios.
> 
> On Mon, Mar 03, 2025 at 12:47:24AM -0800, Kanchana P Sridhar wrote:
> > This patch introduces three new procedures:
> >
> >  zswap_store_folio()
> >  zswap_store_pages()
> >  zswap_batch_compress()
> >
> > zswap_store_folio() stores a folio in chunks of "batch_size" pages. If the
> > compressor supports batching and has multiple acomp_ctx->nr_reqs, the
> > batch_size will be the acomp_ctx->nr_reqs. If the compressor does not
> > support batching, the batch_size will be ZSWAP_MAX_BATCH_SIZE. The
> > compressor having multiple acomp_ctx->nr_reqs is passed as a bool
> > "batching" parameter to zswap_store_pages() and
> zswap_batch_compress().
> > This refactorization allows us to move the loop over the folio's pages from
> > zswap_store() to zswap_store_folio(), and also enables batching.
> >
> > zswap_store_pages() implements all the computes done earlier in
> > zswap_store_page() for a single-page, for multiple pages in a folio, namely
> > a "batch". zswap_store_pages() starts by allocating all zswap entries
> > required to store the batch. Next, it calls zswap_batch_compress() to
> > compress the batch. Finally, it adds the batch's zswap entries to the
> > xarray and LRU, charges zswap memory and increments zswap stats.
> >
> > The error handling and cleanup required for all failure scenarios that can
> > occur while storing a batch in zswap are consolidated to a single
> > "store_pages_failed" label in zswap_store_pages().
> >
> > And finally, this patch introduces zswap_batch_compress(), which does the
> > following:
> >   - If the compressor supports batching, sets up a request chain for
> >     compressing the batch in one shot. If Intel IAA is the zswap
> >     compressor, the request chain will be compressed in parallel in
> >     hardware. If all requests in the chain are compressed without errors,
> >     the compressed buffers are then stored in zpool.
> >   - If the compressor does not support batching, each page in the batch is
> >     compressed and stored sequentially. zswap_batch_compress() replaces
> >     zswap_compress(), thereby eliminating code duplication and facilitating
> >     maintainability of the code with the introduction of batching.
> >
> > The call to the crypto layer is exactly the same in both cases: when batch
> > compressing a request chain or when sequentially compressing each page in
> > the batch.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
> > ---
> 
> High-level comment: It may be better to break this down into two
> patches, one reworking the code to process folios in batches, and one to
> add support for compression batching and request chaining. This may make
> it more digestible.

Hi Yosry, 

Thanks a lot for all your suggestions!

I have addressed this in v9.

> 
> >  mm/zswap.c | 396 ++++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 270 insertions(+), 126 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index fae59d6d5147..135d5792ce50 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1051,74 +1051,141 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
> >  	mutex_unlock(&acomp_ctx->mutex);
> >  }
> >
> > -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > -			   struct zswap_pool *pool)
> > +/*
> > + * Unified code paths for compressors that do and do not support
> > + * batching. This procedure will compress multiple @nr_pages in @folio,
> > + * starting from @index.
> > + * If @batching is set to true, it will create a request chain for
> > + * compression batching. It is assumed that the caller has verified
> > + * that the acomp_ctx->nr_reqs is at least @nr_pages.
> > + * If @batching is set to false, it will process each page sequentially.
> > + * In both cases, if all compressions were successful, it will proceed
> > + * to store the compressed buffers in zpool.
> > + */
> > +static bool zswap_batch_compress(struct folio *folio,
> > +				 long index,
> > +				 unsigned int nr_pages,
> > +				 struct zswap_entry *entries[],
> > +				 struct zswap_pool *pool,
> > +				 struct crypto_acomp_ctx *acomp_ctx,
> > +				 bool batching)
> 
> If we have a single compress function let's keep it called
> zswap_compress() please.

Yes, I have addressed this in v9.
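
For reference, the single compress routine keeps the zswap_compress() name and
takes the page range plus the entries to fill. A rough sketch of the shape
(illustrative only; the exact v9 parameter list may differ, and the per-CPU
acomp_ctx is looked up inside, per the locking discussion further down):

    /* Sketch: one compress path for both batching and sequential cases. */
    static bool zswap_compress(struct folio *folio, long start,
                               unsigned int nr_pages,
                               struct zswap_entry *entries[],
                               struct zswap_pool *pool);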

> 
> >  {
> > -	struct crypto_acomp_ctx *acomp_ctx;
> > -	struct scatterlist input, output;
> > -	int comp_ret = 0, alloc_ret = 0;
> > -	unsigned int dlen = PAGE_SIZE;
> > -	unsigned long handle;
> > -	struct zpool *zpool;
> > -	char *buf;
> > +	struct scatterlist inputs[ZSWAP_MAX_BATCH_SIZE];
> > +	struct scatterlist outputs[ZSWAP_MAX_BATCH_SIZE];
> > +	struct zpool *zpool = pool->zpool;
> > +	int acomp_idx = 0, nr_to_store = 1;
> > +	unsigned int i, j;
> > +	int err = 0;
> >  	gfp_t gfp;
> > -	u8 *dst;
> >
> > -	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> > -	dst = acomp_ctx->buffers[0];
> > -	sg_init_table(&input, 1);
> > -	sg_set_page(&input, page, PAGE_SIZE, 0);
> > +	lockdep_assert_held(&acomp_ctx->mutex);
> >
> > -	/*
> > -	 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> > -	 * and hardware-accelerators may won't check the dst buffer size, so
> > -	 * giving the dst buffer with enough length to avoid buffer overflow.
> > -	 */
> > -	sg_init_one(&output, dst, PAGE_SIZE * 2);
> > -	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
> > -
> > -	/*
> > -	 * it maybe looks a little bit silly that we send an asynchronous request,
> > -	 * then wait for its completion synchronously. This makes the process look
> > -	 * synchronous in fact.
> > -	 * Theoretically, acomp supports users send multiple acomp requests in one
> > -	 * acomp instance, then get those requests done simultaneously. but in this
> > -	 * case, zswap actually does store and load page by page, there is no
> > -	 * existing method to send the second page before the first page is done
> > -	 * in one thread doing zwap.
> > -	 * but in different threads running on different cpu, we have different
> > -	 * acomp instance, so multiple threads can do (de)compression in parallel.
> > -	 */
> > -	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> > -	dlen = acomp_ctx->reqs[0]->dlen;
> > -	if (comp_ret)
> > -		goto unlock;
> > -
> > -	zpool = pool->zpool;
> >  	gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> >  	if (zpool_malloc_support_movable(zpool))
> >  		gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> > -	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle);
> > -	if (alloc_ret)
> > -		goto unlock;
> >
> > -	buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> > -	memcpy(buf, dst, dlen);
> > -	zpool_unmap_handle(zpool, handle);
> > +	for (i = 0; i < nr_pages; ++i) {
> > +		struct page *page = folio_page(folio, index + i);
> >
> > -	entry->handle = handle;
> > -	entry->length = dlen;
> > +		sg_init_table(&inputs[acomp_idx], 1);
> > +		sg_set_page(&inputs[acomp_idx], page, PAGE_SIZE, 0);
> >
> > -unlock:
> > -	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > -		zswap_reject_compress_poor++;
> > -	else if (comp_ret)
> > -		zswap_reject_compress_fail++;
> > -	else if (alloc_ret)
> > -		zswap_reject_alloc_fail++;
> > +		/*
> > +		 * Each dst buffer should be of size (PAGE_SIZE * 2).
> > +		 * Reflect same in sg_list.
> 
> The existing comment about overcompression was useful, please move it
> as-is.

Done.
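
The comment is carried over verbatim, right above the per-request dst setup.
Using the v8 names from this patch, that spot looks roughly like (placement in
v9 may differ slightly):

    /*
     * We need PAGE_SIZE * 2 here since there maybe over-compression case,
     * and hardware-accelerators may won't check the dst buffer size, so
     * giving the dst buffer with enough length to avoid buffer overflow.
     */
    sg_init_one(&outputs[acomp_idx], acomp_ctx->buffers[acomp_idx], PAGE_SIZE * 2);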

> 
> > +		 */
> > +		sg_init_one(&outputs[acomp_idx], acomp_ctx->buffers[acomp_idx], PAGE_SIZE * 2);
> > +		acomp_request_set_params(acomp_ctx->reqs[acomp_idx], &inputs[acomp_idx],
> > +					 &outputs[acomp_idx], PAGE_SIZE, PAGE_SIZE);
> > +
> > +		if (batching) {
> > +			/* Add the acomp request to the chain. */
> > +			if (likely(i))
> > +				acomp_request_chain(acomp_ctx->reqs[acomp_idx], acomp_ctx->reqs[0]);
> > +			else
> > +				acomp_reqchain_init(acomp_ctx->reqs[0], 0, crypto_req_done,
> > +						    &acomp_ctx->wait);
> > +
> > +			if (i == (nr_pages - 1)) {
> > +				/* Process the request chain. */
> > +				err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> > +
> > +				/*
> > +				 * Get the individual compress errors from request chaining.
> > +				 */
> > +				for (j = 0; j < nr_pages; ++j) {
> > +					if (unlikely(acomp_request_err(acomp_ctx->reqs[j]))) {
> > +						err = -EINVAL;
> > +						if (acomp_request_err(acomp_ctx->reqs[j]) == -ENOSPC)
> > +							zswap_reject_compress_poor++;
> > +						else
> > +							zswap_reject_compress_fail++;
> > +					}
> > +				}
> > +				/*
> > +				 * Request chaining cleanup:
> > +				 *
> > +				 * - Clear the CRYPTO_TFM_REQ_CHAIN bit on acomp_ctx->reqs[0].
> > +				 * - Reset the acomp_ctx->wait to notify acomp_ctx->reqs[0].
> > +				 */
> > +				acomp_reqchain_clear(acomp_ctx->reqs[0], &acomp_ctx->wait);
> > +				if (unlikely(err))
> > +					return false;
> > +				j = 0;
> > +				nr_to_store = nr_pages;
> > +				goto store_zpool;
> > +			}
> > +
> > +			++acomp_idx;
> > +			continue;
> 
> The code structure here is really hard to read:
> 
> (1) Why do we need 'batching' to begin with? Can't we use the same
> request chaining APIs even for the '!batching' case? Are there any
> regressions for doing so?
> 
> (2) Instead of the control flow having 'continue', 'goto', and nested
> loops, can we just have separate loops?
> 
> IOW can we do something like:
> 
> 	for (..) {
> 		if (i == 0)
> 			acomp_reqchain_init(..);
> 		else
> 			acomp_request_chain(..);
> 	}
> 
> 	crypto_wait_req(..);
> 	/* collect errors */
> 	acomp_reqchain_clear(..);
> 
> 	for (..) {
> 		/* store */
> 	}
> 

Thanks for these suggestions! Hopefully the v9 version is much simpler.
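
For reference, here is an outline of the restructured batching path with
separate loops and no continue/goto. It reuses the request-chaining calls
exactly as they appear in the v8 diff above (acomp_reqchain_init,
acomp_request_chain, acomp_request_err, acomp_reqchain_clear). Treat it as an
illustration of the control flow, not the final v9 code; the sequential
(non-batching) path is omitted:

    /* Pass 1: set up sg lists / params and build the request chain. */
    for (i = 0; i < nr_pages; ++i) {
        /* sg_set_page()/sg_init_one()/acomp_request_set_params() as in v8 */
        if (i)
            acomp_request_chain(acomp_ctx->reqs[i], acomp_ctx->reqs[0]);
        else
            acomp_reqchain_init(acomp_ctx->reqs[0], 0, crypto_req_done,
                                &acomp_ctx->wait);
    }

    /* One wait for the whole chain. */
    err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
                          &acomp_ctx->wait);

    /* Collect per-request errors before tearing the chain down. */
    for (i = 0; i < nr_pages; ++i) {
        if (acomp_request_err(acomp_ctx->reqs[i])) {
            err = -EINVAL;
            if (acomp_request_err(acomp_ctx->reqs[i]) == -ENOSPC)
                zswap_reject_compress_poor++;
            else
                zswap_reject_compress_fail++;
        }
    }

    acomp_reqchain_clear(acomp_ctx->reqs[0], &acomp_ctx->wait);
    if (err)
        return false;

    /* Pass 2: write the compressed buffers to zpool. */
    for (i = 0; i < nr_pages; ++i) {
        /* zpool_malloc() + memcpy() of acomp_ctx->buffers[i], as in v8 */
    }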

> > +		} else {
> > +			err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> > +
> > +			if (unlikely(err)) {
> > +				if (err == -ENOSPC)
> > +					zswap_reject_compress_poor++;
> > +				else
> > +					zswap_reject_compress_fail++;
> > +				return false;
> > +			}
> > +			j = i;
> > +			nr_to_store = 1;
> > +		}
> >
> > -	acomp_ctx_put_unlock(acomp_ctx);
> > -	return comp_ret == 0 && alloc_ret == 0;
> > +store_zpool:
> > +		/*
> > +		 * All batch pages were successfully compressed.
> > +		 * Store the pages in zpool.
> > +		 */
> > +		acomp_idx = -1;
> > +		while (nr_to_store--) {
> > +			unsigned long handle;
> > +			char *buf;
> > +
> > +			++acomp_idx;
> > +			prefetchw(entries[j]);
> > +			err = zpool_malloc(zpool, acomp_ctx->reqs[acomp_idx]->dlen, gfp, &handle);
> > +
> > +			if (unlikely(err)) {
> > +				if (err == -ENOSPC)
> > +					zswap_reject_compress_poor++;
> > +				else
> > +					zswap_reject_alloc_fail++;
> > +
> > +				return false;
> > +			}
> > +
> > +			buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> > +			memcpy(buf, acomp_ctx->buffers[acomp_idx], acomp_ctx->reqs[acomp_idx]->dlen);
> > +			zpool_unmap_handle(zpool, handle);
> > +
> > +			entries[j]->handle = handle;
> > +			entries[j]->length = acomp_ctx->reqs[acomp_idx]->dlen;
> > +			++j;
> > +		}
> > +	}
> > +
> > +	return true;
> >  }
> >
> >  static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> > @@ -1581,84 +1648,165 @@ static void shrink_worker(struct work_struct *w)
> >  * main API
> >  **********************************/
> >
> > -static bool zswap_store_page(struct page *page,
> > -			     struct obj_cgroup *objcg,
> > -			     struct zswap_pool *pool)
> > +/*
> > + * Store multiple pages in @folio, starting from the page at index @si up to
> > + * and including the page at index @ei.
> 
> 'start' and 'end'?

Incorporated this.

> 
> > + *
> > + * The error handling from all failure points is consolidated to the
> > + * "store_pages_failed" label, based on the initialization of the zswap entries'
> > + * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
> > + * entry's handle is subsequently modified only upon a successful zpool_malloc()
> > + * after the page is compressed.
> 
> This comment is not useful here. Instead comment on specific parts of
> the function where the intention of the code is not clear.

Done.
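
To keep the intent visible where it matters, the allocation loop and the
cleanup label essentially pair up as below. This is a sketch based on the v8
code in this patch, with the cleanup start index renamed to error_idx as
discussed further down; the actual v9 comments may be worded differently:

    for (i = 0; i < nr_pages; ++i) {
        entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
        if (!entries[i]) {
            zswap_reject_kmemcache_fail++;
            nr_pages = i;
            goto store_pages_failed;
        }
        /* Poisoned handle marks "not yet written to zpool". */
        entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
    }

    /* ... compress and store; the xarray insertion can also fail ... */

store_pages_failed:
    for (i = error_idx; i < nr_pages; ++i) {
        /* Free zpool space only for entries that were actually stored. */
        if (!IS_ERR_VALUE(entries[i]->handle))
            zpool_free(pool->zpool, entries[i]->handle);
        zswap_entry_cache_free(entries[i]);
    }
    return false;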

> 
> > + */
> > +static bool zswap_store_pages(struct folio *folio,
> > +			      long si,
> > +			      long ei,
> > +			      struct obj_cgroup *objcg,
> > +			      struct zswap_pool *pool,
> > +			      struct crypto_acomp_ctx *acomp_ctx,
> > +			      bool batching)
> >  {
> > -	swp_entry_t page_swpentry = page_swap_entry(page);
> > -	struct zswap_entry *entry, *old;
> > +	struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
> > +	int node_id = folio_nid(folio);
> > +	u8 i, from_i = 0, nr_pages = ei - si + 1;
> >
> > -	/* allocate entry */
> > -	entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> > -	if (!entry) {
> > -		zswap_reject_kmemcache_fail++;
> > -		return false;
> > +	for (i = 0; i < nr_pages; ++i) {
> > +		entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> > +
> > +		if (unlikely(!entries[i])) {
> > +			zswap_reject_kmemcache_fail++;
> > +			nr_pages = i;
> > +			goto store_pages_failed;
> > +		}
> > +
> > +		entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> >  	}
> >
> > -	if (!zswap_compress(page, entry, pool))
> > -		goto compress_failed;
> > +	if (!zswap_batch_compress(folio, si, nr_pages, entries, pool, acomp_ctx, batching))
> > +		goto store_pages_failed;
> >
> > -	old = xa_store(swap_zswap_tree(page_swpentry),
> > -		       swp_offset(page_swpentry),
> > -		       entry, GFP_KERNEL);
> > -	if (xa_is_err(old)) {
> > -		int err = xa_err(old);
> > +	for (i = 0; i < nr_pages; ++i) {
> > +		swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, si + i));
> > +		struct zswap_entry *old, *entry = entries[i];
> >
> > -		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > -		zswap_reject_alloc_fail++;
> > -		goto store_failed;
> > -	}
> > +		old = xa_store(swap_zswap_tree(page_swpentry),
> > +			       swp_offset(page_swpentry),
> > +			       entry, GFP_KERNEL);
> > +		if (unlikely(xa_is_err(old))) {
> > +			int err = xa_err(old);
> >
> > -	/*
> > -	 * We may have had an existing entry that became stale when
> > -	 * the folio was redirtied and now the new version is being
> > -	 * swapped out. Get rid of the old.
> > -	 */
> > -	if (old)
> > -		zswap_entry_free(old);
> > +			WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > +			zswap_reject_alloc_fail++;
> > +			from_i = i;
> 
> 'error_idx' or 'store_fail_idx'?

Done.

> 
> > +			goto store_pages_failed;
> > +		}
> >
> > -	/*
> > -	 * The entry is successfully compressed and stored in the tree, there is
> > -	 * no further possibility of failure. Grab refs to the pool and objcg,
> > -	 * charge zswap memory, and increment zswap_stored_pages.
> > -	 * The opposite actions will be performed by zswap_entry_free()
> > -	 * when the entry is removed from the tree.
> > -	 */
> > -	zswap_pool_get(pool);
> > -	if (objcg) {
> > -		obj_cgroup_get(objcg);
> > -		obj_cgroup_charge_zswap(objcg, entry->length);
> > -	}
> > -	atomic_long_inc(&zswap_stored_pages);
> > +		/*
> > +		 * We may have had an existing entry that became stale when
> > +		 * the folio was redirtied and now the new version is being
> > +		 * swapped out. Get rid of the old.
> > +		 */
> > +		if (unlikely(old))
> > +			zswap_entry_free(old);
> >
> > -	/*
> > -	 * We finish initializing the entry while it's already in xarray.
> > -	 * This is safe because:
> > -	 *
> > -	 * 1. Concurrent stores and invalidations are excluded by folio lock.
> > -	 *
> > -	 * 2. Writeback is excluded by the entry not being on the LRU yet.
> > -	 *    The publishing order matters to prevent writeback from seeing
> > -	 *    an incoherent entry.
> > -	 */
> > -	entry->pool = pool;
> > -	entry->swpentry = page_swpentry;
> > -	entry->objcg = objcg;
> > -	entry->referenced = true;
> > -	if (entry->length) {
> > -		INIT_LIST_HEAD(&entry->lru);
> > -		zswap_lru_add(&zswap_list_lru, entry);
> > +		/*
> > +		 * The entry is successfully compressed and stored in the tree, there is
> > +		 * no further possibility of failure. Grab refs to the pool and objcg,
> > +		 * charge zswap memory, and increment zswap_stored_pages.
> > +		 * The opposite actions will be performed by zswap_entry_free()
> > +		 * when the entry is removed from the tree.
> > +		 */
> > +		zswap_pool_get(pool);
> > +		if (objcg) {
> > +			obj_cgroup_get(objcg);
> > +			obj_cgroup_charge_zswap(objcg, entry->length);
> > +		}
> > +		atomic_long_inc(&zswap_stored_pages);
> > +
> > +		/*
> > +		 * We finish initializing the entry while it's already in xarray.
> > +		 * This is safe because:
> > +		 *
> > +		 * 1. Concurrent stores and invalidations are excluded by folio lock.
> > +		 *
> > +		 * 2. Writeback is excluded by the entry not being on the LRU yet.
> > +		 *    The publishing order matters to prevent writeback from seeing
> > +		 *    an incoherent entry.
> > +		 */
> > +		entry->pool = pool;
> > +		entry->swpentry = page_swpentry;
> > +		entry->objcg = objcg;
> > +		entry->referenced = true;
> > +		if (likely(entry->length)) {
> > +			INIT_LIST_HEAD(&entry->lru);
> > +			zswap_lru_add(&zswap_list_lru, entry);
> > +		}
> >  	}
> >
> >  	return true;
> >
> > -store_failed:
> > -	zpool_free(pool->zpool, entry->handle);
> > -compress_failed:
> > -	zswap_entry_cache_free(entry);
> > +store_pages_failed:
> > +	for (i = from_i; i < nr_pages; ++i) {
> > +		if (!IS_ERR_VALUE(entries[i]->handle))
> > +			zpool_free(pool->zpool, entries[i]->handle);
> > +
> > +		zswap_entry_cache_free(entries[i]);
> > +	}
> > +
> >  	return false;
> >  }
> >
> > +/*
> > + * Store all pages in a folio by calling zswap_batch_compress().
> > + * If the compressor supports batching, i.e., has multiple acomp requests,
> > + * the folio will be compressed in batches of "acomp_ctx->nr_reqs" using
> > + * request chaining.
> > + * If the compressor has only one acomp request, the folio will be compressed
> > + * in batches of ZSWAP_MAX_BATCH_SIZE pages, where each page in the batch is
> > + * compressed sequentially.
> > + */
> > +static bool zswap_store_folio(struct folio *folio,
> > +			      struct obj_cgroup *objcg,
> > +			      struct zswap_pool *pool)
> > +{
> > +	long nr_pages = folio_nr_pages(folio);
> > +	struct crypto_acomp_ctx *acomp_ctx;
> > +	unsigned int batch_size;
> > +	bool ret = true, batching;
> > +	long si, ei;
> > +
> > +	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> > +
> > +	batching = ((acomp_ctx->nr_reqs > 1) && (nr_pages > 1));
> > +
> > +	batch_size = batching ? acomp_ctx->nr_reqs : ZSWAP_MAX_BATCH_SIZE;
> 
> acomp_ctx->nr_reqs is a function of the compressor. Can we store it in
> the pool to avoid having to hold the mutex here just for this?
> 
> This would also save a tiny bit of memory as right now we have it
> replicated for all CPUs.

Great suggestion! I have incorporated this in v9.
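
In other words, the batch size becomes a per-pool property, fixed at pool
creation, so the store path can pick a batch size without taking the per-CPU
mutex. A sketch of the idea (moving nr_reqs into struct zswap_pool as shown
here is illustrative; the exact v9 layout may differ):

    struct zswap_pool {
        /* ... existing fields ... */
        u8 nr_reqs;    /* requests per acomp_ctx; 1 if the compressor cannot batch */
    };

    /* In the store path, no acomp_ctx mutex needed just for this: */
    batch_size = (pool->nr_reqs > 1) ? pool->nr_reqs : ZSWAP_MAX_BATCH_SIZE;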

> 
> > +
> > +	if (!batching)
> > +		acomp_ctx_put_unlock(acomp_ctx);
> 
> So in the batching case we keep the mutex held throughout, otherwise we
> hold it for every batch. Why?
> 
> IIRC you mentioned that only holding it for every batch removes the zstd
> regressions, perhaps because holding the mutex for too long is a
> problem. But why not do the same for the batching case too?
> 
> Also, the current behavior is that we only hold the mutex during
> compression. Now we changed this to do it per-folio or per-batch. Does
> the current behavior cause regressions? Ideally we'd like to keep the
> mutex held as little as possible.
> 
> This is worth explaining in detail.

In v9, the mutex lock is held within zswap_compress() for the duration
of a batch. This was one of the optimizations that uniformly improved
performance for both non-batching and batching compressors.
With the latest code, other options that I tried, such as locking per folio
or locking per page, caused regressions for zstd.
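
So the locking pattern is, in outline (a sketch using the helper names from
this patch; the signature is illustrative, not necessarily the v9 one):

    static bool zswap_compress(struct folio *folio, long start,
                               unsigned int nr_pages,
                               struct zswap_entry *entries[],
                               struct zswap_pool *pool)
    {
        struct crypto_acomp_ctx *acomp_ctx;
        bool ret = true;

        /* Held only while this one batch is compressed and stored. */
        acomp_ctx = acomp_ctx_get_cpu_lock(pool);

        /*
         * ... compress the batch and write the buffers to zpool,
         *     setting ret = false on any failure ...
         */

        acomp_ctx_put_unlock(acomp_ctx);
        return ret;
    }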

> 
> > +
> > +	/* Store the folio in batches of "batch_size" pages. */
> > +	for (si = 0, ei = min(si + batch_size - 1, nr_pages - 1);
> > +	     ((si < nr_pages) && (ei < nr_pages));
> > +	     si = ei + 1, ei = min(si + batch_size - 1, nr_pages - 1)) {
> 
> This is too complicated. Can we do:
> 
> 	for (start = 0; start < nr_pages; start += batch_size) {
> 		end = min(start + batch_size - 1, nr_pages - 1);
> 		...
> 	}
> 
> Also, we subtract 1 from ei here and add 1 in zswap_store_pages(), so
> perhaps it's better to make it exclusive:
> 
> 	for (start = 0; start < nr_pages; start += batch_size) {
> 		end = min(start + batch_size, nr_pages);
> 		...
> 	}
> 

Done.
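
With the exclusive end index, the caller side reduces to roughly the following
(a sketch; per the discussion further down, this loop ends up inline in
zswap_store(), and the zswap_store_pages() parameter list is illustrative):

    for (start = 0; start < nr_pages; start += batch_size) {
        end = min(start + batch_size, nr_pages);    /* exclusive */
        if (!zswap_store_pages(folio, start, end, objcg, pool))
            goto put_pool;
    }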

> > +
> > +		if (!batching)
> > +			acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> > +
> > +		if (!zswap_store_pages(folio, si, ei, objcg, pool, acomp_ctx, batching)) {
> > +			ret = false;
> > +			break;
> > +		}
> > +
> > +		if (!batching)
> > +			acomp_ctx_put_unlock(acomp_ctx);
> > +	}
> > +
> > +	if (batching || !ret)
> > +		acomp_ctx_put_unlock(acomp_ctx);
> > +
> > +	return ret;
> > +}
> > +
> >  bool zswap_store(struct folio *folio)
> >  {
> >  	long nr_pages = folio_nr_pages(folio);
> > @@ -1667,7 +1815,6 @@ bool zswap_store(struct folio *folio)
> >  	struct mem_cgroup *memcg = NULL;
> >  	struct zswap_pool *pool;
> >  	bool ret = false;
> > -	long index;
> 
> These seems like an unrelated change.

I now have start/end, and reuse start in the error handling.

> 
> >
> >  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > @@ -1701,12 +1848,8 @@ bool zswap_store(struct folio *folio)
> >  		mem_cgroup_put(memcg);
> >  	}
> >
> > -	for (index = 0; index < nr_pages; ++index) {
> > -		struct page *page = folio_page(folio, index);
> > -
> > -		if (!zswap_store_page(page, objcg, pool))
> > -			goto put_pool;
> > -	}
> > +	if (!zswap_store_folio(folio, objcg, pool))
> > +		goto put_pool;
> 
> If we move the locking out of zswap_store_folio() and simplify the loop
> there, I believe it will be simple enough for us to inline it here.

Yes indeed. Incorporated in v9.

> 
> >
> >  	if (objcg)
> >  		count_objcg_events(objcg, ZSWPOUT, nr_pages);
> > @@ -1733,6 +1876,7 @@ bool zswap_store(struct folio *folio)
> >  		pgoff_t offset = swp_offset(swp);
> >  		struct zswap_entry *entry;
> >  		struct xarray *tree;
> > +		long index;
> >
> >  		for (index = 0; index < nr_pages; ++index) {
> >  			tree = swap_zswap_tree(swp_entry(type, offset + index));
> > --
> > 2.27.0
> >
