[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <SJ2PR11MB84727C029B980E5E06F7291EC9CAA@SJ2PR11MB8472.namprd11.prod.outlook.com>
Date: Fri, 14 Nov 2025 19:23:42 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>, SeongJae Park <sj@...nel.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "nphamcs@...il.com" <nphamcs@...il.com>,
"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"senozhatsky@...omium.org" <senozhatsky@...omium.org>, "sj@...nel.org"
<sj@...nel.org>, "kasong@...cent.com" <kasong@...cent.com>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com"
<clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com"
<surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
"Gomes, Vinicius" <vinicius.gomes@...el.com>, "Feghali, Wajdi K"
<wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>,
"Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
compress batching of large folios.
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@...ux.dev>
> Sent: Friday, November 14, 2025 7:38 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>; SeongJae Park
> <sj@...nel.org>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; nphamcs@...il.com; chengming.zhou@...ux.dev;
> usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> senozhatsky@...omium.org; sj@...nel.org; kasong@...cent.com; linux-
> crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> <kristen.c.accardi@...el.com>; Gomes, Vinicius <vinicius.gomes@...el.com>;
> Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
>
> On Fri, Nov 14, 2025 at 06:43:21AM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@...ux.dev>
> > > Sent: Thursday, November 13, 2025 9:52 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> > > Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> > > hannes@...xchg.org; nphamcs@...il.com;
> chengming.zhou@...ux.dev;
> > > usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> > > ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> > > senozhatsky@...omium.org; sj@...nel.org; kasong@...cent.com; linux-
> > > crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> > > davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> > > ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> > > <kristen.c.accardi@...el.com>; Gomes, Vinicius
> <vinicius.gomes@...el.com>;
> > > Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> > > <vinodh.gopal@...el.com>
> > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress()
> with
> > > compress batching of large folios.
> > >
> > > On Thu, Nov 13, 2025 at 11:55:10PM +0000, Sridhar, Kanchana P wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Yosry Ahmed <yosry.ahmed@...ux.dev>
> > > > > Sent: Thursday, November 13, 2025 1:35 PM
> > > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> > > > > Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> > > > > hannes@...xchg.org; nphamcs@...il.com;
> > > chengming.zhou@...ux.dev;
> > > > > usamaarif642@...il.com; ryan.roberts@....com;
> 21cnbao@...il.com;
> > > > > ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> > > > > senozhatsky@...omium.org; sj@...nel.org; kasong@...cent.com;
> linux-
> > > > > crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> > > > > davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> > > > > ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> > > > > <kristen.c.accardi@...el.com>; Gomes, Vinicius
> > > <vinicius.gomes@...el.com>;
> > > > > Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> > > > > <vinodh.gopal@...el.com>
> > > > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched
> zswap_compress()
> > > with
> > > > > compress batching of large folios.
> > > > >
> > > [..]
> > > > > > + /*
> > > > > > + * If a page cannot be compressed into a size smaller
> than
> > > > > > + * PAGE_SIZE, save the content as is without a
> compression,
> > > > > to
> > > > > > + * keep the LRU order of writebacks. If writeback is
> disabled,
> > > > > > + * reject the page since it only adds metadata
> overhead.
> > > > > > + * swap_writeout() will put the page back to the
> active LRU
> > > > > list
> > > > > > + * in the case.
> > > > > > + *
> > > > > > + * It is assumed that any compressor that sets the
> output
> > > > > length
> > > > > > + * to 0 or a value >= PAGE_SIZE will also return a
> negative
> > > > > > + * error status in @err; i.e, will not return a successful
> > > > > > + * compression status in @err in this case.
> > > > > > + */
> > > > >
> > > > > Ugh, checking the compression error and checking the compression
> length
> > > > > are now in separate places so we need to check if writeback is
> disabled
> > > > > in separate places and store the page as-is. It's ugly, and I think the
> > > > > current code is not correct.
> > > >
> > > > The code is 100% correct. You need to spend more time understanding
> > > > the code. I have stated my assumption above in the comments to
> > > > help in understanding the "why".
> > > >
> > > > From a maintainer, I would expect more responsible statements than
> > > > this. A flippant remark made without understanding the code (and,
> > > > disparaging the comments intended to help you do this), can impact
> > > > someone's career. I am held accountable in my job based on your
> > > > comments.
> > > >
> > > > That said, I have worked tirelessly and innovated to make the code
> > > > compliant with Herbert's suggestions (which btw have enabled an
> > > > elegant batching implementation and code commonality for IAA and
> > > > software compressors), validated it thoroughly for IAA and ZSTD to
> > > > ensure that both demonstrate performance improvements, which
> > > > are crucial for memory savings. I am proud of this work.
> > > >
> > > >
> > > > >
> > > > > > + if (err && !wb_enabled)
> > > > > > + goto compress_error;
> > > > > > +
> > > > > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg,
> nr_comps, k) {
> > > > > > + j = k + i;
> > > > >
> > > > > Please use meaningful iterator names rather than i, j, and k and the
> huge
> > > > > comment explaining what they are.
> > > >
> > > > I happen to have a different view: having longer iterator names firstly
> makes
> > > > code seem "verbose" and detracts from readability, not to mention
> > > exceeding the
> > > > 80-character line limit. The comments are essential for code
> maintainability
> > > > and avoid out-of-bounds errors when the next zswap developer wants
> to
> > > > optimize the code.
> > > >
> > > > One drawback of i/j/k iterators is mis-typing errors which cannot be
> caught
> > > > at compile time. Let me think some more about how to strike a good
> > > balance.
> > > >
> > > > >
> > > > > > + dst = acomp_ctx->buffers[k];
> > > > > > + dlen = sg->length | *errp;
> > > > >
> > > > > Why are we doing this?
> > > > >
> > > > > > +
> > > > > > + if (dlen < 0) {
> > > > >
> > > > > We should do the incompressible page handling also if dlen is
> PAGE_SIZE,
> > > > > or if the compression failed (I guess that's the intention of bit OR'ing
> > > > > with *errp?)
> > > >
> > > > Yes, indeed: that's the intention of bit OR'ing with *errp.
> > >
> > > ..and you never really answered my question. In the exising code we
> > > store the page as incompressible if writeback is enabled AND
> > > crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> > > if crypto_wait_req() fails and writeback is disabled, but what about the
> > > rest?
> >
> > Let me explain this some more. The new code only relies on the assumption
> > that if dlen is zero or >= PAGE_SIZE, the compressor will not return a 0
> > ("successful status"). In other words, the compressor will return an error
> status
> > in this case, which is expected to be a negative error code.
>
> I am not sure if all compressors do that, especially for the case where
> dlen >= PAGE_SIZE. The existing code does not assume that there will be
> an error code in these cases.
>
> For the dlen == 0 case, the check was introduced recently by commit
> dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression failed page
> as-is"). Looking through the history it seems like it was introduced in
> v4 of that patch but I don't see the reasoning.
The existing code did not check for dlen == 0 and dlen >= PAGE_SIZE
prior to commit dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression
failed page as-is") either. We need SeongJae or Herbert to clarify whether
this check is needed, or if it is sufficient to rely on comp_ret, the return from
crypto_wait_req().
>
> SeongJae, did you observe any compressors returning dlen == 0 but no
> error code? What was the reasoning behind the dlen == 0 check?
>
> >
> > Under these (hopefully valid) assumptions, the code handles the simple case
> > of an error compression return status and writeback is disabled, by the
> > "goto compress_error".
> >
> > The rest is handled by these:
> >
> > 1) First, I need to adapt the use of sg_outputs->sgl->length to represent the
> > compress length for software compressors, so I do this after
> crypto_wait_req()
> > returns:
> >
> > acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
>
> For SW compressors, why is acomp_ctx->sg_outputs->sgl->length not set?
> IIUC we are using the same API for SW and HW compressors, why is the
> output length in different places for each of them?
This is to first implement the SG lists batching interface in iaa_crypto, while
maintaining backward compatibility for SW compressors with the new API.
I believe we may want to adapt the crypto API to SW compressors
at a later point. I also believe this would be outside the scope of this patch.
It would be nice if Herbert can share his vision on this aspect.
>
> >
> > I did not want to propose any changes to crypto software compressors
> protocols.
> >
> > 2) After the check for the "if (err && !wb_enabled)" case, the new code has
> this:
> >
> > for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > j = k + i;
> > dst = acomp_ctx->buffers[k];
> > dlen = sg->length | *errp;
> >
> > if (dlen < 0) {
> > dlen = PAGE_SIZE;
> > dst = kmap_local_page(folio_page(folio, start + j));
> > }
> >
> > For batching compressors, namely, iaa_crypto, the individual output SG
> > lists sg->length follows the requirements from Herbert: each sg->length
> > is the compressed length or the error status (a negative error code).
> >
> > Then all I need to know whether to store the page as incompressible
> > is to either directly test if sg->length is negative (for batching compressors),
> > or sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
> > is negative. This is accomplished by the "dlen = sg->length | *errp;".
> >
> > I believe this maintains backward compatibility with the existing code.
> > Please let me know if you agree.
>
> For batching compressors, will 'err' be set as well, or just sg->length?
> If it's just sg->length, then we need to check again if WB is enabled
> here before storing the page uncompressed. Right?
iaa_crypto will set 'err' and set the sg->length as per the batching interface
spec from Herbert.
>
> >
> > >
> > > We don't check again if writeback is enabled before storing the page is
> > > incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> > > these cases no longer possible?
> >
> > Hope the above explanation clarifies things some more? These case
> > are possible, and as long as they return an error status, they should be
> > correctly handled by the new code.
>
> As mentioned above, I am not sure if that's correct for dlen >=
> PAGE_SIZE.
We need to get clarity on this from SeongJae/Herbert.
>
> >
> > >
> > > Also, why use errp, why not explicitly use the appropriate error code?
> > > It's also unclear to me why the error code is always zero with HW
> > > compression?
> >
> > This is because of the sg->length requirements (compressed length/error)
> > for the batching interface suggested by Herbert. Hence, I upfront define
> > err_sg to 0, and, set errp to &err_sg for batching compressors. For software
> > compressors, errp is set to &err, namely, the above check will always apply
> > the software compressor's error status to the compressed length via
> > the bit-OR to determine if the page needs to be stored uncompressed.
>
> Thanks for the clarification. I understand that the error code has
> different sources for SW and HW compressors, but I do not like using
> errp as an indirection. It makes the code unclear. I would rather we
> explicitly check err for SW compressors and dlen for HW compressors.
>
> That being said, I thought what Herbert suggested was that the same API
> is used for both SW and HW compressors. IOW, either way we submit a
> batch of pages (8 pages for SW compressors), and then the crypto API
> would either give the entire batch to the compressor if it supports
> batching, or loop over them internally and hand them page-by-page to
> the compressor.
That was not how I understood Herbert's suggestion for the batching interface.
He did suggest the following:
"Before the call to acomp, the destination SG list should contain as
many elements as the number of units. On return, the dst lengths
should be stored in each destination SG entry."
I have incorporated this suggestion in the iaa_crypto driver. For SW
compressors, I have tried not to propose any API changes, while making
sure that the zswap changes for the SG lists batching API work as expected
for SW without too much special-casing code.
I suppose I always assumed that we would update SW compressors later,
and not as part of this patch-set.
>
> This would simplify usage as we do not have to handle the differences in
> zswap.
That's the nice thing about SG lists - I think the zswap_compress() calls to
the new batching API appears agnostic to SW and HW compressors.
Other than the upfront "errp = (pool->compr_batch_size == 1) ? &err : &err_sg;"
the logical code organization of the new zswap_compress() is quite similar to
the existing code. The post-compress "dlen = sg->length | *errp;" handles the rest.
>
> If that is not doable, at the very least the API should be consistent.
> Right now the error code and length are propagated differently to the
> caller based on whether or not the compressor support batching.
Hopefully this minor difference is transitional while we move zswap to
use the new batching interface, with the assumption that crypto SW API
can be updated later. We would need to get Herbert's thoughts on this.
>
> >
> >
> > >
> > > >
> > > > >
> > > > > > + dlen = PAGE_SIZE;
> > > > > > + dst =
> kmap_local_page(folio_page(folio, start
> > > > > + j));
> > > > > > + }
> > > > > > +
> > > > > > + handle = zs_malloc(pool->zs_pool, dlen, gfp,
> nid);
> > > > > >
> > > > > > - zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > > > > - entry->handle = handle;
> > > > > > - entry->length = dlen;
> > > > > > + if (IS_ERR_VALUE(handle)) {
> > > > > > + if (PTR_ERR((void *)handle) == -
> ENOSPC)
> > > > > > +
> zswap_reject_compress_poor++;
> > > > > > + else
> > > > > > + zswap_reject_alloc_fail++;
> > > > > >
> > > > > > -unlock:
> > > > > > - if (mapped)
> > > > > > - kunmap_local(dst);
> > > > > > - if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > > > > > - zswap_reject_compress_poor++;
> > > > > > - else if (comp_ret)
> > > > > > - zswap_reject_compress_fail++;
> > > > > > - else if (alloc_ret)
> > > > > > - zswap_reject_alloc_fail++;
> > > > > > + goto err_unlock;
> > > > > > + }
> > > > > > +
> > > > > > + zs_obj_write(pool->zs_pool, handle, dst,
> dlen);
> > > > > > + entries[j]->handle = handle;
> > > > > > + entries[j]->length = dlen;
> > > > > > + if (dst != acomp_ctx->buffers[k])
> > > > > > + kunmap_local(dst);
> > > > > > + }
> > > > > > + } /* finished compress and store nr_pages. */
> > > > > > +
> > > > > > + mutex_unlock(&acomp_ctx->mutex);
> > > > > > + return true;
> > > > > > +
> > > > > > +compress_error:
> > > > > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > > > + if ((int)sg->length < 0) {
> > > > > > + if ((int)sg->length == -ENOSPC)
> > > > > > + zswap_reject_compress_poor++;
> > > > > > + else
> > > > > > + zswap_reject_compress_fail++;
> > > > > > + }
> > > > > > + }
> > > > > >
> > > > > > +err_unlock:
> > > > > > mutex_unlock(&acomp_ctx->mutex);
> > > > > > - return comp_ret == 0 && alloc_ret == 0;
> > > > > > + return false;
> > > > > > }
> > > > > >
> > > > > > static bool zswap_decompress(struct zswap_entry *entry, struct
> folio
> > > > > *folio)
> > > > > > @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct
> folio
> > > > > *folio,
> > > > > > INIT_LIST_HEAD(&entries[i]->lru);
> > > > > > }
> > > > > >
> > > > > > - for (i = 0; i < nr_pages; ++i) {
> > > > > > - struct page *page = folio_page(folio, start + i);
> > > > > > -
> > > > > > - if (!zswap_compress(page, entries[i], pool,
> wb_enabled))
> > > > > > - goto store_pages_failed;
> > > > > > - }
> > > > > > + if (unlikely(!zswap_compress(folio, start, nr_pages, entries,
> pool,
> > > > > > + nid, wb_enabled)))
> > > > > > + goto store_pages_failed;
> > > > > >
> > > > > > for (i = 0; i < nr_pages; ++i) {
> > > > > > struct zswap_entry *old, *entry = entries[i];
> > > > > > --
> > > > > > 2.27.0
> > > > > >
Powered by blists - more mailing lists