Message-ID: <CAKEwX=MRz7dL5=cP=4QN6uTgUJLh-xCt_zpfxytqZkYdycUN5w@mail.gmail.com>
Date: Tue, 3 Feb 2026 15:14:06 -0800
From: Nhat Pham <nphamcs@...il.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Sergey Senozhatsky <senozhatsky@...omium.org>, Andrew Morton <akpm@...ux-foundation.org>,
Minchan Kim <minchan@...nel.org>, Johannes Weiner <hannes@...xchg.org>,
Brian Geffon <bgeffon@...gle.com>, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Herbert Xu <herbert@...dor.apana.org.au>
Subject: Re: [PATCH] zsmalloc: introduce SG-list based object read API
On Fri, Jan 16, 2026 at 11:53 AM Yosry Ahmed <yosry.ahmed@...ux.dev> wrote:
>
> On Tue, Jan 13, 2026 at 12:46:45PM +0900, Sergey Senozhatsky wrote:
> > Currently, zsmalloc performs address linearization on read
> > (which sometimes requires a memcpy() to a local buffer).
> > Not all zsmalloc users need a linear address. For example,
> > the Crypto API accepts SG-lists and performs linearization
> > under the hood if needed. In addition, some compressors have
> > native SG-list support, avoiding the linearization step
> > entirely.
> >
> > Provide an SG-list based zsmalloc read API:
> > - zs_obj_read_sg_begin()
> > - zs_obj_read_sg_end()
> >
> > This API allows callers to obtain an SG representation
> > of the object (one entry for objects that are contained
> > in a single page and two entries for spanning objects),
> > avoiding the need for a bounce buffer and memcpy.
> >
> > Signed-off-by: Sergey Senozhatsky <senozhatsky@...omium.org>
> > Cc: Herbert Xu <herbert@...dor.apana.org.au>
> > ---
>
> I tested this with the zswap patch at the bottom, let me know if you
> want me to send separately on top, or you can include it in the next
> revision. Either way, feel free to add:
>
> Tested-by: Yosry Ahmed <yosry.ahmed@...ux.dev>
>
> > include/linux/zsmalloc.h | 4 +++
> > mm/zsmalloc.c | 65 ++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 69 insertions(+)
> >
> > diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> > index 5565c3171007..11e614663dd3 100644
> > --- a/include/linux/zsmalloc.h
> > +++ b/include/linux/zsmalloc.h
> > @@ -22,6 +22,7 @@ struct zs_pool_stats {
> > };
> >
> > struct zs_pool;
> > +struct scatterlist;
> >
> > struct zs_pool *zs_create_pool(const char *name);
> > void zs_destroy_pool(struct zs_pool *pool);
> > @@ -43,6 +44,9 @@ void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
> > size_t mem_len, void *local_copy);
> > void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
> > size_t mem_len, void *handle_mem);
> > +int zs_obj_read_sg_begin(struct zs_pool *pool, unsigned long handle,
>
> void? The return value is always 0.
>
> > + struct scatterlist *sg, size_t mem_len);
> > +void zs_obj_read_sg_end(struct zs_pool *pool, unsigned long handle);
> > void zs_obj_write(struct zs_pool *pool, unsigned long handle,
> > void *handle_mem, size_t mem_len);
> >
> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> > index 16d5587a052a..5abb8bc0956a 100644
> > --- a/mm/zsmalloc.c
> > +++ b/mm/zsmalloc.c
> > @@ -30,6 +30,7 @@
> > #include <linux/highmem.h>
> > #include <linux/string.h>
> > #include <linux/slab.h>
> > +#include <linux/scatterlist.h>
> > #include <linux/spinlock.h>
> > #include <linux/sprintf.h>
> > #include <linux/shrinker.h>
> > @@ -1146,6 +1147,70 @@ void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
> > }
> > EXPORT_SYMBOL_GPL(zs_obj_read_end);
> >
> > +int zs_obj_read_sg_begin(struct zs_pool *pool, unsigned long handle,
> > + struct scatterlist *sg, size_t mem_len)
> > +{
> > + struct zspage *zspage;
> > + struct zpdesc *zpdesc;
> > + unsigned long obj, off;
> > + unsigned int obj_idx;
> > + struct size_class *class;
> > +
> > + /* Guarantee we can get zspage from handle safely */
> > + read_lock(&pool->lock);
> > + obj = handle_to_obj(handle);
> > + obj_to_location(obj, &zpdesc, &obj_idx);
> > + zspage = get_zspage(zpdesc);
> > +
> > + /* Make sure migration doesn't move any pages in this zspage */
> > + zspage_read_lock(zspage);
> > + read_unlock(&pool->lock);
> > +
> > + class = zspage_class(pool, zspage);
> > + off = offset_in_page(class->size * obj_idx);
>
> There is a lot of duplication between this and zs_obj_read_begin(). I
> wanted to create a common helper for them both that returns the zpdesc
> and offset, but we cannot do the same on the read end side as the unlock
> needs to happen after kunmap() in zs_obj_read_end().
>
> Putting parts of this code in helpers makes it a bit obscure due to the
> locking rules :/
>
> I wonder if we can drop zs_obj_read_*() and move the spanning logic into
> zram. Looking at zram code, seems like read_from_zspool_raw() and
> read_incompressible_page() just copy from the returned address, so I
> can trivially move to using the SG list helpers and
> memcpy_from_sglist().
>
> The only non-trivial caller is read_compressed_page(), because it passes
> the compressed object to zcomp. So I think we only need to handle the
> linearization there, something like this (completely untested):
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 61d3e2c74901..6872819184d9 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -2073,6 +2073,7 @@ static int read_incompressible_page(struct zram *zram, struct page *page,
>
> static int read_compressed_page(struct zram *zram, struct page *page, u32 index)
> {
> + struct scatterlist src_sg[2];
> struct zcomp_strm *zstrm;
> unsigned long handle;
> unsigned int size;
> @@ -2084,12 +2085,23 @@ static int read_compressed_page(struct zram *zram, struct page *page, u32 index)
> prio = get_slot_comp_priority(zram, index);
>
> zstrm = zcomp_stream_get(zram->comps[prio]);
> - src = zs_obj_read_begin(zram->mem_pool, handle, size,
> - zstrm->local_copy);
> + zs_obj_read_sg_begin(zram->mem_pool, handle, src_sg, size);
> +
> + if (sg_nents(src_sg) == 1) {
> + src = kmap_local_page(sg_page(src_sg)) + src_sg->offset;
> + } else {
> + memcpy_from_sglist(zstrm->local_copy, src_sg, 0, size);
> + src = zstrm->local_copy;
> + }
> +
> dst = kmap_local_page(page);
> ret = zcomp_decompress(zram->comps[prio], zstrm, src, size, dst);
> kunmap_local(dst);
> - zs_obj_read_end(zram->mem_pool, handle, size, src);
> +
> + if (src != zstrm->local_copy)
> + kunmap_local(src);
> +
> + zs_obj_read_sg_end(zram->mem_pool, handle);
> zcomp_stream_put(zstrm);
>
> return ret;
>
> ---
>
> If this is too ugly we can even put it in helpers like
> linearize_sg_list() and delinearize_sg_list(), that take in the SG list
> and zstrm, and figure out if mapping or copying to the buffer is needed.
>
> > +
> > + if (!ZsHugePage(zspage))
> > + off += ZS_HANDLE_SIZE;
> > +
> > + if (off + mem_len <= PAGE_SIZE) {
> > + /* this object is contained entirely within a page */
> > + sg_init_table(sg, 1);
> > + sg_set_page(sg, zpdesc_page(zpdesc), mem_len, off);
> > + } else {
> > + size_t sizes[2];
> > +
> > + /* this object spans two pages */
> > + sizes[0] = PAGE_SIZE - off;
> > + sizes[1] = mem_len - sizes[0];
> > +
> > + sg_init_table(sg, 2);
> > + sg_set_page(sg, zpdesc_page(zpdesc), sizes[0], off);
> > +
> > + zpdesc = get_next_zpdesc(zpdesc);
> > + sg = sg_next(sg);
> > +
> > + sg_set_page(sg, zpdesc_page(zpdesc), sizes[1], 0);
> > + }
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(zs_obj_read_sg_begin);
> > +
> > +void zs_obj_read_sg_end(struct zs_pool *pool, unsigned long handle)
> > +{
> > + struct zspage *zspage;
> > + struct zpdesc *zpdesc;
> > + unsigned long obj;
> > + unsigned int obj_idx;
> > +
> > + obj = handle_to_obj(handle);
> > + obj_to_location(obj, &zpdesc, &obj_idx);
> > + zspage = get_zspage(zpdesc);
> > +
> > + zspage_read_unlock(zspage);
> > +}
> > +EXPORT_SYMBOL_GPL(zs_obj_read_sg_end);
> > +
> > void zs_obj_write(struct zs_pool *pool, unsigned long handle,
> > void *handle_mem, size_t mem_len)
> > {
> > --
> > 2.52.0.457.g6b5491de43-goog
> >
>
> Zswap patch:
>
> From 8b7c41efce7635dd14db9586c3bb96c861298772 Mon Sep 17 00:00:00 2001
> From: Yosry Ahmed <yosry.ahmed@...ux.dev>
> Date: Fri, 16 Jan 2026 19:20:02 +0000
> Subject: [PATCH] mm: zswap: Use SG list decompression APIs from zsmalloc
>
> Use the new zs_obj_read_sg_*() APIs in zswap_decompress(), instead of
> zs_obj_read_*() APIs returning a linear address. The SG list is passed
> directly to the crypto API, simplifying the logic and dropping the
> workaround that copies highmem addresses to a buffer. The crypto API
> should internally linearize the SG list if needed.
>
> Zsmalloc fills an SG list with up to 2 entries, so change the input
> SG list to hold 2 entries.
>
> Update the incompressible entries path to use memcpy_from_sglist() to
> copy the data to the folio. Opportunistically set dlen to PAGE_SIZE in
> the same code path (rather than at the top of the function) to make it
> clearer.
>
> Drop the goto in zswap_decompress() as the code is now simple enough
> for an if-else statement instead. Rename 'decomp_ret' to 'ret' and
> reuse it for the intermediate return value of crypto_acomp_decompress()
> to keep line lengths manageable.
>
> No functional change intended.
>
> Signed-off-by: Yosry Ahmed <yosry.ahmed@...ux.dev>
> ---
> mm/zswap.c | 49 +++++++++++++++++++------------------------------
> 1 file changed, 19 insertions(+), 30 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2f410507cbc8..aea3267c5a96 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -26,6 +26,7 @@
> #include <linux/mempolicy.h>
> #include <linux/mempool.h>
> #include <crypto/acompress.h>
> +#include <crypto/scatterwalk.h>
> #include <linux/zswap.h>
> #include <linux/mm_types.h>
> #include <linux/page-flags.h>
> @@ -936,53 +937,41 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> {
> struct zswap_pool *pool = entry->pool;
> - struct scatterlist input, output;
> + struct scatterlist input[2]; /* zsmalloc returns an SG list 1-2 entries */
> + struct scatterlist output;
> struct crypto_acomp_ctx *acomp_ctx;
> - int decomp_ret = 0, dlen = PAGE_SIZE;
> - u8 *src, *obj;
> + int ret = 0, dlen;
>
> acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> - obj = zs_obj_read_begin(pool->zs_pool, entry->handle, entry->length,
> - acomp_ctx->buffer);
> + zs_obj_read_sg_begin(pool->zs_pool, entry->handle, input, entry->length);
>
> /* zswap entries of length PAGE_SIZE are not compressed. */
> if (entry->length == PAGE_SIZE) {
> - memcpy_to_folio(folio, 0, obj, entry->length);
> - goto read_done;
> - }
> -
> - /*
> - * zs_obj_read_begin() might return a kmap address of highmem when
> - * acomp_ctx->buffer is not used. However, sg_init_one() does not
> - * handle highmem addresses, so copy the object to acomp_ctx->buffer.
> - */
> - if (virt_addr_valid(obj)) {
> - src = obj;
> + WARN_ON_ONCE(input->length != PAGE_SIZE);
> + memcpy_from_sglist(kmap_local_folio(folio, 0), input, 0, PAGE_SIZE);
> + dlen = PAGE_SIZE;
> } else {
> - WARN_ON_ONCE(obj == acomp_ctx->buffer);
> - memcpy(acomp_ctx->buffer, obj, entry->length);
> - src = acomp_ctx->buffer;
> + sg_init_table(&output, 1);
> + sg_set_folio(&output, folio, PAGE_SIZE, 0);
> + acomp_request_set_params(acomp_ctx->req, input, &output,
> + entry->length, PAGE_SIZE);
> + ret = crypto_acomp_decompress(acomp_ctx->req);
> + ret = crypto_wait_req(ret, &acomp_ctx->wait);
> + dlen = acomp_ctx->req->dlen;
> }
>
> - sg_init_one(&input, src, entry->length);
> - sg_init_table(&output, 1);
> - sg_set_folio(&output, folio, PAGE_SIZE, 0);
> - acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> - decomp_ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> - dlen = acomp_ctx->req->dlen;
> -
> -read_done:
> - zs_obj_read_end(pool->zs_pool, entry->handle, entry->length, obj);
> + zs_obj_read_sg_end(pool->zs_pool, entry->handle);
> acomp_ctx_put_unlock(acomp_ctx);
>
> - if (!decomp_ret && dlen == PAGE_SIZE)
> + if (!ret && dlen == PAGE_SIZE)
> return true;
>
> zswap_decompress_fail++;
> pr_alert_ratelimited("Decompression error from zswap (%d:%lu %s %u->%d)\n",
> swp_type(entry->swpentry),
> swp_offset(entry->swpentry),
> - entry->pool->tfm_name, entry->length, dlen);
> + entry->pool->tfm_name,
> + entry->length, dlen);
Just style change? :)
> return false;
> }
>
> --
> 2.52.0.457.g6b5491de43-goog
>
Looks like the new zsmalloc sglist-based API has landed in mm-stable.
With that, the zswap API conversion LGTM.
Acked-by: Nhat Pham <nphamcs@...il.com>