Message-ID: <v5veb673xrz6z3tevfdymhuik7nltojrvcqyjih4ds7co4p4hr@5e7ngdkrxo32>
Date: Mon, 5 Jan 2026 15:58:42 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Sergey Senozhatsky <senozhatsky@...omium.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Nhat Pham <nphamcs@...il.com>, Minchan Kim <minchan@...nel.org>,
Johannes Weiner <hannes@...xchg.org>, Brian Geffon <bgeffon@...gle.com>, linux-kernel@...r.kernel.org,
Herbert Xu <herbert@...dor.apana.org.au>, linux-mm@...ck.org
Subject: Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should
consider other metrics
On Mon, Jan 05, 2026 at 10:42:51AM +0900, Sergey Senozhatsky wrote:
> On (26/01/02 18:29), Yosry Ahmed wrote:
> > On Thu, Jan 01, 2026 at 10:38:14AM +0900, Sergey Senozhatsky wrote:
> [..]
> >
> > I worry that the heuristics are too hand-wavy
>
> I don't disagree. Am not super excited about the heuristics either.
>
> > and I wonder if the memcpy savings actually show up as perf improvements
> > in any real life workload. Do we have data about this?
>
> I don't have real-life 16K PAGE_SIZE devices. However, on 16K PAGE_SIZE
> systems we have "normal" size-classes up to a very large size; a normal
> class means chaining 0-order physical pages, and chaining means objects
> that span pages. So on 16K the memcpy overhead is expected to be somewhat
> noticeable.
I don't disagree that it could be a problem; I am just against
optimizations without data. Without it, these heuristics become hard to
modify or remove later, since we don't really know what effect they had in
the first place.
We also don't know if the 0.5% increase in memory usage is actually
offset by CPU gains.
>
> > I also vaguely recall discussions about other ways to avoid the memcpy
> > using scatterlists, so I am wondering if this is the right metric to
> > optimize.
>
> As far as I understand, the SG-list based approach will require
> implementing split-data handling on the compression algorithms side,
> which is not trivial (especially if the only reason to do that is
> zsmalloc).
I am not sure tbh; adding Herbert here. I remember looking at the code
in scomp_acomp_comp_decomp() at some point, and I think it takes care of
non-contiguous SG-lists. Not sure if that's the correct place to look,
though.
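To make it concrete, this is roughly the kind of thing an SG-list based
path would let zsmalloc/zram callers do for a spanning object (a rough,
untested sketch; nothing like this exists today, and the function name,
parameters, and synchronous-tfm assumption are all mine):

#include <linux/scatterlist.h>
#include <crypto/acompress.h>

/*
 * Describe an object spanning two 0-order pages as a two-entry
 * scatterlist and hand it to the acomp API directly, instead of
 * memcpy()ing it into a contiguous buffer first.
 */
static int decompress_spanning_obj(struct crypto_acomp *tfm,
				   struct page *pages[2], unsigned int off,
				   unsigned int src_len,
				   struct page *dst_page, unsigned int dst_len)
{
	struct scatterlist src[2], dst;
	struct acomp_req *req;
	int ret;

	sg_init_table(src, 2);
	/* first chunk: from 'off' to the end of the first page */
	sg_set_page(&src[0], pages[0], PAGE_SIZE - off, off);
	/* second chunk: the remainder at the start of the next page */
	sg_set_page(&src[1], pages[1], src_len - (PAGE_SIZE - off), 0);

	sg_init_table(&dst, 1);
	sg_set_page(&dst, dst_page, dst_len, 0);

	req = acomp_request_alloc(tfm);
	if (!req)
		return -ENOMEM;

	acomp_request_set_params(req, src, &dst, src_len, dst_len);
	/* assumes a synchronous tfm; an async one can return -EINPROGRESS */
	ret = crypto_acomp_decompress(req);
	acomp_request_free(req);
	return ret;
}

Whether the scomp fallback handles such a src directly or just
linearizes it into a bounce buffer (moving the memcpy rather than
removing it) is exactly what I'm unsure about.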
>
> Alternatively, we maybe can try to vmap spanning objects:
Using vmap makes sense in theory, but in practice (at least for zswap)
it doesn't help because SG lists do not support vmap addresses. Zswap
will actually treat them the same as highmem and copy them to a buffer
before putting them in an SG list, so we effectively just do the
memcpy() in zswap instead of zsmalloc.
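Roughly the constraint, paraphrasing from memory rather than quoting the
exact zswap code ('buffer' here stands in for the per-CPU acomp buffer):

#include <linux/mm.h>
#include <linux/string.h>

/*
 * sg_init_one() resolves the address to a struct page via virt_to_page(),
 * so it only works for linear-map addresses.  A vm_map_ram() address
 * fails virt_addr_valid() just like a highmem kmap address does, so it
 * has to be bounced through a buffer before it can go into an SG list.
 */
static void *sg_usable_src(void *obj, size_t len, void *buffer)
{
	if (virt_addr_valid(obj))
		return obj;

	memcpy(buffer, obj, len);
	return buffer;
}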
>
> ---
> mm/zsmalloc.c | 24 +++++++++++++-----------
> 1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 6fc216ab8190..4a68c27cb5d4 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -38,6 +38,7 @@
> #include <linux/zsmalloc.h>
> #include <linux/fs.h>
> #include <linux/workqueue.h>
> +#include <linux/vmalloc.h>
> #include "zpdesc.h"
>
> #define ZSPAGE_MAGIC 0x58
> @@ -1097,19 +1098,15 @@ void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
> addr = kmap_local_zpdesc(zpdesc);
> addr += off;
> } else {
> - size_t sizes[2];
> + struct page *pages[2];
>
> /* this object spans two pages */
> - sizes[0] = PAGE_SIZE - off;
> - sizes[1] = class->size - sizes[0];
> - addr = local_copy;
> -
> - memcpy_from_page(addr, zpdesc_page(zpdesc),
> - off, sizes[0]);
> - zpdesc = get_next_zpdesc(zpdesc);
> - memcpy_from_page(addr + sizes[0],
> - zpdesc_page(zpdesc),
> - 0, sizes[1]);
> + pages[0] = zpdesc_page(zpdesc);
> + pages[1] = zpdesc_page(get_next_zpdesc(zpdesc));
> + addr = vm_map_ram(pages, 2, NUMA_NO_NODE);
> + if (!addr)
> + return NULL;
> + addr += off;
> }
>
> if (!ZsHugePage(zspage))
> @@ -1139,6 +1136,11 @@ void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
> off += ZS_HANDLE_SIZE;
> handle_mem -= off;
> kunmap_local(handle_mem);
> + } else {
> + if (!ZsHugePage(zspage))
> + off += ZS_HANDLE_SIZE;
> + handle_mem -= off;
> + vm_unmap_ram(handle_mem, 2);
> }
>
> zspage_read_unlock(zspage);
> --
> 2.52.0.351.gbe84eed79e-goog
>
>
> > What are the main pain points for PAGE_SIZE > 4K configs? Is it the
> > compression/decompression time? In my experience that is usually not the
> > bottleneck; I would imagine the real problem would be the internal
> > fragmentation.
>
> Right, internal fragmentation can be the main problem.