Message-ID: <ddfs43qldaws5urlnpah3ibp5xeu7st37p5hgdfajvdtwor4sd@fkcm3brinygo>
Date: Fri, 2 Jan 2026 18:29:56 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Sergey Senozhatsky <senozhatsky@...omium.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, 
	Nhat Pham <nphamcs@...il.com>, Minchan Kim <minchan@...nel.org>, 
	Johannes Weiner <hannes@...xchg.org>, Brian Geffon <bgeffon@...gle.com>, linux-kernel@...r.kernel.org, 
	linux-mm@...ck.org
Subject: Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should
 consider other metrics

On Thu, Jan 01, 2026 at 10:38:14AM +0900, Sergey Senozhatsky wrote:
> This is the first step towards re-thinking the optimization strategy
> during chain-size (the number of order-0 physical pages a zspage
> chains together for optimal performance) configuration.  Currently,
> we only consider one metric - "wasted" memory - and try various
> chain length configurations in order to find the minimal wasted
> space configuration.  However, this strategy doesn't consider
> the fact that our optimization space is not single-dimensional.
> When we increase the zspage chain length we at the same time increase the
> number of spanning objects (objects that span two physical pages).
> Such objects slow down read() operations because zsmalloc needs to
> kmap both pages and memcpy objects' chunks.  This clearly increases
> CPU usage and battery drain.
> 
> We most likely need to consider numerous metrics and optimize
> in a multi-dimensional space.  These can be wired in later on; for
> now we just add a heuristic that increases the zspage chain length
> only if it brings substantial memory savings.  We can tune these
> threshold values (there is a simple user-space tool [2] to
> experiment with those knobs), but what we currently have is already
> interesting enough.  Here is where this brings us, using a synthetic
> test [1] which produces byte-to-byte comparable workloads, on a
> 4K PAGE_SIZE, chain size 10 system:
> 
> BASE
> ====
>  zsmalloc_test: num write objects: 339598
>  zsmalloc_test: pool pages used 175111, total allocated size 698213488
>  zsmalloc_test: pool memory utilization: 97.3
>  zsmalloc_test: num read objects: 339598
>  zsmalloc_test: spanning objects: 110377, total memcpy size: 278318624
> 
> PATCHED
> =======
>  zsmalloc_test: num write objects: 339598
>  zsmalloc_test: pool pages used 175920, total allocated size 698213488
>  zsmalloc_test: pool memory utilization: 96.8
>  zsmalloc_test: num read objects: 339598
>  zsmalloc_test: spanning objects: 103256, total memcpy size: 265378608
> 
> At the price of a 0.5% increase in pool memory usage there was a 6.5%
> reduction in the number of spanning objects (4.6% fewer copied bytes).
> 
> Note, the results are specific to this particular test case.  The
> savings are not uniformly distributed: according to [2], for some
> size classes the number of spanning objects per zspage goes down
> from 7 to 0 (e.g. size class 368), for others from 4 to 2 (e.g.
> size class 640).  So the actual memcpy savings are data-pattern
> dependent, as always.
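
For reference, a minimal user-space sketch of the proposed selection
logic (the 10%, 25% and ~1.5% thresholds are those in the hunk quoted
below; PAGE_SIZE 4096 and the class sizes 368/640 are assumptions taken
from the changelog for illustration, and min_waste is seeded with
PAGE_SIZE rather than INT_MAX so the *10 comparison cannot overflow in
this sketch):

#include <stdio.h>

#define PAGE_SIZE		4096
#define ZS_MAX_PAGES_PER_ZSPAGE	10

static int pick_chain_size(int class_size)
{
	/* seed with PAGE_SIZE: waste is always < class_size <= PAGE_SIZE here */
	int i, min_waste = PAGE_SIZE, best = 1;

	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
		int waste = (i * PAGE_SIZE) % class_size;

		if (waste == 0)
			return i;

		if (min_waste * 10 > best * PAGE_SIZE) {
			/* current best wastes >10% of its zspage: take any improvement */
			if (waste < min_waste) {
				min_waste = waste;
				best = i;
			}
		} else if (waste * 4 < min_waste * 3) {
			/* current best is decent: require a 25% improvement */
			min_waste = waste;
			best = i;
		}

		/* stop once waste drops below ~1.5% of the zspage */
		if (min_waste * 64 <= best * PAGE_SIZE)
			break;
	}
	return best;
}

int main(void)
{
	int sizes[] = { 368, 640 };

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("class %d -> chain %d\n", sizes[i], pick_chain_size(sizes[i]));
	return 0;
}

With these inputs the sketch picks chain 1 for class 368 and chain 3
for class 640, versus 8 and 5 from the plain minimum-waste search,
which is consistent with the 7 -> 0 and 4 -> 2 spanning-object
examples above.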

I worry that the heuristics are too hand-wavy, and I wonder if the
memcpy savings actually show up as perf improvements in any real-life
workload. Do we have data about this?

I also vaguely recall discussions about other ways to avoid the memcpy
using scatterlists, so I am wondering if this is the right metric to
optimize.

What are the main pain points for PAGE_SIZE > 4K configs? Is it the
compression/decompression time? In my experience this is usually not
the bottleneck; I would imagine the real problem is the internal
fragmentation.
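
As a back-of-the-envelope illustration of that fragmentation point
(assuming the size-class granularity scales with the page size as
PAGE_SIZE >> 8, i.e. 16-byte steps at 4K and 64-byte steps at 16K; the
compressed sizes below are made up):

#include <stdio.h>

/* round a compressed object up to the next size class, given the
 * spacing between classes */
static unsigned int class_round_up(unsigned int size, unsigned int delta)
{
	return (size + delta - 1) / delta * delta;
}

int main(void)
{
	unsigned int sizes[]  = { 833, 1500, 2700 };	/* sample compressed sizes */
	unsigned int deltas[] = { 16, 64 };		/* 4K vs 16K class spacing */

	for (unsigned int d = 0; d < 2; d++)
		for (unsigned int s = 0; s < 3; s++) {
			unsigned int r = class_round_up(sizes[s], deltas[d]);

			printf("delta %3u: %4u -> %4u (+%u bytes wasted)\n",
			       deltas[d], sizes[s], r, r - sizes[s]);
		}
	return 0;
}

Per object the rounding waste grows with the class spacing, so with
larger pages the pool can look noticeably less utilized even when
compression ratios are identical.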

> 
> [1] https://github.com/sergey-senozhatsky/simulate-zsmalloc/blob/main/0001-zsmalloc-add-zsmalloc_test-module.patch
> [2] https://github.com/sergey-senozhatsky/simulate-zsmalloc/blob/main/simulate_zsmalloc.c
> 
> Signed-off-by: Sergey Senozhatsky <senozhatsky@...omium.org>
> ---
>  mm/zsmalloc.c | 39 +++++++++++++++++++++++++++++++--------
>  1 file changed, 31 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 5e7501d36161..929db7cf6c19 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -2000,22 +2000,45 @@ static int zs_register_shrinker(struct zs_pool *pool)
>  static int calculate_zspage_chain_size(int class_size)
>  {
>  	int i, min_waste = INT_MAX;
> -	int chain_size = 1;
> +	int best_chain_size = 1;
>  
>  	if (is_power_of_2(class_size))
> -		return chain_size;
> +		return best_chain_size;
>  
>  	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
> -		int waste;
> +		int curr_waste = (i * PAGE_SIZE) % class_size;
>  
> -		waste = (i * PAGE_SIZE) % class_size;
> -		if (waste < min_waste) {
> -			min_waste = waste;
> -			chain_size = i;
> +		if (curr_waste == 0)
> +			return i;
> +
> +		/*
> +		 * Accept the new chain size if:
> +		 * 1. The current best is wasteful (> 10% of zspage size),
> +		 *    accept anything that is better.
> +		 * 2. The current best is efficient, accept only significant
> +		 *    (25%) improvement.
> +		 */
> +		if (min_waste * 10 > best_chain_size * PAGE_SIZE) {
> +			if (curr_waste < min_waste) {
> +				min_waste = curr_waste;
> +				best_chain_size = i;
> +			}
> +		} else {
> +			if (curr_waste * 4 < min_waste * 3) {
> +				min_waste = curr_waste;
> +				best_chain_size = i;
> +			}
>  		}
> +
> +		/*
> +		 * If the current best chain has low waste (approx < 1.5%
> +		 * relative to zspage size) then accept it right away.
> +		 */
> +		if (min_waste * 64 <= best_chain_size * PAGE_SIZE)
> +			break;
>  	}
>  
> -	return chain_size;
> +	return best_chain_size;
>  }
>  
>  /**
> -- 
> 2.52.0.351.gbe84eed79e-goog
> 
