Message-ID: <20250731172009.108025-1-joshua.hahnjy@gmail.com>
Date: Thu, 31 Jul 2025 10:20:05 -0700
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: SeongJae Park <sj@...nel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Johannes Weiner <hannes@...xchg.org>,
Nhat Pham <nphamcs@...il.com>,
Takero Funaki <flintglass@...il.com>,
Yosry Ahmed <yosry.ahmed@...ux.dev>,
kernel-team@...a.com,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [RFC PATCH] mm/zswap: store compression failed page as-is
On Wed, 30 Jul 2025 16:40:59 -0700 SeongJae Park <sj@...nel.org> wrote:
Hi SJ, thank you for your patch! This is a really cool idea :-)
> When zswap writeback is enabled and it fails compressing a given page,
> zswap lets the page be swapped out to the backing swap device. This
> behavior breaks the zswap's writeback LRU order, and hence users can
> experience unexpected latency spikes.
>
> Keep the LRU order by storing the original content in zswap as-is. The
> original content is saved in a dynamically allocated page size buffer,
> and the pointer to the buffer is kept in zswap_entry, on the space for
> zswap_entry->pool. Whether the space is used for the original content
> or zpool is identified by 'zswap_entry->length == PAGE_SIZE'.
>
> This avoids increasing per zswap entry metadata overhead. But as the
> number of incompressible pages increases, zswap metadata overhead is
> proportionally increased. The overhead should not be problematic in
> usual cases, since the zswap metadata for single zswap entry is much
> smaller than PAGE_SIZE, and in common zswap use cases there should be
> sufficient amount of compressible pages. Also it can be mitigated by
> the zswap writeback.
I think one other benefit worth mentioning here is that by moving the
incompressible pages to the zswap LRU, we prevent wasting CPU cycles on
pages that have not changed since their last compression attempt. This is
a real concern we have seen manifest: zswap burns cycles trying to compress
the same pages over and over again, failing each time. Wasted CPU cycles
are bad in general, but even worse during reclaim, when we need to free up
memory quickly!
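To make sure I am reading the idea right (the store-side hunk is snipped
below, so this is only my rough sketch of the intended flow with a made-up
helper name and guessed gfp flags, not the patch itself): on compression
failure, with the new knob enabled, the original page gets copied into a
kmalloc'ed buffer and the entry is tagged by setting length to PAGE_SIZE,
roughly along the lines of:

/* Rough sketch only -- helper name and gfp flags are my guesses. */
static bool zswap_store_incompressible(struct page *page,
				       struct zswap_entry *entry)
{
	void *buf, *src;

	if (!zswap_save_incompressible_pages)
		return false;

	/* Reclaim context, so presumably a non-blocking allocation. */
	buf = kmalloc(PAGE_SIZE, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOWARN);
	if (!buf)
		return false;

	src = kmap_local_page(page);
	memcpy(buf, src, PAGE_SIZE);
	kunmap_local(src);

	entry->orig_data = buf;
	entry->length = PAGE_SIZE;	/* marks the entry as stored as-is */
	atomic_long_inc(&zswap_stored_uncompressed_pages);
	return true;
}

If that is roughly the shape of it, I like that the 'length == PAGE_SIZE'
tag keeps the zswap_entry layout unchanged.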
> When the memory pressure comes from memcg's memory.high and zswap
> writeback is set to be triggered for that, however, the penalty_jiffies
> sleep could degrade the performance. Add a parameter, namely
> 'save_incompressible_pages', to turn the feature on/off as users want.
> It is turned off by default.
>
> When the writeback is just disabled, the additional overhead could be
> problematic. For the case, keep the behavior that just returns the
> failure and let swap_writeout() puts the page back to the active LRU
> list in the case. It is known to be suboptimal when the incompressible
> pages are cold, since the incompressible pages will continuously be
> tried to be zswapped out, and burn CPU cycles for compression attempts
> that will anyway fails. But that's out of the scope of this patch.
Indeed, and your patch fixes this problem, at least for the writeback
enabled case!
[...snip...]
> Signed-off-by: SeongJae Park <sj@...nel.org>
> ---
> mm/zswap.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 65 insertions(+), 8 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 7e02c760955f..e021865696c6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -45,6 +45,9 @@
> /* The number of compressed pages currently stored in zswap */
> atomic_long_t zswap_stored_pages = ATOMIC_LONG_INIT(0);
>
> +/* The number of uncompressed pages currently stored in zswap */
> +atomic_long_t zswap_stored_uncompressed_pages = ATOMIC_LONG_INIT(0);
> +
> /*
> * The statistics below are not protected from concurrent access for
> * performance reasons so they may not be a 100% accurate. However,
> @@ -129,6 +132,11 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
> CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +/* Store incompressible pages as is */
> +static bool zswap_save_incompressible_pages;
> +module_param_named(save_incompressible_pages, zswap_save_incompressible_pages,
> + bool, 0644);
> +
> bool zswap_is_enabled(void)
> {
> return zswap_enabled;
> @@ -190,7 +198,8 @@ static struct shrinker *zswap_shrinker;
> * writeback logic. The entry is only reclaimed by the writeback
> * logic if referenced is unset. See comments in the shrinker
> * section for context.
> - * pool - the zswap_pool the entry's data is in
> + * orig_data - uncompressed original data of the page, if length is PAGE_SIZE
> + * pool - the zswap_pool the entry's data is in, if length is not PAGE_SIZE
> * handle - zpool allocation handle that stores the compressed page data
> * objcg - the obj_cgroup that the compressed memory is charged to
> * lru - handle to the pool's lru used to evict pages.
> @@ -199,7 +208,10 @@ struct zswap_entry {
> swp_entry_t swpentry;
> unsigned int length;
> bool referenced;
> - struct zswap_pool *pool;
> + union {
> + void *orig_data;
> + struct zswap_pool *pool;
> + };
> unsigned long handle;
> struct obj_cgroup *objcg;
> struct list_head lru;
> @@ -500,7 +512,7 @@ unsigned long zswap_total_pages(void)
> total += zpool_get_total_pages(pool->zpool);
> rcu_read_unlock();
>
> - return total;
> + return total + atomic_long_read(&zswap_stored_uncompressed_pages);
> }
>
> static bool zswap_check_limits(void)
> @@ -805,8 +817,13 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
> static void zswap_entry_free(struct zswap_entry *entry)
> {
> zswap_lru_del(&zswap_list_lru, entry);
> - zpool_free(entry->pool->zpool, entry->handle);
> - zswap_pool_put(entry->pool);
> + if (entry->length == PAGE_SIZE) {
> + kfree(entry->orig_data);
> + atomic_long_dec(&zswap_stored_uncompressed_pages);
> + } else {
> + zpool_free(entry->pool->zpool, entry->handle);
> + zswap_pool_put(entry->pool);
> + }
> if (entry->objcg) {
> obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
> obj_cgroup_put(entry->objcg);
> @@ -937,6 +954,36 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
> mutex_unlock(&acomp_ctx->mutex);
> }
>
> +/*
> + * If the compression is failed, try saving the content as is without
> + * compression, to keep the LRU order. This can increase memory overhead from
> + * metadata, but in common zswap use cases where there are sufficient amount of
> + * compressible pages, the overhead should be not ciritical, and can be
NIT: s/ciritical/critical/?
> + * mitigated by the writeback. Also, the decompression overhead is optimized.
> + *
> + * When the writeback is disabled, however, the additional overhead could be
> + * problematic. For the case, just return the failure. swap_writeout() will
NIT: s/  / /? (double space)
[...snip...]
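On the load side (also snipped here, so again this is just my guess at the
shape, not lifted from the patch), I assume the entry->length == PAGE_SIZE
case can skip the acomp machinery entirely and just copy the saved buffer
back, something like:

	/* Guessed load-side fast path for as-is stored entries. */
	if (entry->length == PAGE_SIZE) {
		void *dst = kmap_local_page(page);

		memcpy(dst, entry->orig_data, PAGE_SIZE);
		kunmap_local(dst);
		return true;	/* no decompression needed */
	}

which would also give the cheaper loads that the comment above alludes to.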
And I think Nhat and Johannes have covered the other concerns.
Thanks again SJ for working on this, I hope you have a great day!
Joshua
Sent using hkml (https://github.com/sjp38/hackermail)