Message-ID: <CAKEwX=Mv_NC27HWe=MneiewHTSZ-0icJqE2PSikTW-o30n9R-A@mail.gmail.com>
Date: Wed, 30 Jul 2025 17:48:09 -0700
From: Nhat Pham <nphamcs@...il.com>
To: SeongJae Park <sj@...nel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Chengming Zhou <chengming.zhou@...ux.dev>,
Johannes Weiner <hannes@...xchg.org>, Takero Funaki <flintglass@...il.com>,
Yosry Ahmed <yosry.ahmed@...ux.dev>, kernel-team@...a.com, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [RFC PATCH] mm/zswap: store compression failed page as-is
On Wed, Jul 30, 2025 at 4:41 PM SeongJae Park <sj@...nel.org> wrote:
Now, taking a look at the conceptual side of the patch:
>
> When zswap writeback is enabled and it fails compressing a given page,
> zswap lets the page be swapped out to the backing swap device. This
> behavior breaks the zswap's writeback LRU order, and hence users can
> experience unexpected latency spikes.
>
> Keep the LRU order by storing the original content in zswap as-is. The
> original content is saved in a dynamically allocated page-size buffer,
> and the pointer to the buffer is kept in zswap_entry, in the space used
> for zswap_entry->pool. Whether that space holds the original content or
> the zswap_pool pointer is determined by 'zswap_entry->length == PAGE_SIZE'.
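One small interface thought: since the 'length == PAGE_SIZE' check ends up
repeated in zswap_entry_free(), zswap_decompress() and zswap_store_page(), a
tiny helper might read better. A rough sketch, assuming the union layout from
the patch below (the name is just a suggestion, nothing that exists today):

	static inline bool zswap_entry_is_incompressible(struct zswap_entry *entry)
	{
		return entry->length == PAGE_SIZE;
	}

Then the call sites become e.g. 'if (zswap_entry_is_incompressible(entry))'
instead of open-coding the PAGE_SIZE comparison.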
>
> This avoids increasing the per-entry metadata overhead. But as the
> number of incompressible pages increases, the total zswap metadata
> overhead increases proportionally. The overhead should not be
> problematic in usual cases, since the zswap metadata for a single entry
> is much smaller than PAGE_SIZE, and in common zswap use cases there
> should be a sufficient amount of compressible pages. It can also be
> mitigated by zswap writeback.
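For a rough sense of scale: counting the fields in the zswap_entry hunk below
(swpentry, length, referenced, the pool/orig_data union, handle, objcg, lru),
my back-of-the-envelope estimate for a 64-bit build is somewhere around 56
bytes per entry, i.e. on the order of 1-2% of a 4 KiB page. That is an
estimate from the struct layout, not a measured number.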
>
> When the memory pressure comes from a memcg's memory.high limit and
> zswap writeback is set to be triggered for it, however, the
> penalty_jiffies sleep could degrade performance. Add a parameter,
> 'save_incompressible_pages', so users can turn the feature on and off
> as they want. It is turned off by default.
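(Given the 0644 module_param_named() in the patch, I assume this will also be
togglable at runtime via /sys/module/zswap/parameters/save_incompressible_pages,
or via zswap.save_incompressible_pages on the kernel command line.)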
>
> When writeback is disabled, though, the additional overhead could be
> problematic. For that case, keep the current behavior: just return the
> failure and let swap_writeout() put the page back on the active LRU
> list. This is known to be suboptimal when the incompressible pages are
> cold, since such pages will repeatedly be sent to zswap, burning CPU
> cycles on compression attempts that will fail anyway. But that's out of
> the scope of this patch.
As a follow-up, we could reuse the "swapped out" page (and its struct
page) as the storage in the zswap pool (and on the LRU).
>
> Tests
> -----
>
> I tested this patch using a simple self-written microbenchmark that is
> available on GitHub[1]. You can reproduce the test I did by executing
> run_tests.sh from the repo on your system. Note that the repo's
> documentation is not great as of this writing, so you may need to read
> the code to use it.
>
> The basic test scenario is simple. Run a test program that makes
> artificial accesses to memory filled with artificial content, under a
> memory.high limit, and measure how many accesses were made in a given
> time.
>
> The test program repeatedly and randomly accesses three anonymous
> memory regions. Each region is 500 MiB in size, and all are accessed
> with the same probability. Two of them are filled with simple content
> that compresses easily, while the remaining one is filled with content
> read from /dev/urandom, which mostly fails to compress. The program
> runs for two minutes and prints out the number of accesses made every
> five seconds.
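For readers who would rather not dig through the repo, my reading of the
scenario boils down to roughly the sketch below. This is a simplified,
hypothetical illustration, not the actual test program from [1]; the region
count, sizes, runtime and reporting interval come from the description above,
and everything else (names, access granularity) is made up:

	/* Hypothetical sketch of the described workload, not the repo code. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <time.h>
	#include <unistd.h>

	#define NR_REGIONS	3
	#define REGION_SZ	(500UL << 20)	/* 500 MiB each */
	#define RUNTIME_SEC	120
	#define REPORT_SEC	5

	int main(void)
	{
		char *regions[NR_REGIONS];
		unsigned long accesses = 0, off;
		time_t start, last_report;
		int urandom, i;

		for (i = 0; i < NR_REGIONS; i++) {
			regions[i] = mmap(NULL, REGION_SZ, PROT_READ | PROT_WRITE,
					  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (regions[i] == MAP_FAILED)
				return 1;
		}

		/* Two regions get trivially compressible content. */
		memset(regions[0], 0xaa, REGION_SZ);
		memset(regions[1], 0xaa, REGION_SZ);

		/* The third is filled from /dev/urandom, so it hardly compresses. */
		urandom = open("/dev/urandom", O_RDONLY);
		for (off = 0; off < REGION_SZ; off += 1 << 20)
			if (read(urandom, regions[2] + off, 1 << 20) != (1 << 20))
				return 1;
		close(urandom);

		/* Touch random pages in random regions; report every 5 seconds. */
		start = last_report = time(NULL);
		while (time(NULL) - start < RUNTIME_SEC) {
			char *p = regions[rand() % NR_REGIONS] +
				  (rand() % (REGION_SZ / 4096)) * 4096;

			*(volatile char *)p += 1;
			accesses++;

			if (time(NULL) - last_report >= REPORT_SEC) {
				printf("%lu accesses\n", accesses);
				accesses = 0;
				last_report = time(NULL);
			}
		}
		return 0;
	}

Running something like that under the memory.high settings listed below, while
watching zswpin/zswpout/pswpin/pswpout, should reproduce the general shape of
the numbers, if not the exact figures.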
>
> The test script runs the program under the seven configurations below.
>
> - 0: memory.high is set to 2 GiB, zswap is disabled.
> - 1-1: memory.high is set to 1350 MiB, zswap is disabled.
> - 1-2: Same as 1-1, but zswap is enabled.
> - 1-3: Same as 1-2, but save_incompressible_pages is turned on.
> - 2-1: memory.high is set to 1200 MiB, zswap is disabled.
> - 2-2: Same as 2-1, but zswap is enabled.
> - 2-3: Same as 2-2, but save_incompressible_pages is turned on.
>
> Configuration '0' shows the original memory performance.
> Configurations 1-1, 1-2 and 1-3 show the performance of swap, zswap,
> and this patch under a level of memory pressure (~10% of the working
> set).
>
> Configurations 2-1, 2-2 and 2-3 are similar to 1-1, 1-2 and 1-3, but
> show the same under a severe level of memory pressure (~20% of the
> working set).
>
> Because the per-5-seconds numbers are not very reliable, I averaged
> them over the last one-minute period of each test program run. I also
> measured a few vmstat counters, including zswpin, zswpout, zswpwb,
> pswpin and pswpout, during the test runs.
>
> The measurement results are as below. To save space, I show only
> performance numbers normalized to that of configuration '0' (no memory
> pressure). The average number of accesses per 5 seconds for
> configuration '0' was 34612740.66.
>
> config            0        1-1      1-2      1-3      2-1      2-2      2-3
> perf_normalized   1.0000   0.0060   0.0230   0.0310   0.0030   0.0116   0.0003
> perf_stdev_ratio  0.06     0.04     0.11     0.14     0.03     0.05     0.26
> zswpin            0        0        1701702  6982188  0        2479848  815742
> zswpout           0        0        1725260  7048517  0        2535744  931420
> zswpwb            0        0        0        0        0        0        0
> pswpin            0        468612   481270   0        476434   535772   0
> pswpout           0        574634   689625   0        612683   591881   0
>
> 'perf_normalized' is the performance metric, normalized to that of
> configuration '0' (no pressure). 'perf_stdev_ratio' is the standard
> deviation of the averaged data points, as a ratio of the averaged
> metric value. For example, configuration '0' showed a 6% stdev, while
> configurations 1-2 and 1-3 showed about 11% and 14% stdev. So the
> measurement is not very stable. Please keep this in mind when reading
> these results. They do show some rough patterns, though.
>
> Under about 10% of working set memory pressure, the performance dropped
> to about 0.6% of the no-pressure case when normal swap is used (1-1).
> Actually, ~10% working set pressure is not a mild one, at least on this
> test setup.
>
> Turning zswap on (1-2) improved the performance about 4x, to about 2.3%
> of the no-pressure case. Because of the incompressible pages in the
> third memory region, a significant amount of (non-zswap) swap I/O was
> still made, though.
>
> Enabling the incompressible pages handling feature introduced by this
> patch (1-3) improved performance by about another 34%, to about 3.1% of
> the no-pressure case. As expected, compression-failed pages were
> handled by zswap, and hence no (non-zswap) swap I/O was made in this
> configuration.
>
> Under about 20% of working set memory pressure, which could be
> considered extreme, the performance drops to 0.3% of the no-pressure
> case when only normal swap is used (2-1). Enabling zswap significantly
> improves it, up to 1.16%, though again with a significant amount of
> (non-zswap) swap I/O due to incompressible pages.
>
> Enabling the incompressible pages handling feature of this patch (2-3)
> reduced non-zswap swap I/O as expected, but the performance became
> worse, at 0.03% of the no-pressure case. It turned out this is because
> of the memory.high-imposed penalty_jiffies sleep. After commenting out
> the penalty_jiffies sleep code in mem_cgroup_handle_over_high(), the
> performance became higher than that of 2-2.
Yeah, this is pretty extreme. I suppose you can construct scenarios
where disk swapping delays are still better than memory.high violation
throttling.
That said, I suppose under very hard memory pressure like this, users
must make a choice anyway:
1. If they're OK with disk swapping, consider enabling zswap shrinker
:) That'll evict a bunch of compressed objects from zswap to disk swap
to avoid memory limit violations.
2. If they're not OK with disk swapping, then throttling/OOMs are
unavoidable. In fact, we're speeding up the OOM process, which is
arguably desirable? This is precisely the pathological behavior that
some of our workloads are observing in production - they spin their
wheels for a really long time trying (and failing) to reclaim
incompressible anonymous memory, hogging the CPUs.
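(For completeness: the shrinker mentioned in point 1 is the one behind the
zswap_shrinker_enabled knob visible in the context of the hunk below, so I
assume it can be flipped at runtime via
/sys/module/zswap/parameters/shrinker_enabled on kernels where
CONFIG_ZSWAP_SHRINKER_DEFAULT_ON is not set.)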
>
> 20% of working set memory pressure is pretty extreme, but the
> incompressible pages handling feature could nonetheless make things
> worse in certain setups. Hence this version of the patch adds the
> parameter for turning the feature on/off as needed, and makes it
> disabled by default.
>
> Related Works
> -------------
>
> This is not an entirely new attempt. There were a couple of similar
> approaches in the past. Nhat Pham tried essentially the same
> approach[2] in October 2023. Takero Funaki also tried a very similar
> approach[3] in April 2024.
>
> The two approaches didn't get merged, mainly due to the metadata
> overhead concern. At the beginning of this changelog, I described why I
> think that shouldn't be a problem for this change, which is
> automatically disabled when writeback is disabled.
>
> This patch is not particularly different from those, and is actually
> built upon them, though I wrote it from scratch. I have no good idea
> how to give credit to the authors of those nearly identical earlier
> attempts, and I should be responsible for maintaining this change if it
> is upstreamed, so I am taking the authorship for now. Please let me
> know if you know a better way to give them the credit they deserve.
I'd take Suggested-by if you feel bad ;)
But, otherwise, no need to add me as author! Unless you copy the OG
patch in future versions lol.
>
> [1] https://github.com/sjp38/eval_zswap/blob/master/run.sh
> [2] https://lore.kernel.org/20231017003519.1426574-3-nphamcs@gmail.com
> [3] https://lore.kernel.org/20240706022523.1104080-6-flintglass@gmail.com
>
> Signed-off-by: SeongJae Park <sj@...nel.org>
> ---
> mm/zswap.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 65 insertions(+), 8 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 7e02c760955f..e021865696c6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -45,6 +45,9 @@
> /* The number of compressed pages currently stored in zswap */
> atomic_long_t zswap_stored_pages = ATOMIC_LONG_INIT(0);
>
> +/* The number of uncompressed pages currently stored in zswap */
> +atomic_long_t zswap_stored_uncompressed_pages = ATOMIC_LONG_INIT(0);
> +
> /*
> * The statistics below are not protected from concurrent access for
> * performance reasons so they may not be a 100% accurate. However,
> @@ -129,6 +132,11 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
> CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +/* Store incompressible pages as is */
> +static bool zswap_save_incompressible_pages;
> +module_param_named(save_incompressible_pages, zswap_save_incompressible_pages,
> + bool, 0644);
> +
> bool zswap_is_enabled(void)
> {
> return zswap_enabled;
> @@ -190,7 +198,8 @@ static struct shrinker *zswap_shrinker;
> * writeback logic. The entry is only reclaimed by the writeback
> * logic if referenced is unset. See comments in the shrinker
> * section for context.
> - * pool - the zswap_pool the entry's data is in
> + * orig_data - uncompressed original data of the page, if length is PAGE_SIZE
> + * pool - the zswap_pool the entry's data is in, if length is not PAGE_SIZE
> * handle - zpool allocation handle that stores the compressed page data
> * objcg - the obj_cgroup that the compressed memory is charged to
> * lru - handle to the pool's lru used to evict pages.
> @@ -199,7 +208,10 @@ struct zswap_entry {
> swp_entry_t swpentry;
> unsigned int length;
> bool referenced;
> - struct zswap_pool *pool;
> + union {
> + void *orig_data;
> + struct zswap_pool *pool;
> + };
> unsigned long handle;
> struct obj_cgroup *objcg;
> struct list_head lru;
> @@ -500,7 +512,7 @@ unsigned long zswap_total_pages(void)
> total += zpool_get_total_pages(pool->zpool);
> rcu_read_unlock();
>
> - return total;
> + return total + atomic_long_read(&zswap_stored_uncompressed_pages);
> }
>
> static bool zswap_check_limits(void)
> @@ -805,8 +817,13 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
> static void zswap_entry_free(struct zswap_entry *entry)
> {
> zswap_lru_del(&zswap_list_lru, entry);
> - zpool_free(entry->pool->zpool, entry->handle);
> - zswap_pool_put(entry->pool);
> + if (entry->length == PAGE_SIZE) {
> + kfree(entry->orig_data);
> + atomic_long_dec(&zswap_stored_uncompressed_pages);
> + } else {
> + zpool_free(entry->pool->zpool, entry->handle);
> + zswap_pool_put(entry->pool);
> + }
> if (entry->objcg) {
> obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
> obj_cgroup_put(entry->objcg);
> @@ -937,6 +954,36 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
> mutex_unlock(&acomp_ctx->mutex);
> }
>
> +/*
> + * If compression fails, try saving the content as-is, without compression,
> + * to keep the LRU order. This can increase memory overhead from metadata,
> + * but in common zswap use cases, where there are sufficiently many
> + * compressible pages, the overhead should not be critical, and can be
> + * mitigated by writeback. Also, decompressing such entries is a plain copy.
> + *
> + * When writeback is disabled, however, the additional overhead could be
> + * problematic. In that case, just return the failure; swap_writeout() will
> + * then put the page back on the active LRU list.
> + */
> +static int zswap_handle_compression_failure(int comp_ret, struct page *page,
> + struct zswap_entry *entry)
> +{
> + if (!zswap_save_incompressible_pages)
> + return comp_ret;
> + if (!mem_cgroup_zswap_writeback_enabled(
> + folio_memcg(page_folio(page))))
> + return comp_ret;
> +
> + entry->orig_data = kmalloc_node(PAGE_SIZE, GFP_NOWAIT | __GFP_NORETRY |
> + __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
> + if (!entry->orig_data)
> + return -ENOMEM;
> + memcpy_from_page(entry->orig_data, page, 0, PAGE_SIZE);
> + entry->length = PAGE_SIZE;
> + atomic_long_inc(&zswap_stored_uncompressed_pages);
> + return 0;
> +}
> +
> static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> struct zswap_pool *pool)
> {
> @@ -976,8 +1023,11 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> */
> comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> dlen = acomp_ctx->req->dlen;
> - if (comp_ret)
> + if (comp_ret) {
> + comp_ret = zswap_handle_compression_failure(comp_ret, page,
> + entry);
> goto unlock;
> + }
>
> zpool = pool->zpool;
> gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> @@ -1009,6 +1059,11 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> int decomp_ret, dlen;
> u8 *src, *obj;
>
> + if (entry->length == PAGE_SIZE) {
> + memcpy_to_folio(folio, 0, entry->orig_data, entry->length);
> + return true;
> + }
> +
> acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
> obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
>
> @@ -1518,7 +1573,8 @@ static bool zswap_store_page(struct page *page,
> * The opposite actions will be performed by zswap_entry_free()
> * when the entry is removed from the tree.
> */
> - zswap_pool_get(pool);
> + if (entry->length != PAGE_SIZE)
> + zswap_pool_get(pool);
> if (objcg) {
> obj_cgroup_get(objcg);
> obj_cgroup_charge_zswap(objcg, entry->length);
> @@ -1535,7 +1591,8 @@ static bool zswap_store_page(struct page *page,
> * The publishing order matters to prevent writeback from seeing
> * an incoherent entry.
> */
> - entry->pool = pool;
> + if (entry->length != PAGE_SIZE)
> + entry->pool = pool;
> entry->swpentry = page_swpentry;
> entry->objcg = objcg;
> entry->referenced = true;
>
> base-commit: b771a67861b3538324c3df25d337f4713ee53e03
> --
> 2.39.5