Message-Id: <20250731165608.15347-1-sj@kernel.org>
Date: Thu, 31 Jul 2025 09:56:08 -0700
From: SeongJae Park <sj@...nel.org>
To: Nhat Pham <nphamcs@...il.com>
Cc: SeongJae Park <sj@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Johannes Weiner <hannes@...xchg.org>,
Takero Funaki <flintglass@...il.com>,
Yosry Ahmed <yosry.ahmed@...ux.dev>,
kernel-team@...a.com,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [RFC PATCH] mm/zswap: store compression failed page as-is
On Wed, 30 Jul 2025 17:48:09 -0700 Nhat Pham <nphamcs@...il.com> wrote:
> On Wed, Jul 30, 2025 at 4:41 PM SeongJae Park <sj@...nel.org> wrote:
>
> Now, taking a look at the conceptual side of the patch:
>
> >
> > When zswap writeback is enabled and zswap fails to compress a given
> > page, it lets the page be swapped out to the backing swap device.
> > This behavior breaks zswap's writeback LRU order, and hence users can
> > experience unexpected latency spikes.
> >
> > Keep the LRU order by storing the original content in zswap as-is.
> > The original content is saved in a dynamically allocated page-size
> > buffer, and the pointer to the buffer is kept in zswap_entry, in the
> > space for zswap_entry->pool. Whether the space is used for the
> > original content or the zpool is identified by
> > 'zswap_entry->length == PAGE_SIZE'.
> >
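To illustrate the idea, below is a minimal user-space sketch of the
length-based discrimination (toy code with hypothetical names, not the patch
itself): an entry whose length equals PAGE_SIZE holds the original page
verbatim, while anything shorter holds compressed data.

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct toy_entry {
	size_t length;	/* == PAGE_SIZE means "stored as-is" */
	void *data;	/* original page content or compressed buffer */
};

/* Store path: fall back to a verbatim copy when compression fails. */
static int toy_store(struct toy_entry *e, const void *page,
		     const void *cbuf, size_t clen)
{
	int as_is = (clen == 0 || clen >= PAGE_SIZE);
	size_t len = as_is ? PAGE_SIZE : clen;

	e->data = malloc(len);
	if (!e->data)
		return -1;
	memcpy(e->data, as_is ? page : cbuf, len);
	e->length = len;
	return 0;
}

/* Load path: a PAGE_SIZE-long entry needs no decompression. */
static int toy_load(const struct toy_entry *e, void *page)
{
	if (e->length == PAGE_SIZE) {
		memcpy(page, e->data, PAGE_SIZE);
		return 0;
	}
	/* otherwise, decompress e->data (e->length bytes) into page */
	return -1;
}

int main(void)
{
	struct toy_entry e;
	static char page[PAGE_SIZE], out[PAGE_SIZE];

	memset(page, 0x5a, PAGE_SIZE);
	if (toy_store(&e, page, NULL, 0))	/* pretend compression failed */
		return 1;
	return toy_load(&e, out) || memcmp(page, out, PAGE_SIZE);
}
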
> > This avoids increasing per-zswap-entry metadata overhead. But as the
> > number of incompressible pages increases, zswap metadata overhead
> > increases proportionally. The overhead should not be problematic in
> > usual cases, since the zswap metadata for a single zswap entry is much
> > smaller than PAGE_SIZE, and common zswap use cases should have a
> > sufficient amount of compressible pages. Also, it can be mitigated by
> > zswap writeback.
> >
> > When the memory pressure comes from memcg's memory.high and zswap
> > writeback is set to be triggered for that, however, the
> > penalty_jiffies sleep could degrade performance. Add a parameter,
> > namely 'save_incompressible_pages', to let users turn the feature
> > on/off as they want. It is turned off by default.
> >
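As an illustration of how the knob might look (a sketch only; the actual
patch may expose and default it differently), the parameter could be a plain
zswap module parameter, readable and writable under
/sys/module/zswap/parameters/:

#include <linux/module.h>

static bool zswap_save_incompressible_pages;
module_param_named(save_incompressible_pages,
		   zswap_save_incompressible_pages, bool, 0644);
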
> > When writeback is simply disabled, the additional overhead could be
> > problematic. For that case, keep the current behavior of just
> > returning the failure and letting swap_writeout() put the page back to
> > the active LRU list. This is known to be suboptimal when the
> > incompressible pages are cold, since zswap will continuously try to
> > swap out those pages, burning CPU cycles on compression attempts that
> > will fail anyway. But that's out of the scope of this patch.
>
> As a follow-up, we can reuse the "swapped out" page (and its struct
> page) to store in the zswap pool (and LRU).
Thank you for adding this. I will include this in the changelog of the next
version of this patch.
>
> >
> > Tests
> > -----
> >
> > I tested this patch using a simple self-written microbenchmark that
> > is available on GitHub[1]. You can reproduce the test I did by
> > executing run_tests.sh of the repo on your system. Note that the
> > repo's documentation is not good as of this writing, so you may need
> > to read the code to use it.
> >
> > The basic test scenario is simple. Run a test program that makes
> > artificial accesses to memory having artificial content under a
> > memory.high-set memory limit, and measure how many accesses were made
> > in a given time.
> >
> > The test program repeatedly and randomly accesses three anonymous
> > memory regions. The regions are all 500 MiB in size and accessed with
> > the same probability. Two of them are filled with simple content that
> > can easily be compressed, while the remaining one is filled with
> > content read from /dev/urandom, which is likely to fail compression.
> > The program runs for two minutes and prints out the number of accesses
> > made every five seconds.
> >
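For reference, below is a rough user-space sketch of the described access
pattern (not the actual benchmark from [1]; details such as reporting and
error handling differ):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NR_REGIONS	3
#define REGION_SZ	(500UL << 20)	/* 500 MiB per region */

int main(void)
{
	char *regions[NR_REGIONS];
	unsigned long accesses = 0;
	time_t end = time(NULL) + 120;	/* run for two minutes */
	FILE *urandom = fopen("/dev/urandom", "r");
	int i;

	if (!urandom)
		return 1;
	for (i = 0; i < NR_REGIONS; i++) {
		regions[i] = malloc(REGION_SZ);
		if (!regions[i])
			return 1;
		if (i < 2)	/* two regions are easily compressible */
			memset(regions[i], 'a' + i, REGION_SZ);
		else if (fread(regions[i], 1, REGION_SZ, urandom) != REGION_SZ)
			return 1;	/* one region is incompressible */
	}

	srandom(42);
	while (time(NULL) < end) {
		/* pick a region and a byte in it, with equal probability */
		char *r = regions[random() % NR_REGIONS];
		r[(unsigned long)random() % REGION_SZ]++;
		accesses++;
	}
	printf("%lu accesses\n", accesses);
	return 0;
}
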
> > The test script runs the program under the seven configurations below.
> >
> > - 0: memory.high is set to 2 GiB, zswap is disabled.
> > - 1-1: memory.high is set to 1350 MiB, zswap is disabled.
> > - 1-2: Same as 1-1, but zswap is enabled.
> > - 1-3: Same as 1-2, but save_incompressible_pages is turned on.
> > - 2-1: memory.high is set to 1200 MiB, zswap is disabled.
> > - 2-2: Same as 2-1, but zswap is enabled.
> > - 2-3: Same as 2-2, but save_incompressible_pages is turned on.
> >
> > Configuration '0' is for showing the original memory performance.
> > Configurations 1-1, 1-2 and 1-3 are for showing the performance of
> > swap, zswap, and this patch under a level of memory pressure (~10% of
> > the working set).
> >
> > Configurations 2-1, 2-2 and 2-3 are similar to 1-1, 1-2 and 1-3, but
> > show those under a severe level of memory pressure (~20% of the
> > working set).
> >
> > Because the per-5-seconds performance is not very reliable, I measured
> > its average over the last one-minute period of the test program run.
> > I also measured a few vmstat counters, including zswpin, zswpout,
> > zswpwb, pswpin and pswpout, during the test runs.
> >
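For reference, those counters can be sampled from /proc/vmstat, e.g. with a
small sketch like the below (assuming the counters exist on the running
kernel):

#include <stdio.h>
#include <string.h>

int main(void)
{
	static const char *keys[] = {
		"zswpin", "zswpout", "zswpwb", "pswpin", "pswpout",
	};
	char name[64];
	unsigned long long val;
	unsigned int i;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fscanf(f, "%63s %llu", name, &val) == 2)
		for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
			if (!strcmp(name, keys[i]))
				printf("%s %llu\n", name, val);
	fclose(f);
	return 0;
}
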
> > The measurement results are as below. To save space, I show only
> > performance numbers normalized to that of configuration '0' (no memory
> > pressure). The averaged accesses per 5 seconds of configuration '0'
> > was 34612740.66.
> >
> > config            0        1-1      1-2      1-3      2-1      2-2      2-3
> > perf_normalized   1.0000   0.0060   0.0230   0.0310   0.0030   0.0116   0.0003
> > perf_stdev_ratio  0.06     0.04     0.11     0.14     0.03     0.05     0.26
> > zswpin            0        0        1701702  6982188  0        2479848  815742
> > zswpout           0        0        1725260  7048517  0        2535744  931420
> > zswpwb            0        0        0        0        0        0        0
> > pswpin            0        468612   481270   0        476434   535772   0
> > pswpout           0        574634   689625   0        612683   591881   0
> >
> > 'perf_normalized' is the performance metric, normalized to that of
> > configuration '0' (no pressure). 'perf_stdev_ratio' is the standard
> > deviation of the averaged data points, as a ratio to the averaged
> > metric value. For example, configuration '0' performance showed a 6%
> > stdev, and configurations 1-2 and 1-3 had about 11% and 14% stdev. So
> > the measurement is not very stable. Please keep this in mind when
> > reading these results. They show some rough patterns, though.
> >
> > Under about 10% of working set memory pressure, the performance
> > dropped to about 0.6% of the no-pressure one when normal swap is used
> > (1-1). Actually, ~10% working set pressure is not mild, at least on
> > this test setup.
> >
> > By turning zswap on (1-2), the performance improved about 4x,
> > resulting in about 2.3% of the no-pressure one. Because of the
> > incompressible pages in the third memory region, a significant number
> > of (non-zswap) swap I/O operations were still made, though.
> >
> > By enabling the incompressible pages handling feature introduced by
> > this patch (1-3), about a 34% performance improvement was made,
> > resulting in about 3.1% of the no-pressure one. As expected,
> > compression-failed pages were handled by zswap, and hence no
> > (non-zswap) swap I/O was made in this configuration.
> >
> > Under about 20% of working set memory pressure, which could be deemed
> > extreme, the performance dropped down to 0.3% of the no-pressure one
> > when only normal swap is used (2-1). Enabling zswap significantly
> > improved it, up to 1.16%, though again with a significant amount of
> > (non-zswap) swap I/O due to incompressible pages.
> >
> > Enabling the incompressible pages handling feature of this patch (2-3)
> > reduced non-zswap swap I/O as expected, but the performance became
> > worse, at 0.03% of the no-pressure one. It turned out this is because
> > of the memory.high-imposed penalty_jiffies sleep. After commenting out
> > the penalty_jiffies sleep code of mem_cgroup_handle_over_high(), the
> > performance became higher than that of 2-2.
>
> Yeah this is pretty extreme. I suppose you can construct scenarios
> where disk swapping delays are still better than memory.high
> violations throttling.
>
> That said, I suppose under very hard memory pressure like this, users
> must make a choice anyway:
>
> 1. If they're OK with disk swapping, consider enabling zswap shrinker
> :) That'll evict a bunch of compressed objects from zswap to disk swap
> to avoid memory limit violations.
I agree. And while I argue this is ok when writeback is enabled, my test does
not explore that case, since it neither enables the zswap shrinker nor sets
max_pool_percent. Hence the test results always show zero zswpwb. I will
extend the test setup to explore this case.
>
> 2. If they're not OK with disk swapping, then throttling/OOMs is
> unavoidable. In fact, we're speeding up the OOM process, which is
> arguably desirable? This is precisely the pathological behavior that
> some of our workloads are observing in production - they spin the
> wheel for a really long time trying (and failing) to reclaim
> incompressible anonymous memory, hogging the CPUs.
I agree. When memory.max is appropriately set, I think the system will work
in the desired way.
>
> >
> > 20% of working set memory pressure is pretty extreme, but the
> > incompressible pages handling feature could nevertheless make things
> > worse in certain setups. Hence this version of the patch adds the
> > parameter to turn the feature on/off as needed, and makes it disabled
> > by default.
> >
> > Related Works
> > -------------
> >
> > This is not an entirely new attempt. There were a couple of similar
> > approaches in the past. Nhat Pham tried essentially the same
> > approach[2] in October 2023. Takero Funaki also tried a very similar
> > approach[3] in April 2024.
> >
> > The two approaches didn't get merged mainly due to the metadata
> > overhead concern. I described at the beginning of this changelog why
> > I think that shouldn't be a problem for this change, which is
> > automatically disabled when writeback is disabled.
> >
> > This patch is not particularly different from those, and is actually
> > built upon them. I wrote this from scratch again, though. I have no
> > good idea how to give credit to the authors of the previous,
> > nearly-identical attempts, and I should be responsible for maintaining
> > this change if it is upstreamed, so I am taking the authorship for
> > now. Please let me know if you know a better way to give them their
> > deserved credit.
>
> I'd take Suggested-by if you feel bad ;)
>
> But, otherwise, no need to add me as author! Unless you copy the OG
> patch in future versions lol.
Nice idea, thank you for suggesting Suggested-by: :)
I'll add Suggested-by: tags for you and Takero in the next version.
Thanks,
SJ
[...]