Message-Id: <20250731164323.15107-1-sj@kernel.org>
Date: Thu, 31 Jul 2025 09:43:23 -0700
From: SeongJae Park <sj@...nel.org>
To: Nhat Pham <nphamcs@...il.com>
Cc: SeongJae Park <sj@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Johannes Weiner <hannes@...xchg.org>,
Takero Funaki <flintglass@...il.com>,
Yosry Ahmed <yosry.ahmed@...ux.dev>,
kernel-team@...a.com,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [RFC PATCH] mm/zswap: store compression failed page as-is
Hi Nhat,
On Wed, 30 Jul 2025 17:21:44 -0700 Nhat Pham <nphamcs@...il.com> wrote:
> On Wed, Jul 30, 2025 at 4:41 PM SeongJae Park <sj@...nel.org> wrote:
> >
> > When zswap writeback is enabled and it fails to compress a given page,
> > zswap lets the page be swapped out to the backing swap device. This
> > behavior breaks zswap's writeback LRU order, and hence users can
> > experience unexpected latency spikes.
> >
> > Keep the LRU order by storing the original content in zswap as-is. The
> > original content is saved in a dynamically allocated page-size buffer,
> > and the pointer to the buffer is kept in zswap_entry, in the space for
> > zswap_entry->pool. Whether the space holds the original content or the
> > zpool pointer is identified by 'zswap_entry->length == PAGE_SIZE'.
[...]
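For illustration, the discriminated overlay described above might look like
the following sketch (field names and layout are assumed for illustration,
not taken from the actual patch):

	struct zswap_entry {
		/* ... other fields ... */
		unsigned int length;	/* == PAGE_SIZE: stored uncompressed */
		union {
			struct zswap_pool *pool;	/* compressed via zpool */
			void *orig_data;		/* raw copy of the page */
		};
		/* ... */
	};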
> > ---
> > mm/zswap.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 65 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 7e02c760955f..e021865696c6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
[...]
> > +/*
> > + * If compression fails, try saving the content as-is, without
> > + * compression, to keep the LRU order. This can increase the metadata
> > + * overhead, but in common zswap use cases with a sufficient amount of
> > + * compressible pages, the overhead should not be critical and can be
> > + * mitigated by writeback. Decompression is then a simple memcpy().
> > + *
> > + * When writeback is disabled, however, the additional overhead could
> > + * be problematic. In that case, just return the failure;
> > + * swap_writeout() will put the page back on the active LRU list.
> > + */
> > +static int zswap_handle_compression_failure(int comp_ret, struct page *page,
> > + struct zswap_entry *entry)
> > +{
> > + if (!zswap_save_incompressible_pages)
> > + return comp_ret;
> > + if (!mem_cgroup_zswap_writeback_enabled(
> > + folio_memcg(page_folio(page))))
> > + return comp_ret;
> > +
> > + entry->orig_data = kmalloc_node(PAGE_SIZE, GFP_NOWAIT | __GFP_NORETRY |
> > + __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
>
> Hmm, seems like this new buffer is not migratable (for compaction etc.?)
>
> My understanding is that zsmalloc's allocated memory can be migrated
> (which is why zswap only works with a handle - it's a layer of
> indirection that gives zsmalloc the ability to move memory around).
>
> Besides, why should we re-invent the wheel when zsmalloc already
> handles page-sized objects? :)
Makes sense, I will use zpool in the next version.
I actually saw that both you and Takero did so in your versions, but I didn't
realize the migration benefit of the approach. Thank you for enlightening me.
Now I think this migration benefit is important, and I will make the next
version provide migratability by reusing zpool.
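For illustration, reusing the zpool path for the uncompressed copy might look
roughly like the sketch below (a hypothetical helper; error paths are
trimmed, and the zpool_malloc()/zpool_obj_write() signatures of recent
kernels are assumed):

	/* Sketch only: store the uncompressed page via zpool so the
	 * object stays behind a handle and remains movable by zsmalloc,
	 * unlike a bare kmalloc() buffer.
	 */
	static int zswap_store_incompressible(struct page *page,
					      struct zswap_entry *entry,
					      struct zswap_pool *pool)
	{
		gfp_t gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
			    __GFP_MOVABLE;
		unsigned long handle;
		void *src;
		int err;

		err = zpool_malloc(pool->zpool, PAGE_SIZE, gfp, &handle,
				   page_to_nid(page));
		if (err)
			return err;

		src = kmap_local_page(page);
		zpool_obj_write(pool->zpool, handle, src, PAGE_SIZE);
		kunmap_local(src);

		entry->handle = handle;
		entry->length = PAGE_SIZE;
		return 0;
	}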
>
> > + if (!entry->orig_data)
> > + return -ENOMEM;
> > + memcpy_from_page(entry->orig_data, page, 0, PAGE_SIZE);
> > + entry->length = PAGE_SIZE;
> > + atomic_long_inc(&zswap_stored_uncompressed_pages);
> > + return 0;
> > +}
> > +
> > static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > struct zswap_pool *pool)
> > {
> > @@ -976,8 +1023,11 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > */
> > comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > dlen = acomp_ctx->req->dlen;
> > - if (comp_ret)
> > + if (comp_ret) {
> > + comp_ret = zswap_handle_compression_failure(comp_ret, page,
> > + entry);
> > goto unlock;
> > + }
> >
> > zpool = pool->zpool;
> > gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> > @@ -1009,6 +1059,11 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> > int decomp_ret, dlen;
> > u8 *src, *obj;
> >
> > + if (entry->length == PAGE_SIZE) {
> > + memcpy_to_folio(folio, 0, entry->orig_data, entry->length);
> > + return true;
> > + }
>
> This might not be safe.
>
> It's conceivable that in zswap_compress(), some compression algorithm
> "successfully" compresses a page to the same size (comp_ret == 0). We
> hand that to zsmalloc, which happily stores the page.
>
> When we "decompress" the page again, we will attempt to
> memcpy_to_folio from a bogus address (the handle from zsmalloc).
Makes sense, thank you for catching this.
>
> So, in zswap_compress, you have to treat both comp_ret == 0 and dlen
> == PAGE_SIZE as "compression failure".
I saw your reply saying you meant both comp_ret != 0 and dlen == PAGE_SIZE,
and yes, that makes sense. I will do so in the next version.
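Concretely, the agreed direction might look like the following in
zswap_compress() (a sketch of the discussed fix, not a final patch):

	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
				   &acomp_ctx->wait);
	dlen = acomp_ctx->req->dlen;
	if (comp_ret || dlen == PAGE_SIZE) {
		/*
		 * Pass a nonzero error even when the algorithm "succeeded"
		 * with dlen == PAGE_SIZE, so that the bail-out paths in the
		 * handler still report a failure to the caller.
		 */
		comp_ret = zswap_handle_compression_failure(
				comp_ret ? comp_ret : -EINVAL, page, entry);
		goto unlock;
	}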
Thanks,
SJ
[...]