Message-Id: <20250731164323.15107-1-sj@kernel.org>
Date: Thu, 31 Jul 2025 09:43:23 -0700
From: SeongJae Park <sj@...nel.org>
To: Nhat Pham <nphamcs@...il.com>
Cc: SeongJae Park <sj@...nel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Chengming Zhou <chengming.zhou@...ux.dev>,
	Johannes Weiner <hannes@...xchg.org>,
	Takero Funaki <flintglass@...il.com>,
	Yosry Ahmed <yosry.ahmed@...ux.dev>,
	kernel-team@...a.com,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org
Subject: Re: [RFC PATCH] mm/zswap: store compression failed page as-is

Hi Nhat,

On Wed, 30 Jul 2025 17:21:44 -0700 Nhat Pham <nphamcs@...il.com> wrote:

> On Wed, Jul 30, 2025 at 4:41 PM SeongJae Park <sj@...nel.org> wrote:
> >
> > When zswap writeback is enabled and it fails compressing a given page,
> > zswap lets the page be swapped out to the backing swap device.  This
> > behavior breaks the zswap's writeback LRU order, and hence users can
> > experience unexpected latency spikes.
> >
> > Keep the LRU order by storing the original content in zswap as-is.  The
> > original content is saved in a dynamically allocated page size buffer,
> > and the pointer to the buffer is kept in zswap_entry, on the space for
> > zswap_entry->pool.  Whether the space is used for the original content
> > or zpool is identified by 'zswap_entry->length == PAGE_SIZE'.
[...]
> > ---
> >  mm/zswap.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 65 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 7e02c760955f..e021865696c6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
[...]
> > +/*
> > + * If compression fails, try saving the content as is without
> > + * compression, to keep the LRU order.  This can increase memory overhead from
> > + * metadata, but in common zswap use cases where there is a sufficient number of
> > + * compressible pages, the overhead should not be critical, and can be
> > + * mitigated by the writeback.  Also, the decompression overhead is optimized.
> > + *
> > + * When the writeback is disabled, however, the additional overhead could be
> > + * problematic.  For the case, just return the failure.  swap_writeout() will
> > + * put the page back to the active LRU list in the case.
> > + */
> > +static int zswap_handle_compression_failure(int comp_ret, struct page *page,
> > +               struct zswap_entry *entry)
> > +{
> > +       if (!zswap_save_incompressible_pages)
> > +               return comp_ret;
> > +       if (!mem_cgroup_zswap_writeback_enabled(
> > +                               folio_memcg(page_folio(page))))
> > +               return comp_ret;
> > +
> > +       entry->orig_data = kmalloc_node(PAGE_SIZE, GFP_NOWAIT | __GFP_NORETRY |
> > +                       __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
> 
> Hmm, seems like this new buffer is not migratable (for compaction etc.?)
> 
> My understanding is that zsmalloc's allocated memory can be migrated
> (which is why zswap only works with a handle - it's a layer of
> indirection that gives zsmalloc the ability to move memory around).
> 
> Besides, why should we re-invent the wheel when zsmalloc already
> handles page-sized objects? :)

Makes sense, I will use zpool in the next version.

I saw that both you and Takero did so in your versions, but I didn't realize
the migration benefit of that approach.  Thank you for enlightening me.  I now
think the migration benefit is important, and the next version will provide
migratability by reusing zpool.
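
To illustrate the indirection point, here is a minimal userspace sketch.  It
is not zsmalloc or zpool code: every name in it (sketch_pool, sketch_store,
sketch_migrate, sketch_load, SKETCH_PAGE_SIZE) is hypothetical, and malloc()
stands in for the real allocator.  It only shows why holding a handle rather
than a raw pointer lets the backing memory move without the holder noticing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* hypothetical stand-in for PAGE_SIZE */
#define SKETCH_MAX_OBJS  16u	/* hypothetical capacity of the toy pool */

/* Toy pool: a handle is an index into a table of pointers. */
struct sketch_pool {
	void *slots[SKETCH_MAX_OBJS];
};

/* Copy one page-sized buffer into the pool; return a handle, or -1. */
static int sketch_store(struct sketch_pool *pool, const void *page)
{
	for (unsigned int i = 0; i < SKETCH_MAX_OBJS; i++) {
		if (pool->slots[i])
			continue;
		pool->slots[i] = malloc(SKETCH_PAGE_SIZE);
		if (!pool->slots[i])
			return -1;
		memcpy(pool->slots[i], page, SKETCH_PAGE_SIZE);
		return (int)i;
	}
	return -1;
}

/* Move the backing memory of an object.  The handle stays valid. */
static int sketch_migrate(struct sketch_pool *pool, int handle)
{
	void *new_mem;

	assert(handle >= 0 && (unsigned int)handle < SKETCH_MAX_OBJS);
	if (!pool->slots[handle])
		return -1;
	new_mem = malloc(SKETCH_PAGE_SIZE);
	if (!new_mem)
		return -1;
	memcpy(new_mem, pool->slots[handle], SKETCH_PAGE_SIZE);
	free(pool->slots[handle]);
	pool->slots[handle] = new_mem;	/* only the table entry changes */
	return 0;
}

/* Read the object back through its handle. */
static int sketch_load(const struct sketch_pool *pool, int handle, void *dst)
{
	assert(handle >= 0 && (unsigned int)handle < SKETCH_MAX_OBJS);
	if (!pool->slots[handle])
		return -1;
	memcpy(dst, pool->slots[handle], SKETCH_PAGE_SIZE);
	return 0;
}
```

A caller that cached the raw pointer, as the RFC does with entry->orig_data,
would be left with a dangling pointer after sketch_migrate(), so such a buffer
has to stay pinned in place.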

> 
> > +       if (!entry->orig_data)
> > +               return -ENOMEM;
> > +       memcpy_from_page(entry->orig_data, page, 0, PAGE_SIZE);
> > +       entry->length = PAGE_SIZE;
> > +       atomic_long_inc(&zswap_stored_uncompressed_pages);
> > +       return 0;
> > +}
> > +
> >  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >                            struct zswap_pool *pool)
> >  {
> > @@ -976,8 +1023,11 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >          */
> >         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> >         dlen = acomp_ctx->req->dlen;
> > -       if (comp_ret)
> > +       if (comp_ret) {
> > +               comp_ret = zswap_handle_compression_failure(comp_ret, page,
> > +                               entry);
> >                 goto unlock;
> > +       }
> >
> >         zpool = pool->zpool;
> >         gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> > @@ -1009,6 +1059,11 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >         int decomp_ret, dlen;
> >         u8 *src, *obj;
> >
> > +       if (entry->length == PAGE_SIZE) {
> > +               memcpy_to_folio(folio, 0, entry->orig_data, entry->length);
> > +               return true;
> > +       }
> 
> This might not be safe.
> 
> It's conceivable that in zswap_compress(), some compression algorithm
> "successfully" compresses a page to the same size (comp_ret == 0). We
> hand that to zsmalloc, which happily stores the page.
> 
> When we "decompress" the page again, we will attempt to
> memcpy_to_folio from a bogus address (the handle from zsmalloc).

Makes sense, thank you for catching this.

> 
> So, in zswap_compress, you have to treat both comp_ret == 0 and dlen
> == PAGE_SIZE as "compression failure".

I saw your follow-up reply clarifying that you meant both comp_ret != 0 and
dlen == PAGE_SIZE, and yes, this makes sense.  I will do so in the next version.
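
As a rough userspace sketch of that rule, not the actual next version of the
patch: the helper names and SKETCH_PAGE_SIZE below are hypothetical.  The
thread only names dlen == PAGE_SIZE; using ">=" here is my assumption, chosen
as a defensive superset of that case.  The point is that a stored length of
PAGE_SIZE must unambiguously mean "stored as-is", so the decompression path
can never mistake a compressed object for raw content.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096u	/* hypothetical stand-in for PAGE_SIZE */

/*
 * Treat both an error from the compressor and a "successful" result that
 * did not shrink the page as compression failure.
 */
static bool sketch_compression_failed(int comp_ret, unsigned int dlen)
{
	return comp_ret != 0 || dlen >= SKETCH_PAGE_SIZE;
}

/*
 * Length to record in the entry: the compressed length on success,
 * SKETCH_PAGE_SIZE when the page is kept as-is.
 */
static unsigned int sketch_stored_length(int comp_ret, unsigned int dlen)
{
	unsigned int len = sketch_compression_failed(comp_ret, dlen) ?
			   SKETCH_PAGE_SIZE : dlen;

	assert(len <= SKETCH_PAGE_SIZE);
	return len;
}

/* With the rule above, this check is true only for as-is content. */
static bool sketch_is_stored_as_is(unsigned int stored_len)
{
	return stored_len == SKETCH_PAGE_SIZE;
}
```

With this rule, a compressor that returns 0 with dlen == PAGE_SIZE takes the
same path as a compressor error, so the two cases cannot be confused later.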


Thanks,
SJ

[...]
