Message-ID: <CAF8kJuPCpnGAoz=1CBwe46ytpcR0ZUMAEWLFQce7eUWkb+0ERA@mail.gmail.com>
Date: Fri, 15 Aug 2025 19:20:35 -0700
From: Chris Li <chrisl@...nel.org>
To: SeongJae Park <sj@...nel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Chengming Zhou <chengming.zhou@...ux.dev>, 
	David Hildenbrand <david@...hat.com>, Johannes Weiner <hannes@...xchg.org>, Nhat Pham <nphamcs@...il.com>, 
	Yosry Ahmed <yosry.ahmed@...ux.dev>, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	Takero Funaki <flintglass@...il.com>, Hugh Dickins <hughd@...gle.com>
Subject: Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is

On Fri, Aug 15, 2025 at 5:07 PM SeongJae Park <sj@...nel.org> wrote:
>
> On Fri, 15 Aug 2025 15:28:59 -0700 Chris Li <chrisl@...nel.org> wrote:
> > Sure. I am most interested in getting the best overall solution. No
> > objections to getting it now vs. later.
>
> Thank you for being flexible, Chris.  I'm also looking forward to
> continuing to collaborate with you on the followup work!

Likewise.

>
> [...]
> > > > Is there any reason you want to store the page in zpool when the
> > > > compression engine (not the ratio) fails?
> > >
> > > The main problem this patch tries to solve is the LRU order corruption.  In any
> > > case, I want to keep the order if possible.
> >
> > In that case, does it mean that in order to keep the LRU order, you
> > never want to write from zswap to a real backend device?
>
> Not always, but not until we have to write back zswapped pages to a real
> backend device, and all zswapped pages of lower LRU-order have already
> been written back.

I see, that makes sense. Only write to the lower tier while
maintaining LRU order.
Thanks for the explanation. I can see the bigger picture now.

> > I slept on it a bit. Now I think we should make this a counter of
> > how many uncompressed pages are stored in zswap. Preferably as a per
> > memcg counter.
>
> I agree that could be useful.  I will add the counter in the next version (v4).
> But making it per-memcg may be out of the scope of this patch, in my
> opinion.  Would you mind doing the per-memcg counter implementation as a
> followup?

No objections. If the incompressible page count is very bad, it will
drag down the overall compression ratio as well, so we do have some
per-memcg signal to pick it up. The trade-off is how much complexity
we want to take on for those marginal, finer-grained measurements.

> > I saw that you implemented it as a counter in your v1.
>
> Yes, though it was only for internal usage and therefore not exposed to
> user space.  I will add it again and expose it to user space via debugfs.
> Say, stored_uncompressed_pages?

I am not very good at naming either. Maybe
"stored_incompressible_pages"? "Uncompressed" describes the resulting
state; it sounds like we chose to store them uncompressed. The root
cause is that those pages are incompressible. That is the nature of
the page; we have no choice about compressing them.

>
> > Does the zsmalloc
> > already track this information in the zspool class?
>
> zsmalloc provides such information when CONFIG_ZSMALLOC_STAT is enabled, to my
> understanding.

Thanks. I think that is likely global, not per memcg.

>
> [...]
> > I am actually less interested in the absolute failure number, which
> > keeps increasing, and more in how much incompressible data is stored
> > in zswap. That counter + the number of engine errors should be perfect.
>
> Sounds good.  So the next version (v4) of this patch will provide two new
> debugfs counters, namely compress_engine_fail and stored_uncompressed_pages.

Sounds good. Maybe "crypto_compress_fail"? "Engine" is just the common
name I use for it; here the engine refers to the crypto library, so
"crypto_compress_fail" matches the code better.

>
> [...]
> > > I agree this code is nice to read.  Nonetheless I have to say the behavior is
> > > somewhat different from what I want.
> >
> > Even if you keep the current behavior, you can invert the test
> > condition and then remove the "else + goto", similar to the above.
> > That will make your code shorter and flatter. I will need to think
> > about whether we can assign the return value less often.
>
> Nice idea.  What about below?
>
>         if (comp_ret) {

You can consider adding (comp_ret || !dlen), because if dlen == 0,
something is seriously wrong with the compression; that is a real
error.

>                 zswap_compress_engine_fail++;
>                 dlen = PAGE_SIZE;
>         }
>         if (dlen >= PAGE_SIZE) {
>                 zswap_compress_fail++;
>                 if (!mem_cgroup_zswap_writeback_enabled(
>                                         folio_memcg(page_folio(page)))) {
>                         comp_ret = -EINVAL;
>                         goto unlock;
>                 }
>                 comp_ret = 0;
>                 dlen = PAGE_SIZE;
>                 dst = kmap_local_page(page);
>         }

Looks very good to me. Much cleaner than the V2. Thanks for adopting that.
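
For reference, folding the suggested !dlen check into the first branch
would look like this (just a sketch, reusing the variables from your
snippet above):

	/* A zero-length result means the engine itself failed, not the
	 * compression ratio, so count it as an engine error. */
	if (comp_ret || !dlen) {
		zswap_compress_engine_fail++;
		dlen = PAGE_SIZE;
	}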

> > Another point I would like to make is that you currently set the
> > cut-off threshold at page size. The ideal threshold might be something
> > slightly smaller than page size. The reason is that zsmalloc uses
> > fixed-size chunks to store objects. If your compressed page is close
> > to a full page, zsmalloc will store the data in the same size class as
> > a full page. You are not gaining anything from zsmalloc if the stored
> > page's compression ratio is 95%; that 5% will be wasted in zsmalloc
> > class fragmentation anyway. So the trade-off is: decompress 95% of a
> > page vs. memcpy 100% of a page. The 100% memcpy is likely faster. I
> > don't know the exact ideal cut-off threshold. If I had to guess, I
> > would guess below 90%.
>
> Agreed, this could be another nice followup work to do.

Ack.
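
To make the idea concrete, a cut-off below PAGE_SIZE could look
something like this (illustrative only: the macro name and the ~90%
value are guesses, not measured numbers):

/* Near-full objects land in the same zsmalloc size class as a whole
 * page, so compressing to ~95% buys nothing over a plain memcpy. */
#define ZSWAP_INCOMPRESSIBLE_THRESHOLD ((PAGE_SIZE * 9) / 10)

	if (dlen >= ZSWAP_INCOMPRESSIBLE_THRESHOLD) {
		/* Treat as incompressible: store the page as-is. */
		zswap_compress_fail++;
		dlen = PAGE_SIZE;
		dst = kmap_local_page(page);
	}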

>
> >
> > >
> > > > I can
> > > > be biased towards my own code :-).
> > > > I think we should treat the compression engine failure separately from
> > > > the compression ratio failure. The engine failure is rare and we should
> > > > know about it as a real error. The compression ratio failure is normal.
> > >
> > > Again, I agree having the observability would be nice.  I can add a new
> > > counter for that, like the attached patch below, for example.
> >
> > I would love that. Make that per memcg if possible :-)
>
> As mentioned above, I would like to make only a global counter on debugfs for
> now, if you don't mind.  Let me know if you mind.

I don't mind. We can take the incremental approach.

Chris
