lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20250816000701.90784-1-sj@kernel.org>
Date: Fri, 15 Aug 2025 17:07:01 -0700
From: SeongJae Park <sj@...nel.org>
To: Chris Li <chrisl@...nel.org>
Cc: SeongJae Park <sj@...nel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Chengming Zhou <chengming.zhou@...ux.dev>,
	David Hildenbrand <david@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Nhat Pham <nphamcs@...il.com>,
	Yosry Ahmed <yosry.ahmed@...ux.dev>,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	Takero Funaki <flintglass@...il.com>,
	Hugh Dickins <hughd@...gle.com>
Subject: Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is

On Fri, 15 Aug 2025 15:28:59 -0700 Chris Li <chrisl@...nel.org> wrote:

> On Wed, Aug 13, 2025 at 11:20 AM SeongJae Park <sj@...nel.org> wrote:
[...]
> We might still wait to evaluate the lost/gain vs store the
> incompressible in swap cache. Google actually has an internal patch to
> store the incompressible pages in swap cache and move them out of the
> LRU to another already existing list. I can clean it up a bit and send
> it to the list for comparison.

This would be really nice!

[...]
> > So, if we agree on my justification about the metadata overhead, I think this
> > could be done as a followup work of this patch?
> 
> Sure. I am most interested in getting the best overall solution. No
> objects to get it now vs later.

Thank you for being flexible, Chris.  I'm also looking forward to keep
collaborating with you on the followup works!

[...]
> > > Is there any reason you want to store the page in zpool when the
> > > compression engine (not the ratio) fails?
> >
> > The main problem this patch tries to solve is the LRU order corruption.  In any
> > case, I want to keep the order if possible.
> 
> In that case, does it mean that in order to keep the LRU, you never
> want to write from zswap to a real back end device?

Not always, but until we have to write back zswapped pages to real back end
deevice, and all zswapped pages of lower LRU-order are already wrote back.

[...]
> > >      if (dlen >= PAGE_SIZE) {
> > >         zswap_compress_fail++;
> >
> > I define compress failure here as a failure of attempt to compress the given
> > page's content into a size smaller than PAGE_SIZE.  It is a superset including
> > both "ratio" failure and "engine" failure.  Hence I think zswap_compress_fail
> > should be increased even in the upper case.
> 
> I slept over it a bit. Now I think we should make this a counter of
> how many uncompressed pages count stored in zswap. Preperbelly as per
> memcg counter.

I agree that could be useful.  I will add the counter in the next version (v4).
But making it for each memcg may be out of the scope of this patch, in my
opinion.  Would you mind doing per-memcg counter implementation as a followup?

> I saw that you implement it as a counter in your V1.

Yes, though it was only for internal usage and therefore not exposed to the
user space.  I will make it again and expose to the user space via debugfs.
Say, stored_uncompressed_pages?

> Does the zsmalloc
> already track this information in the zspool class?

zsmalloc provides such information when CONFIG_ZSMALLOC_STAT is enabled, to my
understanding.

[...]
> I am actually less interested in the absolute failure number which
> keeps increasing, more on how much incompressible zswap is stored.
> That counter + number of engine errors should be perfect.

Sounds good.  So the next version (v4) of this patch will provide two new
debugfs counters, namely compress_engine_fail, and stored_uncompressed_pages.

[...]
> > I agree this code is nice to read.  Nonetheless I have to say the behavior is
> > somewhat different from what I want.
> 
> Even if you keep the current behavior, you can move the invert the
> test condition and then remove the "else + goto" similar to the above.
> That will make your code less and flatter. I will need to think about
> whether we can assign the return value less.

Nice idea.  What about below?

        if (comp_ret) {
                zswap_compress_engine_fail++;
                dlen = PAGE_SIZE;
        }
        if (dlen >= PAGE_SIZE) {
                zswap_compress_fail++;
                if (!mem_cgroup_zswap_writeback_enabled(
                                        folio_memcg(page_folio(page)))) {
                        comp_ret = -EINVAL;
                        goto unlock;
                }
                comp_ret = 0;
                dlen = PAGE_SIZE;
                dst = kmap_local_page(page);
        }

> 
> Another point I would like to make is that you currently make the cut
> off threshold as page size. The ideal threshold might be something
> slightly smaller than page size. The reason is that the zsmalloc has a
> fixed size chuck to store it. If your page is close to full, it will
> store the data in the same class as the full page. You are not gaining
> anything from zsmalloc if the store page compression ratio is 95%.
> That 5% will be in the waste in the zsmalloc class fragment anyway. So
> the trade off is, decompress 95% of a page vs memcpy 100% of a page.
> It is likely memcpy 100% is faster. I don't know the exact ideal cut
> off of threshold. If I had to guess, I would guess below 90%.

Agreed, this could be another nice followup work to do.

> 
> >
> > > I can
> > > be biased towards my own code :-).
> > > I think we should treat the compression engine failure separately from
> > > the compression rate failure. The engine failure is rare and we should
> > > know about it as a real error. The compression rate failure is normal.
> >
> > Again, I agree having the observability would be nice.  I can add a new counter
> > for that, like below attached patch, for example.
> 
> I would love that. Make that per memcg if possible :-)

As mentioned above, I would like to make only a global counter on debugfs for
now, if you don't mind.  Let me know if you mind.


Thanks,
SJ

[...]

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ