Message-ID: <CAF8kJuO3-Wuf567cLmudG9+ioaB+u8Yt7EiFaMMROK8R-2KCUA@mail.gmail.com>
Date: Fri, 15 Aug 2025 17:30:22 -0700
From: Chris Li <chrisl@...nel.org>
To: Nhat Pham <nphamcs@...il.com>
Cc: SeongJae Park <sj@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>, 
	Chengming Zhou <chengming.zhou@...ux.dev>, David Hildenbrand <david@...hat.com>, 
	Johannes Weiner <hannes@...xchg.org>, Yosry Ahmed <yosry.ahmed@...ux.dev>, linux-kernel@...r.kernel.org, 
	linux-mm@...ck.org, Takero Funaki <flintglass@...il.com>, Hugh Dickins <hughd@...gle.com>
Subject: Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is

On Fri, Aug 15, 2025 at 4:09 PM Nhat Pham <nphamcs@...il.com> wrote:
>

> > Keep in mind that it is more than just metadata. There is another
> > hidden overhead to using zpool to store your full page compared to
> > using the page directly. When the page is read, the zpool backend
> > storage is most likely not aligned to a page boundary, so that zpool
> > page cannot be freed immediately; those fragmented zpool pages have
> > to wait for compaction to be freed.
>
> It's been a while since I last worked on zsmalloc, but IIRC zsmalloc
> handles these PAGE_SIZE-sized objects by just handing out a full
> page. I skimmed through the zsmalloc code, and it seems like for this
> size_class, each zspage contains just one base page?

Yes, they have to be full size for the size class that holds full-size
objects. That is exactly what I expected. My earlier claim that the
page might get fragmented and not be releasable from zpool is wrong,
now that you mention it, precisely because of the full size. That is
also why I made the other point that the ideal cutoff threshold can be
less than PAGE_SIZE: e.g. if a page compresses to 95% of a full page,
you might as well store it uncompressed, since the remaining 5% would
be wasted in the size-class margin anyway.

>
> (check out the logic around ZsHugePage and
> calculate_zspage_chain_size() in mm/zsmalloc.c. Confusing name, I
> know).
>
> So at free time we won't need compaction to get rid of these buffers
> for incompressible pages. There might still be some extra overhead
> (metadata), but at least I assume we can free these pages easily?

Maybe, or steal them into the swap cache.
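For reference, the size-class behavior Nhat describes can be sketched
as an illustrative re-derivation (this is not the mm/zsmalloc.c code;
the constant and function below are simplified stand-ins):

```c
#define PAGE_SIZE 4096UL
#define ZS_MAX_PAGES_PER_ZSPAGE 4UL

/* Simplified version of the idea behind zsmalloc's
 * calculate_zspage_chain_size(): chain the number of pages that
 * minimizes leftover space per zspage.  For class_size == PAGE_SIZE
 * the leftover is zero with a single page, so a "huge" object occupies
 * exactly one page and can be freed without waiting for compaction. */
static unsigned long chain_size(unsigned long class_size)
{
	unsigned long i, waste, best = 1, min_waste = PAGE_SIZE;

	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
		waste = (i * PAGE_SIZE) % class_size;
		if (waste < min_waste) {
			min_waste = waste;
			best = i;
		}
	}
	return best;
}
```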

> > We might still want to evaluate the loss/gain versus storing the
> > incompressible pages in the swap cache. Google actually has an
> > internal patch that stores incompressible pages in the swap cache
> > and moves them out of the LRU to another already-existing list. I
> > can clean it up a bit and send it to the list for comparison. This
> > approach will not mess with the zswap LRU because the incompressible
> > data does not move into zswap, and there is no page fault to handle
> > for incompressible pages.
>
> We can always iterate to get the best of all worlds :) One objective at a time.
>
> BTW, may I ask - how/when do we move the incompressible pages out of
> the "incompressible LRU"? I believe there needs to be some sort of
> scanning mechanism to detect dirtying, right?
>
> >
> > > slightly improve it for the incompressible pages case by skipping the
> > > decompression.  Also, if we take this path, I think we would also need
> > > a good answer for the zswapped-page migrations etc.
> >
> > Yes, if we keep the store-page-in-zswap approach, we might have to
> > optimize inside zsmalloc to handle full-page storing better.

Sure. Ack.

>
> I believe it already does - check my response above. But I can be wrong.
>

> > I slept on it a bit. Now I think we should make this a counter of
> > how many uncompressed pages are stored in zswap, preferably as a
> > per-memcg counter.
>
> Actually, yeah I asked about this counter in a review in an earlier
> version as well, then I completely forgot about it :)

Ack.

>
>
> > I saw that you implemented it as a counter in your V1. Does zsmalloc
> > already track this information in the zspool class? Having this per
>
> Kinda sorta. If we build the kernel with CONFIG_ZSMALLOC_STAT, we can
> get the number of objects in each size_class.
>
> Each time we read, I believe we have to read every size class though.
> So it's kinda annoying. Whereas here, we can just read an atomic
> counter? :)
>

Whether the data is compressible is very dependent on the job, so a
per-memcg counter will be more useful than a global counter. Playing
devil's advocate against the per-memcg incompressible counter: if most
of the data is not compressible, the per-memcg overall compression
ratio (compressed size vs. size before compression) can reflect that
as well. It depends on how fine-grained we want this data to be.
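A minimal sketch of the counter under discussion, so we are talking
about the same thing: the struct and field names below are
hypothetical, and in the kernel this would be an atomic_long_t or a
memcg vmstat item rather than a C11 atomic:

```c
#include <stdatomic.h>

/* Hypothetical per-memcg count of pages stored uncompressed in zswap. */
struct memcg_zswap_stats {
	atomic_long incompressible_pages;
};

/* Bump the counter at store time (n = 1) or drop it at free/writeback
 * time (n = -1). */
static void zswap_incompressible_add(struct memcg_zswap_stats *s, long n)
{
	atomic_fetch_add(&s->incompressible_pages, n);
}

static long zswap_incompressible_read(struct memcg_zswap_stats *s)
{
	return atomic_load(&s->incompressible_pages);
}
```

Reading one atomic per memcg is cheaper than walking every zsmalloc
size class under CONFIG_ZSMALLOC_STAT, which matches Nhat's point
above.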

> > cgroup information we can evaluate how much zswap is saving. Some
> > jobs like databases might store a lot of hash values and encrypted
> > data which is not compressible. In that case, we might just pass the whole
>
> Hmmm actually, this might be a good point. Lemme think about it a bit more.
>
> (but again, worst case scenario, we can send a follow up patch to add
> these counters).

Agree.

Thanks

Chris
