Message-ID: <CAKEwX=MMgOb+7tDxPXDCcBxCEztvajDa0YfLBqgGuAr8GQ4s1A@mail.gmail.com>
Date: Fri, 28 Nov 2025 12:46:17 -0800
From: Nhat Pham <nphamcs@...il.com>
To: Chris Li <chrisl@...nel.org>
Cc: Rik van Riel <riel@...riel.com>, Johannes Weiner <hannes@...xchg.org>, 
	Andrew Morton <akpm@...ux-foundation.org>, Kairui Song <kasong@...cent.com>, 
	Kemeng Shi <shikemeng@...weicloud.com>, Baoquan He <bhe@...hat.com>, 
	Barry Song <baohua@...nel.org>, Yosry Ahmed <yosry.ahmed@...ux.dev>, 
	Chengming Zhou <chengming.zhou@...ux.dev>, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	pratmal@...gle.com, sweettea@...gle.com, gthelen@...gle.com, 
	weixugc@...gle.com
Subject: Re: [PATCH RFC] mm: ghost swapfile support for zswap

On Thu, Nov 27, 2025 at 11:10 AM Chris Li <chrisl@...nel.org> wrote:
>
> On Thu, Nov 27, 2025 at 6:28 AM Rik van Riel <riel@...riel.com> wrote:
> >
> > Sorry, I am talking about upstream.
>
> So far I have not had a pleasant upstream experience when submitting
> this particular patch to upstream.
>
> > I really appreciate anybody participating in Linux
> > kernel development. Linux is good because different
> > people bring different perspectives to the table.
>
> Of course everybody is welcome. However, NACK without technical
> justification is very bad for upstream development. I can't imagine
> what a new hacker would think after going through what I have gone
> through for this patch. He/she will likely quit contributing upstream.
> This is not the kind of welcome we want.
>
> Nhat needs to be able to technically justify his NACK as a maintainer.
> Sorry there is no other way to sugar coat it.

I am NOT the only zswap maintainer who has expressed concerns. Other
people have their misgivings too, so I will let them speak for
themselves rather than put words in their mouths.

But since you have repeatedly singled me out, I will repeat my concerns here:

1. I don't like the operational overhead of a static swapfile: having
to statically size the zswap swapfile for each <host x workload>
combination. Misspecifying the swapfile size can lead to unacceptable
swap metadata overhead on small machines, or underutilization of zswap
on big machines (see the rough sketch after this list). And it is
*impossible* to know how much zswap will be needed ahead of time, even
if we fix the host - it depends on the workload's access patterns,
memory compressibility, and latency/memory pressure tolerance.

2. I don't like the maintenance overhead of supporting special
infrastructure for a very specific use case (i.e., no-writeback),
especially since I'm not convinced this can be turned into a general
architecture. See below.

3. I want to move us towards a more dynamic architecture for zswap.
This is a step in the WRONG direction.

4. I don't believe this buys us anything we can't already do with
userspace hacking. Again, zswap-over-zram (or whatever RAM-only swap
option you prefer), with writeback disabled, is 2-3 lines of script.
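
To make concern 1 concrete, here is a rough back-of-the-envelope
sketch. The ~3 bytes/slot figure is an assumption (a 1-byte swap_map
entry plus roughly 2 bytes of swap cgroup record per slot); the exact
number varies by kernel version and config, so treat this as
illustration only:

/*
 * Illustration only: approximate statically allocated swap metadata
 * for a ghost swapfile, assuming ~3 bytes per 4 KiB slot.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long swapfile_bytes = 64ULL << 30; /* 64 GiB ghost swapfile */
	unsigned long long slots = swapfile_bytes >> 12; /* 4 KiB slots */
	unsigned long long per_slot = 3;                 /* assumed bytes per slot */

	printf("slots: %llu, static metadata: ~%llu MiB\n",
	       slots, (slots * per_slot) >> 20);
	/* ~16.8M slots -> ~48 MiB, paid up front whether or not zswap fills it */
	return 0;
}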

I believe I have already justified myself well enough :) It is you
who have not really convinced me that this is, at the very least, a
temporary/first step towards a long-term generalized architecture for
zswap. Every time we pointed out an issue, you seemed to justify it
with more vague ideas that only deepen the confusion.

Let's recap the discussion so far:

1. We claimed that this architecture is hard to extend for efficient
zswap writeback, or backend transfer in general, without incurring
page table updates. You claimed that you plan to implement a
redirection entry to solve this.

2. We then pointed out that inserting a redirect entry into the
current physical swap infrastructure will leave holes in the upper
swap tier's address space, which is arguably *worse* than the status
quo of zswap occupying disk swap space. Again, you pulled out vague
ideas about "frontend" and "backend" swap, which, frankly, are
conceptually very similar to swap virtualization.

3. The dynamic sizing of swap space is treated with the same rigor
(or, more accurately, lack thereof): just more handwaving about
"frontend" vs "backend" swap (which, again, is very close to swap
virtualization). This requirement is a deal breaker for me - see
concern 1 above.

4. We also pointed out that swapoff optimization seems to be missing
from your design. Again, more vagueness about rmap, which probably
means more overhead.

Look man, I'm not being hostile to you. Believe me on this - I
respect your opinion, and I'm working very hard on reducing the memory
overhead of virtual swap, to see if I can meet you where you want it
to be. The inefficient memory usage in the RFC's original design was
due to:

a) Readability. Space optimization can make code hard to read when
fields are squeezed into the same int/long variable, so I just used a
separate field for each piece of metadata.

b) I was playing with synchronization optimizations, i.e., using
atomics instead of locks, and using per-entry locks. But I can go back
to a per-cluster lock (I had not implemented the cluster allocator at
the time of the RFC, but my latest version has it), which will further
reduce the memory overhead by removing a couple of fields and packing
more fields together - see the sketch below.
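
For illustration, a minimal sketch of what (b) could look like. All
names here (vswap_cluster, vswap_entry, VSWAP_CLUSTER_SIZE) are
hypothetical, not the actual patch:

/*
 * Hypothetical sketch: one lock per cluster of entries instead of
 * per-entry locks/atomics, so each entry carries no lock field of
 * its own.
 */
#include <linux/spinlock.h>

#define VSWAP_CLUSTER_SIZE 512		/* assumed entries per cluster */

struct vswap_entry {
	unsigned long desc;		/* packed backend descriptor, see below */
};

struct vswap_cluster {
	spinlock_t lock;		/* protects every entry in the cluster */
	struct vswap_entry entries[VSWAP_CLUSTER_SIZE];
};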

The only non-negotiable per-swap-entry overhead will be a field
indicating the backend location (physical swap slot, zswap entry,
etc.), plus 2 bits indicating the swap type. With some field union-ing
or pointer tagging magic, we can perhaps squeeze it even harder.
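
As a rough illustration of the kind of packing I mean (a hypothetical
layout, not the actual patch - the VSWAP_TYPE_* names, the helpers and
the exact encoding are all assumptions):

/*
 * Hypothetical sketch: one word per swap entry, with the backend
 * location in the upper bits and a 2-bit backend type tag in the low
 * bits (valid as long as the locations we store are at least 4-byte
 * aligned).
 */
#define VSWAP_TYPE_MASK		0x3UL
#define VSWAP_TYPE_NONE		0x0UL	/* unallocated */
#define VSWAP_TYPE_PHYS		0x1UL	/* physical swap slot */
#define VSWAP_TYPE_ZSWAP	0x2UL	/* zswap entry */

static inline unsigned long vswap_pack(unsigned long loc, unsigned long type)
{
	return (loc & ~VSWAP_TYPE_MASK) | type;
}

static inline unsigned long vswap_type(unsigned long desc)
{
	return desc & VSWAP_TYPE_MASK;
}

static inline unsigned long vswap_loc(unsigned long desc)
{
	return desc & ~VSWAP_TYPE_MASK;
}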

I'm also working on reducing the CPU overhead - re-partitioning the
swap architecture (swap cache, zswap tree) and reducing unnecessary
xarray lookups where possible.

We can then benchmark, and attempt to optimize it together as a community.
