Message-ID: <gkbiqbml5ikvmqybklyzsff46gemhtczm374f4qx54y5glagru@elgd2tfmoezt>
Date: Wed, 3 Dec 2025 08:37:01 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Chris Li <chrisl@...nel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Kairui Song <kasong@...cent.com>, Kemeng Shi <shikemeng@...weicloud.com>,
Nhat Pham <nphamcs@...il.com>, Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>,
Johannes Weiner <hannes@...xchg.org>, Chengming Zhou <chengming.zhou@...ux.dev>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, pratmal@...gle.com, sweettea@...gle.com, gthelen@...gle.com,
weixugc@...gle.com
Subject: Re: [PATCH RFC] mm: ghost swapfile support for zswap
On Fri, Nov 21, 2025 at 01:31:43AM -0800, Chris Li wrote:
> The current zswap implementation requires a backing swapfile. A swap
> slot used by zswap cannot be used by the swapfile, which wastes
> swapfile space.
>
> A ghost swapfile is a swapfile that contains only the swapfile header,
> for use by zswap. The swapfile header indicates the size of the
> swapfile. There is no swap data section in a ghost swapfile, so no
> swapfile space is wasted. As such, any write to a ghost swapfile will
> fail. To prevent accidental reads or writes of a ghost swapfile, the
> bdev of its swap_info_struct is set to NULL. A ghost swapfile also
> sets the SSD flag, because there is no rotating disk access when
> using zswap.
>
> Zswap writeback is disabled if all swapfiles in the system are ghost
> swapfiles.
>
> Signed-off-by: Chris Li <chrisl@...nel.org>
I did not know which subthread to reply to at this point, so I am just
replying to the main thread. I have been trying to stay out of this for
various reasons, but I was mentioned a few times and I also think this
is getting out of hand tbh.
First of all, I want to clarify that I am not "representing" any entity
here, I am speaking as an upstream zswap maintainer. Obviously I have
Google's interests in mind, but I am not representing Google here.
Second, Chris keeps bringing up that the community picked and/or
strongly favored the swap table approach over virtual swap back in 2023.
I just want to make it absolutely clear that this was NOT my read of the
room, and I do not think that the community really made a decision or
favored any approach back then.
Third, Chris, please stop trying to force this into a company vs company
situation. You keep mentioning personal attacks, but you are making this
personal more than anyone in this thread by taking this approach.
Now with all of that out of the way, I want to try to salvage the
technical discussion here. Taking several steps back, and
oversimplifying a bit: Chris mentioned having a frontend and backend and
an optional redirection when a page is moved between swap backends. This
is conceptually the same as the virtual swap proposal.
I think the key difference here is:
- In Chris's proposal, we start with a swap entry that represents a swap
slot in swapfile A. If we do writeback (or swap tiering), we create
another swap entry in swapfile B, and have the first swap entry point
to it instead of the slot in swapfile A. If we want to reuse the swap
slot in swapfile A, we create a new swap entry that points to it.
So we start with a swap entry that directly maps to a swap slot, and
optionally put a redirection there to point to another swap slot for
writeback/tiering.
Everything is a swapfile, even zswap will need to be represented by a
separate (ghost) swapfile.
- In the virtual swap proposal, swap entries are in a completely
different space than swap slots. A swap entry points to an arbitrary
swap slot (or zswap entry) from the beginning, and writeback (or
tiering) does not change that, it only changes what is being pointed
to.
Regarding memory overhead (assuming x86_64), Chris's proposal has 8
bytes per entry in the swap table that is used to hold both the swap
count as well as the swapcache or shadow entry. Nhat's RFC for virtual
swap had 48 bytes of overhead, but that's a PoC of a specific
implementation.
Disregarding any specific implementation, any space optimizations that
can be applied to the swap table (e.g. combining swap count and
swapcache in an 8 byte field) can also be applied to virtual swap. The
only *real* difference is that with virtual swap we need to store the
swap slot (or zswap entry), while for the current swap table proposal it
is implied by the index of the entry. That's an additional 8 bytes.
So I think a fully optimized implementation of virtual swap could end up
with an overhead of 16 bytes per-entry. Everything else (locks,
rcu_head, etc) can probably be optimized away by using similar
optimizations as the swap table (e.g. do locking and alloc/freeing in
batches). In fact, I think we can use the swap table as the allocator in
the virtual swap space, reusing all the locking and allocation
optimizations. The difference would be that the swap table is indexed by
the virtual swap ID rather than the swap slot index.
Another important aspect: in the simple case the swap table does have
lower overhead than virtual swap (8 bytes vs 16 bytes). The difference
isn't large to begin with, and I don't think it always holds. It's only
true for the simple case of a swapped-out page sitting on a disk
swapfile or in a zswap (ghost) swapfile.
Once a page is written back from zswap to disk swapfile, in the swap
table approach we'll have two swap table entries. One in the ghost
swapfile (with a redirection), and one in the disk swapfile. That's 16
bytes, equal to the overhead of virtual swap.
Now imagine a scenario where we have zswap, SSD, and HDD swapfiles with
tiering. If a page goes to zswap, then SSD, then HDD, we'll end up with
3 swap table entries for a single swapped out page. That's 24 bytes. So
the memory overhead is not really constant, it scales with the number of
tiers (as opposed to virtual swap).
Another scenario is where we have SSD and HDD swapfiles with tiering. If
a page starts in SSD and goes to HDD, we'll have two swap table entries
for it (as above). The SSD entry would be wasted (has a redirection),
but Chris mentioned that we can fix this by allocating another frontend
cluster that points at the same SSD slot. How does this fit in the
8-byte swap table entry, though? The 8 bytes can only hold the swapcache
or shadow (and swap count), but not the swap slot. For the current
implementation, the slot is implied by the swap table index, but if we
have separate front end swap tables, then we'll also need to store the
actual slot.
We can work around this by having different types of clusters and swap
tables, where "virtual" clusters have 16 bytes instead of 8 bytes per
entry, sure. But at that point we've added significantly more
complexity to end up where virtual swap would have put us.
Chris, Johannes, Nhat -- please correct me if I am wrong here or if I
missed something. I think the current swap table work by Kairui is
great, and we can reuse it for virtual swap (as I mentioned above). But
I don't think forcing everything to use a swapfile and extending swap
tables to support indirections and frontend/backend split is the way to
go (for the reasons described above).