Message-ID: <CACePvbX=5m5VuXN2V2_DB1sbNw+tEu=BKBxnuEXmY0V+hQS2_w@mail.gmail.com>
Date: Thu, 4 Dec 2025 00:02:47 +0400
From: Chris Li <chrisl@...nel.org>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Kairui Song <kasong@...cent.com>,
Kemeng Shi <shikemeng@...weicloud.com>, Nhat Pham <nphamcs@...il.com>,
Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>, Johannes Weiner <hannes@...xchg.org>,
Chengming Zhou <chengming.zhou@...ux.dev>, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
pratmal@...gle.com, sweettea@...gle.com, gthelen@...gle.com,
weixugc@...gle.com
Subject: Re: [PATCH RFC] mm: ghost swapfile support for zswap
On Wed, Dec 3, 2025 at 12:37 PM Yosry Ahmed <yosry.ahmed@...ux.dev> wrote:
>
> On Fri, Nov 21, 2025 at 01:31:43AM -0800, Chris Li wrote:
> > The current zswap requires a backing swapfile. The swap slot used
> > by zswap cannot be used by the swapfile itself, which wastes
> > swapfile space.
> >
> > The ghost swapfile is a swapfile that contains only the swapfile
> > header, for use by zswap. The swapfile header indicates the size of
> > the swapfile. There is no swap data section in the ghost swapfile,
> > and therefore no waste of swapfile space. As such, any write to a
> > ghost swapfile will fail. To prevent accidental reads or writes of a
> > ghost swapfile, the bdev of swap_info_struct is set to NULL. A ghost
> > swapfile also sets the SSD flag, because there is no rotating disk
> > access when using zswap.
> >
> > Zswap writeback is disabled if all swapfiles in the system are
> > ghost swapfiles.
> >
> > Signed-off-by: Chris Li <chrisl@...nel.org>
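To make the mechanism concrete: the detection at swapon time boils
down to something like the sketch below. This is a simplified
illustration, not the exact diff; swap_is_ghost() is a hypothetical
helper standing in for however the header marks a ghost file.

/* Sketch only: handle a ghost swapfile at swapon time. */
static int setup_ghost_swapfile(struct swap_info_struct *si,
                                union swap_header *swap_header)
{
        if (!swap_is_ghost(swap_header))        /* hypothetical helper */
                return 0;

        /*
         * There is no data section behind the header, so any read
         * or write must fail. Clearing bdev prevents accidental
         * block IO to the file.
         */
        si->bdev = NULL;

        /* No rotating disk access happens with zswap-only storage. */
        si->flags |= SWP_SOLIDSTATE;

        return 1;
}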
>
> I did not know which subthread to reply to at this point, so I am just
> replying to the main thread. I have been trying to stay out of this for
> various reasons, but I was mentioned a few times and I also think this
> is getting out of hand tbh.
Thanks for saving the discussion.
>
> First of all, I want to clarify that I am not "representing" any entity
> here, I am speaking as an upstream zswap maintainer. Obviously I have
> Google's interests in mind, but I am not representing Google here.
Ack, same here.
> Second, Chris keeps bringing up that the community picked and/or
> strongly favored the swap table approach over virtual swap back in 2023.
> I just want to make it absolutely clear that this was NOT my read of the
> room, and I do not think that the community really made a decision or
> favored any approach back then.
OK. Let's move on from that to our current discussion.
> Third, Chris, please stop trying to force this into a company vs company
> situation. You keep mentioning personal attacks, but you are making this
> personal more than anyone in this thread by taking this approach.
Let me clarify: it is absolutely not my intention to make this company
vs company, and that does not fit the situation either. Please accept
my apology for that. What I meant is that there is a group of people
sharing the same idea; it felt more like me against a whole group
(team VS). It is not about which company at all. The round-robin
N -> 1 intense arguing put me in an uncomfortable situation, feeling
excluded.
On one hand, I wish there were someone representing the group as the
main speaker; that would make the discussion feel more equal and more
inclusive. On the other hand, every perspective is important, and it
is hard to require all voices to route through a main speaker. That is
hard to execute in practice, so I give up suggesting it. I am open to
suggestions on how to make the discussion more inclusive for newcomers
to an existing established group.
> Now with all of that out of the way, I want to try to salvage the
> technical discussion here. Taking several steps back, and
Thank you for driving the discussion back to the technical side. I
really appreciate it.
> oversimplifying a bit: Chris mentioned having a frontend and backend and
> an optional redirection when a page is moved between swap backends. This
> is conceptually the same as the virtual swap proposal.
From my perspective, it is not the same as the virtual swap proposal.
There is some overlap: they both can do redirection.
But they originally aim to solve two different problems. One of the
important goals of the swap table is to allow allocating contiguous
mTHP swap entries when the remaining free space is not contiguous. For
the rest of the discussion let's call this the "continuous mTHP
allocator". It allocates contiguous swap entries out of non-contiguous
file locations.
Let's say you have a 1G swapfile, completely full with no available slots.
1) Free 4 pages at swap offsets 1, 3, 5, 7. The discontiguous free
space adds up to 16K.
2) Now try to allocate one order-2 mTHP, 16K in size.
The previous allocator cannot satisfy this request, because the 4
empty slots are not contiguous.
Here is where the redirection and the growth of the front-end swap
entries come in; this was part of the design consideration all along,
not an afterthought.
The following step allows allocating 16K of contiguous swap entries
out of offsets [1, 3, 5, 7]:
3) We grow the front-end part of the swapfile, effectively bumping up
the max size and adding a new order-2 cluster with a swap table.
That is where the split between the swap front end and the back-end
file store comes in; a sketch follows below.
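To make steps 1-3 concrete, here is a rough C sketch. All names here
are made up for illustration, they are not actual kernel APIs:

/* Sketch only: a front-end cluster carries 8-byte swap table entries
 * plus an optional array of 4-byte backend file locations. */
#define CLUSTER_NR 4                     /* order-2 cluster for this example */

struct front_cluster {
        unsigned long table[CLUSTER_NR]; /* 8-byte swap table entries */
        unsigned int  loc[CLUSTER_NR];   /* 4-byte backend file locations */
};

/* Step 3: the new cluster sits past the end of the backing file, so
 * its contiguous front-end offsets map onto the scattered free file
 * blocks (e.g. 1, 3, 5, 7). */
static void map_grown_cluster(struct front_cluster *c,
                              const unsigned int *free_blocks)
{
        int i;

        for (i = 0; i < CLUSTER_NR; i++)
                c->loc[i] = free_blocks[i];
}

The mTHP sees 4 contiguous front-end swap entries; only the 4-byte
location array knows the blocks on disk are scattered.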
BTW, please don't accuse me of copying the name "virtual swapfile". I
introduced it here on 1/8/2025, before Nhat did:
https://lore.kernel.org/linux-mm/CACePvbX76veOLK82X-_dhOAa52n0OXA1GsFf3uv9asuArpoYLw@mail.gmail.com/
==============quote==============
I think we need to have a separation of the swap cache and the backing
of IO of the swap file. I call it the "virtual swapfile".
It is virtual in two aspect:
1) There is an up front size at swap on, but no up front allocation of
the vmalloc array. The array grows as needed.
2) There is a virtual to physical swap entry mapping. The cost is 4
bytes per swap entry. But it will solve a lot of problems all
together.
==============quote ends =========
Side story:
I wanted to pass "virtual swapfile" to Kairui to propose as an LSF
topic. Coincidentally, Nhat proposed virtual swap as an LSF topic on
1/16/2025, a few days after I mentioned "virtual swapfile" in the
LSF-topic-related discussion, and right before Kairui proposed
"virtual swapfile". Kairui renamed our version to "swap table". That
is the history behind the name "swap table".
https://lore.kernel.org/linux-mm/20250116092254.204549-1-nphamcs@gmail.com/
I am sure Nhat did not see that email and came up with it
independently, coincidentally. I just want to establish that I have
prior art introducing the name "virtual swapfile" before Nhat's LSF
"virtual swap" topic. After all, it is just a name; I am just as happy
using "swap table".
To avoid confusing the reader, I will call my version of "virtual
swap" the "front end".
The front end owns the cluster and the swap table (swap cache): 8
bytes per entry. The back end only contains a file position pointer: 4
bytes per entry.
4) The back end needs a different allocator because the allocation
assumptions are different: there is no alignment requirement, it just
needs to track which block locations are available.
It needs a back-end-specific allocator that only manages the swapfile
locations that cannot be allocated from the front end, e.g. the hole
created by a redirection entry, or the new cluster added in step 3.
5) The backend location pointer is optional per cluster. A cluster
newly allocated in step 3 must have a location pointer, because its
offset is outside the backing file range.
That is 4 bytes, just like a swap entry.
This backend location pointer can be used by a solution like VS as
well. That was part of the consideration too, so it is not an
afterthought.
The allocator mentioned here is more like a file system design than a
pure memory allocator, because it needs to consider block locations so
block-level IO can be combined. A sketch of such a backend allocator
follows below.
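Illustrative sketch only (made-up names): the backend allocator just
tracks which file block locations are free, with no alignment
requirement. A real one would also keep locations grouped by adjacency
so block IO can be merged, which is why it resembles a file system
allocator.

struct backend_alloc {
        unsigned int *free_locs;        /* stack of free 4-byte locations */
        unsigned int  nr_free;
};

static int backend_get_loc(struct backend_alloc *ba, unsigned int *loc)
{
        if (!ba->nr_free)
                return -1;              /* backing file is full */
        *loc = ba->free_locs[--ba->nr_free];
        return 0;
}

static void backend_put_loc(struct backend_alloc *ba, unsigned int loc)
{
        /* e.g. a redirection or writeback freed this physical block */
        ba->free_locs[ba->nr_free++] = loc;
}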
So the mTHP allocator can do swapfile location redirection, but that
is a side benefit of a different design goal (mTHP allocation). This
physical location pointer description matches my 2024 LSF pony talk
slides; I just did not put text on that slide. So it is not an
afterthought, it dates back to the 2024 talks.
> I think the key difference here is:
> - In Chris's proposal, we start with a swap entry that represents a swap
> slot in swapfile A. If we do writeback (or swap tiering), we create
> another swap entry in swapfile B, and have the first swap entry point
Correction: instead of a swap entry in swapfile B, it is a backend
location in swapfile B, as in step 5). It is only 4 bytes. The back
end does not have a swap cache; the swap cache belongs to front end A
(8 bytes).
> to it instead of the slot in swapfile A. If we want to reuse the swap
> slot in swapfile A, we create a new swap entry that points to it.
>
> So we start with a swap entry that directly maps to a swap slot, and
Again, in my description swap slot A has a backend file location
pointer that points into swapfile B.
It is only the bottom half of swap slot B, not the full swap slot; it
does not carry the 8-byte swap entry overhead of B.
> optionally put a redirection there to point to another swap slot for
> writeback/tiering.
It points to another backend swapfile location (4 bytes), not a swap
entry.
> Everything is a swapfile, even zswap will need to be represented by a
> separate (ghost) swapfile.
It allows a ghost swapfile. I wouldn't go as far as banning the
current zswap writeback; that part is TBD. What my description enables
is memory swap tiers without actual physical file backing, i.e. the
ghost swapfile.
>
> - In the virtual swap proposal, swap entries are in a completely
> different space than swap slots. A swap entry points to an arbitrary
> swap slot (or zswap entry) from the beginning, and writeback (or
> tiering) does not change that, it only changes what is being pointed
> to.
>
> Regarding memory overhead (assuming x86_64), Chris's proposal has 8
> bytes per entry in the swap table that is used to hold both the swap
> count as well as the swapcache or shadow entry. Nhat's RFC for virtual
Ack
> swap had 48 bytes of overhead, but that's a PoC of a specific
> implementation.
Ack.
> Disregarding any specific implementation, any space optimizations that
> can be applied to the swap table (e.g. combining swap count and
> swapcache in an 8 byte field) can also be applied to virtual swap. The
> only *real* difference is that with virtual swap we need to store the
> swap slot (or zswap entry), while for the current swap table proposal it
> is implied by the index of the entry. That's an additional 8 bytes.
No, VS has a smaller design scope. VS does not enable "continuous mTHP
allocation". At least, that is not mentioned in any previous VS
material.
> So I think a fully optimized implementation of virtual swap could end up
> with an overhead of 16 bytes per-entry. Everything else (locks,
> rcu_head, etc) can probably be optimized away by using similar
> optimizations as the swap table (e.g. do locking and alloc/freeing in
With the continuous mTHP allocator mentioned above, the swap table
already has everything VS needs.
I am not sure we still need VS if we have the "continuous mTHP
allocator"; that is TBD.
Yes, VS can reuse the physical location pointer from the "continuous
mTHP allocator".
The overhead for the swap table with redirection is 12 bytes (an
8-byte front-end entry plus a 4-byte backend location pointer), not
16 bytes.
> batches). In fact, I think we can use the swap table as the allocator in
> the virtual swap space, reusing all the locking and allocation
That has been my feeling all along: let the swap table manage that.
> optimizations. The difference would be that the swap table is indexed by
> the virtual swap ID rather than the swap slot index.
In the "continous mTHP allocator" it is just physical location pointer,
> Another important aspect here, in the simple case the swap table does
> have lower overhead than virtual swap (8 bytes vs 16 bytes). Although
> the difference isn't large to begin with, I don't think it's always the
> case. I think this is only true for the simple case of having a swapped
> out page on a disk swapfile or in a zswap (ghost) swapfile.
Please redo your evaluation after reading the "continuous mTHP
allocator" description above.
> Once a page is written back from zswap to disk swapfile, in the swap
> table approach we'll have two swap table entries. One in the ghost
No, just one entry, with a backend location pointer (12 bytes).
> swapfile (with a redirection), and one in the disk swapfile. That's 16
> bytes, equal to the overhead of virtual swap.
Again, 12 bytes using the "continuous mTHP allocator" framework.
> Now imagine a scenario where we have zswap, SSD, and HDD swapfiles with
> tiering. If a page goes to zswap, then SSD, then HDD, we'll end up with
> 3 swap table entries for a single swapped out page. That's 24 bytes. So
> the memory overhead is not really constant, it scales with the number of
> tiers (as opposed to virtual swap).
Nope. There is only one front-end swap entry and it remains the same;
every time the page is written to a different tier, only the backend
physical location pointer is updated.
It always points to the final physical location. Only 12 bytes total.
You are paying 24 bytes because you don't have the front end vs back
end split. Your redirection includes the 8-byte front end as well, and
because you include the front end, you now need to relay-forward
through every tier.
That is the benefit of splitting the swapfile into a front end and a
back end: it makes the design more like a file system. A sketch
follows below.
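In sketch form (reusing the made-up names from the earlier sketches),
moving a page between tiers rewrites only the 4-byte location, while
the 8-byte front-end entry stays put:

/* Sketch only: move a page's data to a new tier. The front-end swap
 * entry (swap count, swap cache / shadow) in c->table[i] never moves,
 * so there is no relay-forward chain; the overhead stays at
 * 8 + 4 = 12 bytes per page no matter how many tiers it crosses. */
static int move_to_tier(struct front_cluster *c, int i,
                        struct backend_alloc *old_tier,
                        struct backend_alloc *new_tier)
{
        unsigned int new_loc;

        if (backend_get_loc(new_tier, &new_loc))
                return -1;                      /* destination tier full */

        backend_put_loc(old_tier, c->loc[i]);   /* free old physical block */
        c->loc[i] = new_loc;                    /* repoint the back end only */
        return 0;
}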
> Another scenario is where we have SSD and HDD swapfiles with tiering. If
> a page starts in SSD and goes to HDD, we'll have two swap table entries
> for it (as above). The SSD entry would be wasted (has a redirection),
> but Chris mentioned that we can fix this by allocating another frontend
> cluster that points at the same SSD slot. How does this fit in the
It is not a fix; it was in the design consideration all along. When
the redirection happens, the underlying physical block location is
added to the backend allocator. The backend allocator manages exactly
those locations that cannot be allocated from the front end.
> 8-byte swap table entry tho? The 8-bytes can only hold the swapcache or
> shadow (and swapcount), but not the swap slot. For the current
> implementation, the slot is implied by the swap table index, but if we
> have separate front end swap tables, then we'll also need to store the
> actual slot.
Please read the above description of the front end and back end split,
then ask your question again. The "continuous mTHP allocator"
description above should answer it.
> We can workaround this by having different types of clusters and swap
> tables, where "virtual" clusters have 16 bytes instead of 8 bytes per
> entry for that, sure.. but at that point we're at significantly more
> complexity to end up where virtual swap would have put us.
No, that further complicates things; please don't go there. The front
end and back end location split is designed to simplify exactly this
kind of situation. It is conceptually much cleaner as well.
>
> Chris, Johannes, Nhat -- please correct me if I am wrong here or if I
> missed something. I think the current swap table work by Kairui is
Yes, see the above explanation of the "continuous mTHP allocator".
> great, and we can reuse it for virtual swap (as I mentioned above). But
> I don't think forcing everything to use a swapfile and extending swap
> tables to support indirections and frontend/backend split is the way to
> go (for the reasons described above).
IMHO, it is the way to go once you consider mTHP allocation. You have
different assumptions than the ones in my design; I corrected your
description as much as I could above. I am interested in your opinion
after you read the above description of the "continuous mTHP
allocator", which matches the 2024 LSF talk slide about the swap cache
redirecting physical locations.
Chris