Message-ID: <CACePvbVGdx4YN=TCKuLF9TzCQg95OW8rnWLRrKENS42xc6q9cA@mail.gmail.com>
Date: Thu, 4 Dec 2025 14:11:57 +0400
From: Chris Li <chrisl@...nel.org>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Kairui Song <kasong@...cent.com>,
Kemeng Shi <shikemeng@...weicloud.com>, Nhat Pham <nphamcs@...il.com>,
Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>, Johannes Weiner <hannes@...xchg.org>,
Chengming Zhou <chengming.zhou@...ux.dev>, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
pratmal@...gle.com, sweettea@...gle.com, gthelen@...gle.com,
weixugc@...gle.com
Subject: Re: [PATCH RFC] mm: ghost swapfile support for zswap
On Thu, Dec 4, 2025 at 10:16 AM Yosry Ahmed <yosry.ahmed@...ux.dev> wrote:
> > On one hand I wish there was someone representing the group as the
> > main speaker, that would make the discussion feel more equal, more
> > inclusive. On the other hand, any perspective is important, it is hard
> > to require the voice to route through the main speaker. It is hard to
> > execute in practice. So I give up suggesting that. I am open for
> > suggestions on how to make the discussion more inclusive for newcomers
> > to the existing established group.
>
> Every person is expressing their own opinion, I don't think there's a
> way to change that or have a "representative" of each opinion. In fact,
> changing that would be the opposite of inclusive.
Ack, that is why I did not suggest a main-speaker token approach. On
the other hand, there are still some considerations the group side can
take so as not to overwhelm the single person, e.g. holding back when a
similar opinion has already been expressed and is awaiting a response.
N-vs-1 arguing does put the single person at an unfair disadvantage and
alienates them. We should consider that effect.
OK, enough said on this; let's move on.
> > > Now with all of that out of the way, I want to try to salvage the
> > > technical discussion here. Taking several steps back, and
> >
> > Thank you for driving the discussion back to the technical side. I
> > really appreciate it.
> >
> > > oversimplifying a bit: Chris mentioned having a frontend and backend and
> > > an optional redirection when a page is moved between swap backends. This
> > > is conceptually the same as the virtual swap proposal.
> >
> > In my perspective, it is not the same as the virtual swap proposal.
> > There is some overlap; they both can do redirection.
> >
> > But they originally aim to solve two different problems. One of the
> > important goals of the swap table is to allow allocating a contiguous
> > mTHP swap entry even when the remaining free space is not contiguous.
> > For the rest of the discussion we call it the "contiguous mTHP
> > allocator". It allocates contiguous swap entries out of
> > non-contiguous file locations.
> >
> > Let's say you have a 1G swapfile, completely full with no available slots.
> > 1) Free 4 pages at swap offsets 1, 3, 5, 7. The discontiguous free
> > space adds up to 16K.
> > 2) Now allocate one mTHP of order 2, 16K in size.
> > The previous allocator cannot satisfy this request, because the 4
> > empty slots are not contiguous.
> > Here is where the redirection and the growth of the front swap entry
> > space come in; they have been part of the design all along, not an
> > afterthought. The following step allows allocating 16K of contiguous
> > swap entries out of offsets [1, 3, 5, 7]:
> > 3) We grow the front-end part of the swapfile, effectively bumping up
> > the max size, and add a new cluster of order 2, with a swap table.
> > That is where the front end of the swap and the back-end file store come in.
>
> There's no reason why we cannot do the same with virtual swap. Even if
> it wasn't the main motivation, I don't see why we can't achieve the same
> result.
Yes, they can, by largely copying the swap table approach to achieve
the same result. Before I pointed out the importance of per-slot memory
overhead, the 48-byte entry was not production quality. VS had not made
real progress toward shrinking the per-slot memory usage to a similar
level, not even close. That is, until you proposed using the earlier
stage of the swap table to compete with the later stage of the swap
table, by using the exact same approach as the later stage. Please
don't use swap table ideas to make a knockoff clone of the swap table
and take the final credit. That is not decent, and I don't think it
matches the upstream spirit either. Please respect the originality of
the idea and give credit where it is due; after all, that is what the
academic system is built on.
> > BTW, please don't accuse me of copycatting the name "virtual swapfile".
> > I introduced it here on 1/8/2025, before Nhat did:
>
> I don't think anyone cares about the actual names, or accused anyone of
> copycatting anything.
Repeated projections have been cast on me as the "afterthought"; I want
the people who call me the "afterthought" to acknowledge that I am the
"leading thought", the "original thought". Just joking.
> > https://lore.kernel.org/linux-mm/CACePvbX76veOLK82X-_dhOAa52n0OXA1GsFf3uv9asuArpoYLw@mail.gmail.com/
> > ==============quote==============
> > I think we need to have a separation of the swap cache and the backing
> > of IO of the swap file. I call it the "virtual swapfile".
> > It is virtual in two aspects:
> > 1) There is an up front size at swap on, but no up front allocation of
> > the vmalloc array. The array grows as needed.
> > 2) There is a virtual to physical swap entry mapping. The cost is 4
> > bytes per swap entry. But it will solve a lot of problems all
> > together.
> > ==============quote ends =========
The above prior write-up nicely sums up the main idea behind VS,
wouldn't you agree?
I want to give Nhat the benefit of the doubt that he did not commit
plagiarism. Since VS has now changed strategy to clone swap tables
against swap tables, I would add: please be decent and be
collaborative. Respect the originality of the ideas. If this were an
academic context, where emails sent to the list count as paper
submissions, the VS paper would definitely get dinged for not properly
citing the prior "virtual swapfile" write-up above.
So far team VS has not participated much in swap table development.
There are a few Acks from Nhat, but not really any discussion showing
insight into the swap table. Now VS wants to clone the swap table
against the swap table. Why not just join team swap table? Really take
part in the review of swap table phase N, not just rubber-stamping.
Please be collaborative, be decent, and do it the proper upstream way.
> > Correction: instead of swapfile B, it is the backend location in
> > swapfile B in step 5). It is only 4 bytes. The back end does not have
> > a swap cache; the swap cache belongs to front end A (8 bytes).
>
> Ack.
Thanks for the Ack.
> > Again, in my description swap slot A has a file backend location
> > pointer that points to swapfile B.
> > It is only the bottom half of swap slot B, not the full swap slot. It
> > does not have the 8-byte swap entry overhead of B.
>
> Ack.
Thanks for the Ack.
> > It points to another swapfile location backend, not a swap entry (4 bytes).
>
> Ack.
Thanks for the Ack.
> > > Disregarding any specific implementation, any space optimizations that
> > > can be applied to the swap table (e.g. combining swap count and
> > > swapcache in an 8 byte field) can also be applied to virtual swap. The
> > > only *real* difference is that with virtual swap we need to store the
> > > swap slot (or zswap entry), while for the current swap table proposal it
> > > is implied by the index of the entry. That's an additional 8 bytes.
> >
> > No, VS has a smaller design scope. VS does not enable "contiguous
> > mTHP allocation". At least that is not mentioned in any previous VS
> > material.
>
> Why not? Even if it wasn't specifically called out as part of the
> motivation, it still achieves that. What we need for the mTHP swap is to
> have a redirection layer. Both virtual swap or the front-end/back-end
> design achieve that.
Using your magic against you, that is what I would call the
"afterthought" of the century. Just joking.
Yes, you can do that, by cloning swap tables against swap tables. It is
just not considered decent in my book. Please be collaborative. I have
now demonstrated that the swap table side is the one with most of the
original ideas and the more advanced technical designs. Please let team
swap table finish what they originally planned; don't steal the thunder
at the final glory. If team VS wants to help speed up the process,
since priority is one of VS's main considerations and the design has
been converging to swap tables, please help review the swap table
landing-phase submissions. Crawl, walk, run. Even if you want to use
the swap table against the swap table, reviewing the landing swap table
code is a good way to understand swap tables. Let team swap tables
finish the original goal. Once swap tables have the contiguous mTHP
allocator, we can examine whether any other VS feature can be added on
top of that.
> > With the contiguous mTHP allocator mentioned above, it already has
> > everything VS needs.
> > I am not sure we still need VS if we have the "contiguous mTHP
> > allocator"; that is TBD.
>
> As I mentioned above, I think the front-end/back-end swap tables and
> virtual swap are conceptually very similar. The more we discuss this the
Of course they are very similar; for all we know, it is possible they
come from the same source.
https://lore.kernel.org/linux-mm/CACePvbX76veOLK82X-_dhOAa52n0OXA1GsFf3uv9asuArpoYLw@mail.gmail.com/
> more I am convinced about this tbh. In both cases we provide an
> indirection layer such that we can change the backend or backing
> swapfile without updating the page tables, and allow thing like mTHP
> swap without having contiguous slots in the swapfile.
>
> >
> > Yes, VS can reuse the physical location pointer from the "contiguous mTHP allocator".
> >
> > The overhead for the above swap table with redirection is 12 bytes, not 16 bytes.
>
> Honestly, if it boils down to 4 bytes per page, I think that's a really
> small difference.
A 4-bytes-per-slot difference is leaving free memory on the table. Why
not grab it?
Do you know that swap table phases II..IV exist just to save 3 bytes
per slot (and clean up the code in the process)?
4 bytes out of a total of 8 or 12 bytes is a 33% - 50% difference in
per-slot usage.
> Especially that it doesn't apply to all cases (e.g.
> not the zswap-only case that Google currently uses).
I want to ask a clarifying question here. My understanding is that VS
is always on.
If we are doing zswap-only, does VS still have the 8 + 4 = 12 bytes of
overhead?
I want to make sure that if we are not using the redirection, as in the
zswap-only case, we don't pay the price for it.
Again, that is more free money on the table.
> > > batches). In fact, I think we can use the swap table as the allocator in
> > > the virtual swap space, reusing all the locking and allocation
Yes, you can. Is there a technical reason to do so? If not, why steal
the thunder at the final glory? Why not let swap tables finish their
course?
> > In the "contiguous mTHP allocator" it is just a physical location pointer,
> >
> > > Another important aspect here, in the simple case the swap table does
> > > have lower overhead than virtual swap (8 bytes vs 16 bytes). Although
> > > the difference isn't large to begin with, I don't think it's always the
> > > case. I think this is only true for the simple case of having a swapped
> > > out page on a disk swapfile or in a zswap (ghost) swapfile.
> >
> > Please redo your evaluation after reading the above "contiguous mTHP allocator".
>
> I did, and if anything I am more convinced that the designs are
> conceptually close. The main difference is that the virtual swap
> approach is more flexible in my opinion because the backend doesn't have
> to be a swapfile, and we don't need "ghost" to use zswap and manage it
> like a swapfile.
It seems the design has converged to the swap table side. Even the
"virtual swapfile" concept could have come from the swap table side.
I'm flattered; copying is the best compliment from a competitor.
Now that we have settled the big design, the remaining design
differences are very small.
Let's discuss the VS virtual swap interface without an actual swapfile.
One question:
Does the VS virtual swapfile expose any swapfile interface that can be
referenced by swapon/off? I assume not; please correct me if it does.
I think it could have downsides.
1) It is not compatible with the normal /etc/fstab design. Now you need
a separate init script to enable and disable VS.
2) It does not go through the swapon/off path. That creates
complications. As we know, a lot of bugs have been exposed in
swapon/off; it is a very tricky business to get right. I would
recommend staying away from cloning a separate path for swapon/off. VS
introduces a new kernel interface that also needs to be maintained.
3) The customer can't round-robin swapfiles. As we know, some companies
use multiple swapfiles to reduce si->lock contention, if I recall
correctly 8 swapfiles. Forcing one virtual swapfile forces everything
through the same si->lock, which has performance penalties.
4) Having one overall virtual swapfile imposes a design challenge in
the swap.tiers world. Because it does not have a swapfile, the swapfile
priority does not apply.
5) Keep it simple. Using your magic against you, the ghost swapfile can
conceptually do whatever VS can conceptually do as well. You can
consider the ghost swapfile header just a config file for VS to set up
the swapfile. It saves the extra init script imposed on users.
BTW, I will probably rename "ghost swapfile" back to "virtual swapfile"
in the code, as I earned that term's priority date. And you don't mind
what it is really called.
> > Again, 12 bytes using the "contiguous mTHP allocator" framework.
>
> Ack.
Thanks for the Ack.
>
> >
> > > Now imagine a scenario where we have zswap, SSD, and HDD swapfiles with
> > > tiering. If a page goes to zswap, then SSD, then HDD, we'll end up with
> > > 3 swap table entries for a single swapped out page. That's 24 bytes. So
> > > the memory overhead is not really constant, it scales with the number of
> > > tiers (as opposed to virtual swap).
> >
> > Nope. Only the one front swap entry remains the same; every time the
> > page is written to a different tier, only the back-end physical
> > location pointer is updated.
> > It always points to the final physical location. Only 12 bytes total.
>
> Ack.
Thanks for the Ack. That confirms the swap table side has had the more
advanced technical design all along.
> > Please read the above description regarding the front end and back end
> > split then ask your question again. The "continuous mTHP allocator"
> > above should answer your question.
>
> Yeah, the 8 bytes front-end and 4-bytes backend answer this.
Ack
> > > We can workaround this by having different types of clusters and swap
> > > tables, where "virtual" clusters have 16 bytes instead of 8 bytes per
> > > entry for that, sure.. but at that point we're at significantly more
> > > complexity to end up where virtual swap would have put us.
> >
> > No, that further complicates things. Please don't go there. The
> > front-end and back-end location split is designed to simplify
> > situations like this. It is conceptually much cleaner as well.
>
> Yeah that was mostly hypothetical.
Ack.
>
> >
> > >
> > > Chris, Johannes, Nhat -- please correct me if I am wrong here or if I
> > > missed something. I think the current swap table work by Kairui is
> >
> > Yes, see the above explanation of the "continuous mTHP allocator".
> >
> > > great, and we can reuse it for virtual swap (as I mentioned above). But
> > > I don't think forcing everything to use a swapfile and extending swap
> > > tables to support indirections and frontend/backend split is the way to
> > > go (for the reasons described above).
> >
> > IMHO, it is the way to go if you consider mTHP allocation. You had
> > different assumptions than mine about my design; I corrected your
> > description as much as I could above. I am interested in your opinion
> > after you read the above description of the "contiguous mTHP
> > allocator", which matches the 2024 LSF talk slide about the swap
> > cache redirecting physical locations.
>
> As I mentioned, I am still very much convinced the designs are
> conceptually very similar and the main difference is whether the
> "backend" is 4 bytes and points at a slot in a swapfile, or a generic
> 8-byte pointer.
Thanks; as I said earlier, I am flattered.
Of course it is conceptually very close after you copy all the internal
design elements of the swap table approach.
> FWIW, we can use 4 bytes in virtual swap as well if we leave the xarray
> in zswap. 4 bytes is plenty of space for an index into the zswap xarray
> if we no longer use the swap offset. But if we use 8 bytes we can
> actually get rid of the zswap xarray, by merging it with the virtual
> swap xarray, or even stop using xarrays completely if we adopt the
> current swap table allocator for the virtual swap indexes.
>
> As Nhat mentioned earlier, I suspect we'll end up not using any extra
> overhead at all for the zswap-only case, or even reducing the current
> overhead.
In my design there is no extra xarray for zswap; you will just have to
take my word for it for now. That is very late in the game; finish the
swap table glory first.
Yosry, thank you for driving a good technical discussion. I really
enjoyed it.
I wish the beginning of the discussion had gone down this path instead.
The multiple NACK-first, ask-questions-later responses and the
condescending tone at the beginning of the discussion really upset me.
Facing four people alone in intense round-robin arguing didn't help
either. It made me feel I was not welcome. I am short-tempered and
easily triggered; I am sorry for my behavior as well. Just give me a
few moments and I will come to my senses.
The ironic part of the discussion is that the "dead end" is the one
being converged to. The "afterthought" turned out to be the "leading
thought". Let that be a lesson for everyone, me included: be nice to
the people who hold different ideas than yourself.
Looking forward to more discussion like this.
Chris