Message-ID: <CACePvbXLdf9bC3Juj=wA4_TKeH6XUgHbbvwV5HhNQtgoT3CiBg@mail.gmail.com>
Date: Mon, 24 Nov 2025 21:24:18 +0300
From: Chris Li <chrisl@...nel.org>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Kairui Song <kasong@...cent.com>, 
	Kemeng Shi <shikemeng@...weicloud.com>, Nhat Pham <nphamcs@...il.com>, 
	Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>, Yosry Ahmed <yosry.ahmed@...ux.dev>, 
	Chengming Zhou <chengming.zhou@...ux.dev>, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	pratmal@...gle.com, sweettea@...gle.com, gthelen@...gle.com, 
	weixugc@...gle.com
Subject: Re: [PATCH RFC] mm: ghost swapfile support for zswap

On Mon, Nov 24, 2025 at 8:27 PM Johannes Weiner <hannes@...xchg.org> wrote:
>
> On Fri, Nov 21, 2025 at 05:52:09PM -0800, Chris Li wrote:
> > On Fri, Nov 21, 2025 at 3:40 AM Johannes Weiner <hannes@...xchg.org> wrote:
> > >
> > > On Fri, Nov 21, 2025 at 01:31:43AM -0800, Chris Li wrote:
> > > > Currently, zswap requires a backing swapfile. A swap slot used
> > > > by zswap cannot be used by the swapfile, which wastes swapfile
> > > > space.
> > > >
> > > > A ghost swapfile is a swapfile that contains only the swapfile header,
> > > > for use by zswap. The swapfile header indicates the size of the
> > > > swapfile. There is no swap data section in a ghost swapfile, and
> > > > therefore no wasted swapfile space. As such, any write to a ghost
> > > > swapfile will fail. To prevent accidental reads or writes of a ghost
> > > > swapfile, the bdev of its swap_info_struct is set to NULL. A ghost
> > > > swapfile also sets the SSD flag, because there is no rotating disk
> > > > access when using zswap.
> > >
> > > Zswap is primarily a compressed cache for real swap on secondary
> > > storage. It's indeed quite important that entries currently in zswap
> > > don't occupy disk slots; but for a solution to this to be acceptable,
> > > it has to work with the primary usecase and support disk writeback.
> >
> > Well, my plan is to support the writeback via swap.tiers.
>
> Do you have a link to that proposal?

My 2024 LSF swap pony talk already describes a mechanism to redirect
page cache swap entries to different physical locations. The same
mechanism can also redirect swap entries across different swapfiles.

https://lore.kernel.org/linux-mm/CANeU7QnPsTouKxdK2QO8Opho6dh1qMGTox2e5kFOV8jKoEJwig@mail.gmail.com/

> My understanding of swap tiers was about grouping different swapfiles
> and assigning them to cgroups. The issue with writeback is relocating
> the data that a swp_entry_t page table refers to - without having to
> find and update all the possible page tables. I'm not sure how
> swap.tiers solves this problem.

swap.tiers is part of the picture. You are right that the LPC topic
mostly covers the per-cgroup portion. The VFS-like swap ops were two
slides of my LPC 2023 talk: you read from one swapfile and write to
another swapfile with a newly allocated swap entry. A rough sketch of
that flow is below.
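
To make that concrete, here is an illustrative sketch of the flow in
C. All of the names (swap_tier_ops, swap_alloc_slot, swap_redirect,
and so on) are hypothetical, not existing kernel API:

/* Illustrative only: a hypothetical sketch of "read from one
 * swapfile, write to another with a newly allocated entry".
 * None of these names are existing kernel API. */
struct swap_tier;

struct swap_tier_ops {
	int (*read_slot)(struct swap_tier *tier, unsigned long slot,
			 void *page);
	int (*write_slot)(struct swap_tier *tier, unsigned long slot,
			  void *page);
};

struct swap_tier {
	const struct swap_tier_ops *ops;
};

/* Hypothetical helpers: slot allocation and the redirection update. */
unsigned long swap_alloc_slot(struct swap_tier *tier);
void swap_free_slot(struct swap_tier *tier, unsigned long slot);
void swap_redirect(struct swap_tier *src, unsigned long src_slot,
		   struct swap_tier *dst, unsigned long dst_slot);

/* Write one entry back from a fast tier (e.g. zswap) to a slower
 * one. Page tables keep the original swap entry; only the
 * redirection record changes, so no page table walk is needed. */
static int swap_writeback_one(struct swap_tier *src, unsigned long src_slot,
			      struct swap_tier *dst, void *page)
{
	unsigned long dst_slot;
	int err;

	err = src->ops->read_slot(src, src_slot, page);
	if (err)
		return err;

	dst_slot = swap_alloc_slot(dst);	/* new physical slot */
	err = dst->ops->write_slot(dst, dst_slot, page);
	if (err)
		return err;

	swap_redirect(src, src_slot, dst, dst_slot);
	swap_free_slot(src, src_slot);
	return 0;
}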

> > > This direction is a dead-end. Please take a look at Nhat's swap
> > > virtualization patches. They decouple zswap from disk geometry, while
> > > still supporting writeback to an actual backend file.
> >
> > Yes, there are many ways to decouple zswap from disk geometry; my swap
> > table + swap.tiers design can do that as well. I have concerns about
> > swap virtualization regarding the additional per-swap-entry memory
> > overhead and the CPU overhead of an extra xarray lookup. I believe my
> > approach is technically superior: both faster and cleaner. Basically
> > swap.tiers plus VFS-like swap read/write page ops. I will let Nhat
> > clarify the performance and memory overhead side of swap
> > virtualization.
>
> I'm happy to discuss it.
>
> But keep in mind that the swap virtualization idea is a collaborative
> product of quite a few people with an extensive combined upstream
> record. Quite a bit of thought has gone into balancing static vs
> runtime costs of that proposal. So you'll forgive me if I'm a bit
> skeptical of the somewhat grandiose claims of one person that is new
> to upstream development.

Collaborating with which companies' developers? How many swap
virtualization patches have landed in the kernel? I am also
collaborating with different developers: the cluster-based swap
allocator, swap table phase I, and the removal of the NUMA node
swapfile priority were all suggested by me.

> As to your specific points - we use xarray lookups in the page cache
> fast path. It's a bold claim to say this would be too much overhead
> during swapins.

Yes, we just got rid of the xarray in the swap cache lookup and saw
some performance gain from it. You are saying one extra xarray lookup
is no problem; can your team demonstrate the performance impact of
the extra xarray lookup in VS? Just run some swap benchmarks and
share the results; a sketch of such a benchmark follows below.
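
For what it's worth, a minimal userspace microbenchmark along those
lines could look like the sketch below. This is my own illustration,
not code from either proposal; it assumes it runs inside a cgroup
with zswap enabled and memory.high set below BUF_SIZE, so the second
pass has to fault pages back in from zswap:

#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#define BUF_SIZE (4UL << 30)	/* 4 GiB; pick something > memory.high */
#define PAGE	 4096UL

int main(void)
{
	unsigned char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timespec t0, t1;
	volatile unsigned long sum = 0;
	size_t i;

	if (buf == MAP_FAILED)
		return 1;

	/* Pass 1: dirty every page; under memory.high pressure the
	 * earlier pages are reclaimed to (z)swap as we go. */
	for (i = 0; i < BUF_SIZE; i += PAGE)
		buf[i] = (unsigned char)i;

	/* Pass 2: touch every page again, forcing swap-ins; time it. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < BUF_SIZE; i += PAGE)
		sum += buf[i];
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
		    (t1.tv_nsec - t0.tv_nsec);
	printf("avg swap-in cost: %.0f ns/page\n", ns / (BUF_SIZE / PAGE));
	return 0;
}

Running the same binary on kernels with and without the extra lookup
in the swap-in path would show the per-fault delta directly.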

We can do a test right now, without writeback to another SSD: compare
the ghost swapfile with VS for the zswap-only case.
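
For reference, a ghost swapfile as described in the RFC is just the
standard swap header page with no data area behind it. Here is a
hedged userspace sketch of generating one, following the upstream
swap header layout (version, last_page, and nr_badpages at byte
offset 1024, the SWAPSPACE2 magic in the last 10 bytes of the first
page). Note that an unpatched kernel will refuse to swapon a file
shorter than its header claims, which is exactly the restriction the
RFC relaxes:

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define PAGE   4096
#define NPAGES (1UL << 20)	/* advertise 4 GiB worth of slots */

int main(void)
{
	unsigned char page[PAGE] = { 0 };
	uint32_t *info = (uint32_t *)(page + 1024);
	int fd;

	info[0] = 1;		/* version */
	info[1] = NPAGES - 1;	/* last_page */
	info[2] = 0;		/* nr_badpages */
	memcpy(page + PAGE - 10, "SWAPSPACE2", 10);

	fd = open("ghost.swap", O_CREAT | O_TRUNC | O_WRONLY, 0600);
	if (fd < 0)
		return 1;
	if (write(fd, page, PAGE) != PAGE)
		return 1;
	close(fd);
	return 0;
}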

> Two, it's not clear to me how you want to make writeback efficient
> *without* any sort of swap entry redirection. Walking all relevant
> page tables is expensive; and you have to be able to find them first.

The swap cache can have a physical location redirection; see my 2024
LPC slides. I had considered that approach well before the VS
discussion.
https://lore.kernel.org/linux-mm/CANeU7QnPsTouKxdK2QO8Opho6dh1qMGTox2e5kFOV8jKoEJwig@mail.gmail.com/
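
In sketch form (hypothetical names, not kernel API): the swap table
keeps a per-slot record of where the data currently lives, so the
swp_entry_t held in page tables never changes, and resolving it costs
one flat array lookup rather than an extra xarray walk:

struct swap_slot_loc {
	int backend;		/* e.g. 0 = zswap, 1 = a disk tier */
	unsigned long offset;	/* physical slot within that backend */
};

struct swap_table {
	struct swap_slot_loc *loc;	/* indexed by slot in swp_entry_t */
};

/* Swap-in path: one lookup resolves the current physical location. */
static inline struct swap_slot_loc *
swap_slot_resolve(struct swap_table *table, unsigned long slot)
{
	return &table->loc[slot];
}

/* Writeback path: rebind the slot to its new location. No page table
 * walk; the swap entry in the page tables stays valid as-is. */
static inline void
swap_slot_rebind(struct swap_table *table, unsigned long slot,
		 int backend, unsigned long phys_off)
{
	table->loc[slot].backend = backend;
	table->loc[slot].offset = phys_off;
}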

> If you're talking about a redirection array as opposed to a tree -
> static sizing of the compressed space is also a no-go. Zswap
> utilization varies *widely* between workloads and different workload
> combinations. Further, zswap consumes the same fungible resource as
> uncompressed memory - there is really no excuse to burden users with
> static sizing questions about this pool.

I do see the swap table + swap.tiers + swap ops doing better. We can
test the memory-only case right now. A head-to-head test of VS
against swap.tiers on the writeback case will need to wait a bit,
since the swap table is only in phase II of review.

To be clear, by overhead I mean CPU overhead and per-swap-entry
memory overhead.

I care less about whose idea it is; I care more about the end result
in performance (memory and CPU). I want the best idea/implementation
to win.

Chris
