Message-ID: <2a8fd7bd35939b9aa4a7267c93e1fda995137966@linux.dev>
Date: Mon, 24 Nov 2025 19:32:46 +0000
From: "Yosry Ahmed" <yosry.ahmed@...ux.dev>
To: "Johannes Weiner" <hannes@...xchg.org>
Cc: "Chris Li" <chrisl@...nel.org>, "Andrew Morton"
<akpm@...ux-foundation.org>, "Kairui Song" <kasong@...cent.com>, "Kemeng
Shi" <shikemeng@...weicloud.com>, "Nhat Pham" <nphamcs@...il.com>,
"Baoquan He" <bhe@...hat.com>, "Barry Song" <baohua@...nel.org>,
"Chengming Zhou" <chengming.zhou@...ux.dev>, linux-mm@...ck.org, "Rik van
Riel" <riel@...riel.com>, linux-kernel@...r.kernel.org,
pratmal@...gle.com, sweettea@...gle.com, gthelen@...gle.com,
weixugc@...gle.com
Subject: Re: [PATCH RFC] mm: ghost swapfile support for zswap
On Mon, Nov 24, 2025 at 12:27:17PM -0500, Johannes Weiner wrote:
> On Fri, Nov 21, 2025 at 05:52:09PM -0800, Chris Li wrote:
> > On Fri, Nov 21, 2025 at 3:40 AM Johannes Weiner <hannes@...xchg.org> wrote:
> > >
> > > On Fri, Nov 21, 2025 at 01:31:43AM -0800, Chris Li wrote:
> > > > The current zswap requires a backing swapfile. The swap slot used
> > > > by zswap cannot be used by the swapfile. That wastes swapfile
> > > > space.
> > > >
> > > > The ghost swapfile is a swapfile that contains only the swapfile
> > > > header, for zswap. The swapfile header indicates the size of the
> > > > swapfile. There is no swap data section in the ghost swapfile, and
> > > > therefore no waste of swapfile space. As such, any write to a ghost
> > > > swapfile will fail. To prevent accidental reads or writes of a ghost
> > > > swapfile, the bdev of swap_info_struct is set to NULL. A ghost
> > > > swapfile also sets the SSD flag because there is no rotating-disk
> > > > access when using zswap.
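
(Concretely, this is how I read the mechanism; swap_is_ghost() below is
a made-up helper for illustration, not code from the patch:)

	#include <linux/swap.h>

	/*
	 * Sketch: at swapon time the patch clears the backing device
	 * and claims SSD behavior, roughly:
	 *
	 *	si->bdev = NULL;
	 *	si->flags |= SWP_SOLIDSTATE;
	 *
	 * so the real I/O paths can recognize a ghost swapfile and
	 * reject reads and writes.
	 */
	static inline bool swap_is_ghost(struct swap_info_struct *si)
	{
		return si->bdev == NULL;
	}
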
> > >
> > > Zswap is primarily a compressed cache for real swap on secondary
> > > storage. It's indeed quite important that entries currently in zswap
> > > don't occupy disk slots; but for a solution to this to be acceptable,
> > > it has to work with the primary usecase and support disk writeback.
> >
> > Well, my plan is to support writeback via swap.tiers.
>
> Do you have a link to that proposal?
>
> My understanding of swap tiers was about grouping different swapfiles
> and assigning them to cgroups. The issue with writeback is relocating
> the data that a swp_entry_t in a page table refers to - without having
> to find and update all the possible page tables. I'm not sure how
> swap.tiers solves this problem.
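
(To make the relocation problem concrete - print_swap_location() is
just an illustration, but pte_to_swp_entry(), swp_type(), and
swp_offset() are the real helpers from include/linux/swapops.h:)

	#include <linux/printk.h>
	#include <linux/swapops.h>

	/*
	 * A swapped-out PTE encodes its swap location (type, offset)
	 * directly. Moving the data to a different slot invalidates
	 * every PTE holding the old pair, so writeback needs either an
	 * expensive page-table walk or a level of redirection.
	 */
	static void print_swap_location(pte_t pte)
	{
		swp_entry_t entry = pte_to_swp_entry(pte);

		pr_info("swapfile %u, slot %lu\n",
			swp_type(entry), swp_offset(entry));
	}
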
>
> > > This direction is a dead-end. Please take a look at Nhat's swap
> > > virtualization patches. They decouple zswap from disk geometry, while
> > > still supporting writeback to an actual backend file.
> >
> > Yes, there are many ways to decouple zswap from disk geometry; my swap
> > table + swap.tiers design can do that as well. I have concerns about
> > swap virtualization adding another layer of per-swap-entry memory
> > overhead, plus the CPU overhead of an extra xarray lookup. I believe
> > my approach is technically superior: both faster and cleaner.
> > Basically swap.tiers + VFS-like swap read/write page ops. I will let
> > Nhat clarify the performance and memory overhead side of the swap
> > virtualization.
>
> I'm happy to discuss it.
>
> But keep in mind that the swap virtualization idea is a collaborative
> product of quite a few people with an extensive combined upstream
> record. Quite a bit of thought has gone into balancing static vs
> runtime costs of that proposal. So you'll forgive me if I'm a bit
> skeptical of the somewhat grandiose claims of one person who is new
> to upstream development.
>
> As to your specific points - we use xarray lookups in the page cache
> fast path. It's a bold claim to say this would be too much overhead
> during swapins.
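
(For reference, the page cache fast path mentioned above is essentially
a single xa_load(); this ignores RCU locking and shadow entries, cf.
filemap_get_entry() in mm/filemap.c:)

	#include <linux/pagemap.h>

	/* Simplified page cache lookup: one xarray load per read. */
	static struct folio *cache_lookup(struct address_space *mapping,
					  pgoff_t index)
	{
		return xa_load(&mapping->i_pages, index);
	}
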
>
> Two, it's not clear to me how you want to make writeback efficient
> *without* any sort of swap entry redirection. Walking all relevant
> page tables is expensive; and you have to be able to find them first.
>
> If you're talking about a redirection array as opposed to a tree -
> static sizing of the compressed space is also a no-go. Zswap
> utilization varies *widely* between workloads and different workload
> combinations. Further, zswap consumes the same fungible resource as
> uncompressed memory - there is really no excuse to burden users with
> static sizing questions about this pool.
I think Chris's idea (and Chris, correct me if I am wrong) is that we
use ghost swapfiles (that are not backed by disk space) for zswap. So
zswap has its own swapfiles, separate from disk swapfiles. swap.tiers
establishes the ordering between swapfiles, so you put
"ghost" -> "real" to get today's zswap writeback behavior. When you
write back, you keep the page tables pointing at the swap entry in the
ghost swapfile. What you do is:
- Allocate a new swap entry in the "real" swapfile.
- Update the swap table of the "ghost" swapfile to point at the swap
entry in the "real" swapfile, reusing the slot that normally holds the
swapcache pointer.
Then, on swapin, you read the swap table of the "ghost" swapfile, find
the redirection, follow it to the swap entry in the "real" swapfile,
and read the page from disk into the swap cache. The redirection in the
"ghost" swapfile keeps existing, wasting that slot, until all
references to it are dropped.
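In rough pseudo-C (every helper below is made up for illustration; only
swp_entry_t and swp_offset() are real kernel names, and the real
interface would be whatever the swap table code settles on):

	/*
	 * Writeback: the page tables keep pointing at the ghost entry;
	 * only the ghost swapfile's swap table changes.
	 */
	static swp_entry_t ghost_writeback(swp_entry_t ghost)
	{
		/* (1) allocate a slot in the "real" swapfile */
		swp_entry_t real = real_swapfile_alloc_slot();

		/*
		 * (2) store a redirection in the ghost swap table,
		 * reusing the slot that normally holds the swapcache
		 * pointer.
		 */
		ghost_swap_table_store(swp_offset(ghost),
				       make_redirect_entry(real));
		return real;
	}

	/* Swapin: follow the redirection if the entry was written back. */
	static struct folio *ghost_swapin(swp_entry_t ghost)
	{
		void *ent = ghost_swap_table_load(swp_offset(ghost));

		if (is_redirect_entry(ent))
			return swapin_from_disk(to_real_entry(ent));

		return load_from_zswap(ghost);	/* still compressed in RAM */
	}
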
I think this might work for this specific use case, with less overhead
than the xarray. BUT there are a few scenarios that are not covered
AFAICT:
- You still need to statically size the ghost swapfiles and their
overheads.
- Wasting a slot in the ghost swapfile for the redirection. This
complicates static provisioning a bit, because you have to account for
entries that will be in zswap as well as entries that have been written
back. Furthermore, IIUC swap.tiers is intended to be generic and cover
other use cases beyond zswap, like SSD -> HDD. For that, I think
wasting a slot on the SSD when we write back to the HDD is a much
bigger problem.
- We still cannot do swapoff efficiently, as we need to walk the page
tables (and some swap tables) to find and swap in all entries in a
swapfile. Not as important as the other points, but worth mentioning.
Chris, please let me know if I didn't get this right.