Message-ID: <CAKEwX=Pq=9nLb+SrTXkBWH2yyoYzzOSJqdeASweFh+EpEokKzg@mail.gmail.com>
Date: Mon, 24 Nov 2025 12:24:29 -0800
From: Nhat Pham <nphamcs@...il.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Johannes Weiner <hannes@...xchg.org>, Chris Li <chrisl@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>, Kairui Song <kasong@...cent.com>,
Kemeng Shi <shikemeng@...weicloud.com>, Baoquan He <bhe@...hat.com>,
Barry Song <baohua@...nel.org>, Chengming Zhou <chengming.zhou@...ux.dev>, linux-mm@...ck.org,
Rik van Riel <riel@...riel.com>, linux-kernel@...r.kernel.org, pratmal@...gle.com,
sweettea@...gle.com, gthelen@...gle.com, weixugc@...gle.com
Subject: Re: [PATCH RFC] mm: ghost swapfile support for zswap
On Mon, Nov 24, 2025 at 11:32 AM Yosry Ahmed <yosry.ahmed@...ux.dev> wrote:
>
> On Mon, Nov 24, 2025 at 12:27:17PM -0500, Johannes Weiner wrote:
> > On Fri, Nov 21, 2025 at 05:52:09PM -0800, Chris Li wrote:
> > > On Fri, Nov 21, 2025 at 3:40 AM Johannes Weiner <hannes@...xchg.org> wrote:
> > > >
> > > > On Fri, Nov 21, 2025 at 01:31:43AM -0800, Chris Li wrote:
> > > > > The current zswap requires a backing swapfile. A swap slot used by
> > > > > zswap cannot be used by the swapfile itself, which wastes swapfile
> > > > > space.
> > > > >
> > > > > A ghost swapfile is a swapfile that contains only the swapfile
> > > > > header, for use by zswap. The swapfile header indicates the size of
> > > > > the swapfile. There is no swap data section in a ghost swapfile, so
> > > > > no swapfile space is wasted; since there is nothing to write to, any
> > > > > write to a ghost swapfile will fail. To prevent accidental reads or
> > > > > writes of a ghost swapfile, the bdev of its swap_info_struct is set
> > > > > to NULL. A ghost swapfile also sets the SSD flag because there is no
> > > > > rotating disk access when using zswap.
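As an aside for folks following along, here is a toy userspace model of
what "header only, no data section" means for I/O. The struct and field
names below are made up for illustration; this is not the kernel's swap
header or swap_info_struct layout:

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for a ghost swapfile: a header that advertises nr_pages
 * of swap space with no data section behind it. */
struct toy_ghost_header {
	unsigned long nr_pages;		/* size the header advertises */
	bool has_data_section;		/* always false for a ghost file */
};

/* Any real device I/O against a ghost swapfile must fail; the patch
 * approximates this by leaving swap_info_struct->bdev == NULL. */
static int toy_swap_rw(const struct toy_ghost_header *h)
{
	if (!h->has_data_section)
		return -1;		/* no backing store, reject I/O */
	return 0;			/* a real swapfile would do disk I/O */
}

int main(void)
{
	struct toy_ghost_header ghost = {
		.nr_pages = 1UL << 20,
		.has_data_section = false,
	};

	printf("I/O against a ghost swapfile -> %d\n", toy_swap_rw(&ghost));
	return 0;
}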
> > > >
> > > > Zswap is primarily a compressed cache for real swap on secondary
> > > > storage. It's indeed quite important that entries currently in zswap
> > > > don't occupy disk slots; but for a solution to this to be acceptable,
> > > > it has to work with the primary use case and support disk writeback.
> > >
> > > Well, my plan is to support the writeback via swap.tiers.
> >
> > Do you have a link to that proposal?
> >
> > My understanding of swap tiers was about grouping different swapfiles
> > and assigning them to cgroups. The issue with writeback is relocating
> > the data that a swp_entry_t in a page table refers to - without having to
> > find and update all the possible page tables. I'm not sure how
> > swap.tiers solves this problem.
> >
> > > > This direction is a dead-end. Please take a look at Nhat's swap
> > > > virtualization patches. They decouple zswap from disk geometry, while
> > > > still supporting writeback to an actual backend file.
> > >
> > > Yes, there are many ways to decouple zswap from disk geometry; my swap
> > > table + swap.tiers design can do that as well. My concern about swap
> > > virtualization is that it adds another layer of per-swap-entry memory
> > > overhead, plus the CPU overhead of an extra xarray lookup. I believe
> > > my approach is technically superior: both faster and cleaner.
> > > Basically swap.tiers + VFS-like swap read/write page ops. I will let
> > > Nhat clarify the performance and memory overhead side of swap
> > > virtualization.
> >
> > I'm happy to discuss it.
> >
> > But keep in mind that the swap virtualization idea is a collaborative
> > product of quite a few people with an extensive combined upstream
> > record. Quite a bit of thought has gone into balancing static vs
> > runtime costs of that proposal. So you'll forgive me if I'm a bit
> > skeptical of the somewhat grandiose claims of one person who is new
> > to upstream development.
> >
> > As to your specific points - we use xarray lookups in the page cache
> > fast path. It's a bold claim to say this would be too much overhead
> > during swapins.
> >
> > Two, it's not clear to me how you want to make writeback efficient
> > *without* any sort of swap entry redirection. Walking all relevant
> > page tables is expensive; and you have to be able to find them first.
> >
> > If you're talking about a redirection array as opposed to a tree -
> > static sizing of the compressed space is also a no-go. Zswap
> > utilization varies *widely* between workloads and different workload
> > combinations. Further, zswap consumes the same fungible resource as
> > uncompressed memory - there is really no excuse to burden users with
> > static sizing questions about this pool.
>
> I think Chris's idea (and Chris, correct me if I am wrong) is that we
> use ghost swapfiles (which are not backed by disk space) for
> zswap. So zswap has its own swapfiles, separate from disk swapfiles.
>
> swap.tiers establishes the ordering between swapfiles, so you put
> "ghost" -> "real" to get today's zswap writeback behavior. When you
> write back, you keep page tables pointing at the swap entry in the ghost
> swapfile. What you do is:
> - Allocate a new swap entry in the "real" swapfile.
> - Update the swap table of the "ghost" swapfile to point at the swap
> entry in the "real" swapfile, reusing the pointer used for the
> swapcache.
>
> Then, on swapin, you read the swap table of the "ghost" swapfile, find
> the redirection, go to the swap table of the "real" swapfile, and then
> read the page from disk into the swap cache. The redirection in the
> "ghost" swapfile will keep existing, wasting that slot, until all
> references to it are dropped.
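To make sure I follow, here is a tiny userspace sketch of the
redirection scheme as you describe it. Every name below is invented for
illustration; this is not the swap table or swap.tiers code:

#include <stdio.h>

enum ghost_state { GHOST_IN_ZSWAP, GHOST_REDIRECT };

/* One entry in the toy "ghost" swap table: either the data still lives
 * in zswap, or it has been written back and the entry redirects to a
 * slot in the "real" swapfile. */
struct ghost_entry {
	enum ghost_state state;
	unsigned long real_slot;	/* valid when state == GHOST_REDIRECT */
};

#define GHOST_SLOTS 8UL
#define REAL_SLOTS  8UL

static struct ghost_entry ghost_table[GHOST_SLOTS];
static int real_used[REAL_SLOTS];

/* Writeback: allocate a slot in the "real" swapfile and record the
 * redirection.  Page tables keep pointing at the ghost slot. */
static int ghost_writeback(unsigned long ghost_slot)
{
	for (unsigned long i = 0; i < REAL_SLOTS; i++) {
		if (!real_used[i]) {
			real_used[i] = 1;
			ghost_table[ghost_slot].state = GHOST_REDIRECT;
			ghost_table[ghost_slot].real_slot = i;
			return 0;
		}
	}
	return -1;	/* "real" swapfile is full */
}

/* Swapin: the fault still sees the ghost slot; follow the redirection
 * if the data was written back.  The ghost slot stays pinned until all
 * references are dropped, which is the wasted-slot problem below. */
static void ghost_swapin(unsigned long ghost_slot)
{
	struct ghost_entry *e = &ghost_table[ghost_slot];

	if (e->state == GHOST_IN_ZSWAP)
		printf("ghost slot %lu: decompress from zswap\n", ghost_slot);
	else
		printf("ghost slot %lu: read real slot %lu from disk\n",
		       ghost_slot, e->real_slot);
}

int main(void)
{
	ghost_swapin(3);	/* still in zswap */
	ghost_writeback(3);	/* write back to the "real" swapfile */
	ghost_swapin(3);	/* now follows the redirection */
	return 0;
}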
>
> I think this might work for this specific use case, with less overhead
> than the xarray. BUT there are a few scenarios that are not covered
> AFAICT:
Thanks for explaining these issues better than I could :)
>
> - You still need to statically size the ghost swapfiles and their
> overheads.
Yes.
>
> - Wasting a slot in the ghost swapfile for the redirection. This
> complicates static provisioning a bit, because you have to account for
> entries that will be in zswap as well as written back. Furthermore,
> IIUC swap.tiers is intended to be generic and cover other use cases
> beyond zswap like SSD -> HDD. For that, I think wasting a slot in the
> SSD when we write back to the HDD is a much bigger problem.
Yep. We are trying to get away from static provisioning as much as we
can - this design digs us deeper into the hole. Who the hell knows what
the zswap:disk swap split is going to be? It's going to depend on
access patterns and compressibility.
>
> - We still cannot do swapoff efficiently as we need to walk the page
> tables (and some swap tables) to find and swap in all entries in a
> swapfile. Not as important as other things, but worth mentioning.
Yeah, I left swapoff out of it because it is just another use case.
But yes, we can't easily do swapoff efficiently either.
And in general, it's going to be a very rigid design for more
complicated backend changes (prefetching from one tier to another, or
compaction).