Message-ID: <20251124172717.GA476776@cmpxchg.org>
Date: Mon, 24 Nov 2025 12:27:17 -0500
From: Johannes Weiner <hannes@...xchg.org>
To: Chris Li <chrisl@...nel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Kairui Song <kasong@...cent.com>,
Kemeng Shi <shikemeng@...weicloud.com>,
Nhat Pham <nphamcs@...il.com>, Baoquan He <bhe@...hat.com>,
Barry Song <baohua@...nel.org>, Yosry Ahmed <yosry.ahmed@...ux.dev>,
Chengming Zhou <chengming.zhou@...ux.dev>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, pratmal@...gle.com,
sweettea@...gle.com, gthelen@...gle.com, weixugc@...gle.com
Subject: Re: [PATCH RFC] mm: ghost swapfile support for zswap

On Fri, Nov 21, 2025 at 05:52:09PM -0800, Chris Li wrote:
> On Fri, Nov 21, 2025 at 3:40 AM Johannes Weiner <hannes@...xchg.org> wrote:
> >
> > On Fri, Nov 21, 2025 at 01:31:43AM -0800, Chris Li wrote:
> > > The current zswap requires a backing swapfile. A swap slot used
> > > by zswap cannot be used by the swapfile. That wastes swapfile
> > > space.
> > >
> > > A ghost swapfile is a swapfile that contains only the swapfile
> > > header, for use by zswap. The swapfile header indicates the size of
> > > the swapfile. There is no swap data section in a ghost swapfile, and
> > > therefore no waste of swapfile space. As such, any write to a ghost
> > > swapfile will fail. To prevent accidental reads or writes of a ghost
> > > swapfile, the bdev of swap_info_struct is set to NULL. A ghost
> > > swapfile also sets the SSD flag because there is no rotating disk
> > > access when using zswap.
> >
> > Zswap is primarily a compressed cache for real swap on secondary
> > storage. It's indeed quite important that entries currently in zswap
> > don't occupy disk slots; but for a solution to this to be acceptable,
> > it has to work with the primary usecase and support disk writeback.
>
> Well, my plan is to support the writeback via swap.tiers.

Do you have a link to that proposal?

My understanding of swap tiers was about grouping different swapfiles
and assigning them to cgroups. The issue with writeback is relocating
the data that a swp_entry_t in a page table refers to - without having
to find and update all the possible page tables. I'm not sure how
swap.tiers solves this problem.
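
To make the page table issue concrete, here is a tiny standalone
sketch that mimics the swp_entry() encoding from
include/linux/swapops.h (the bit layout below is illustrative, not
the kernel's actual one): once device and slot are baked into the
PTE value, moving the data means finding and rewriting every such
PTE, unless there is an indirection in between.

/*
 * Standalone sketch, not kernel code: mimics how a swap PTE encodes
 * the swap device (type) and slot (offset).  Field widths are
 * illustrative.
 */
#include <stdio.h>

#define SWP_TYPE_SHIFT	58			/* illustrative */
#define SWP_OFFSET_MASK	((1UL << SWP_TYPE_SHIFT) - 1)

static unsigned long swp_entry(unsigned int type, unsigned long offset)
{
	return ((unsigned long)type << SWP_TYPE_SHIFT) |
	       (offset & SWP_OFFSET_MASK);
}

int main(void)
{
	/*
	 * A swapped-out page's PTE stores which device and which slot
	 * the data lives in.  If zswap later writes the compressed copy
	 * to a different device/slot, every page table holding this
	 * value would have to be updated - unless the PTE points at a
	 * stable virtual slot that is redirected elsewhere.
	 */
	unsigned long pte_val = swp_entry(1, 0x2a);

	printf("swap PTE value: %#lx\n", pte_val);
	return 0;
}
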
> > This direction is a dead-end. Please take a look at Nhat's swap
> > virtualization patches. They decouple zswap from disk geometry, while
> > still supporting writeback to an actual backend file.
>
> Yes, there are many ways to decouple zswap from disk geometry; my swap
> table + swap.tiers design can do that as well. I have concerns about
> swap virtualization adding another layer of per-swap-entry memory
> overhead and the CPU overhead of an extra xarray lookup. I believe my
> approach is technically superior and cleaner. Both faster and cleaner.
> Basically swap.tiers + VFS-like swap read/write page ops. I will let
> Nhat clarify the performance and memory overhead side of swap
> virtualization.

I'm happy to discuss it.

But keep in mind that the swap virtualization idea is a collaborative
product of quite a few people with an extensive combined upstream
record. Quite a bit of thought has gone into balancing static vs
runtime costs of that proposal. So you'll forgive me if I'm a bit
skeptical of the somewhat grandiose claims of one person who is new
to upstream development.

As to your specific points: one, we use xarray lookups in the page cache
fast path. It's a bold claim to say this would be too much overhead
during swapins.
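
For reference, this is the kind of lookup being discussed: a single
xa_load() per swapin, the same primitive the page cache fast path
relies on. A minimal kernel-style sketch (the names swap_desc and
swap_desc_tree are made up for illustration, not taken from the
actual series):

/*
 * Sketch only, not a patch: one xa_load() per swapin, analogous to
 * the page cache's mapping->i_pages lookup.
 */
#include <linux/xarray.h>

struct swap_desc;

static struct swap_desc *swap_desc_lookup(struct xarray *swap_desc_tree,
					  unsigned long virt_slot)
{
	/* xa_load() is a lockless, RCU-protected read, like the page cache */
	return xa_load(swap_desc_tree, virt_slot);
}
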
Two, it's not clear to me how you want to make writeback efficient
*without* any sort of swap entry redirection. Walking all relevant
page tables is expensive; and you have to be able to find them first.

If you're talking about a redirection array as opposed to a tree -
static sizing of the compressed space is also a no-go. Zswap
utilization varies *widely* between workloads and different workload
combinations. Further, zswap consumes the same fungible resource as
uncompressed memory - there is really no excuse to burden users with
static sizing questions about this pool.
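
For completeness, here is roughly what I mean by redirection - a
hypothetical sketch, again with made-up names rather than code from
the swap virtualization series. PTEs hold a virtual slot that never
changes; writeback only repoints the descriptor, and descriptors are
created on demand (xa_store()), so the tree's footprint follows
actual usage instead of a statically sized pool:

/*
 * Hypothetical sketch: data can move from zswap to a disk slot
 * without touching any page table, because PTEs reference the
 * virtual slot, not the physical location.
 */
#include <linux/xarray.h>

struct swap_desc {
	bool in_zswap;
	union {
		void *zswap_entry;		/* compressed copy */
		unsigned long disk_slot;	/* slot on the backing device */
	};
};

static void swap_desc_writeback(struct xarray *swap_desc_tree,
				unsigned long virt_slot,
				unsigned long disk_slot)
{
	struct swap_desc *desc = xa_load(swap_desc_tree, virt_slot);

	/* Every PTE holding virt_slot stays valid across the move. */
	desc->disk_slot = disk_slot;
	desc->in_zswap = false;
}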