Message-ID: <CACePvbVuhvw2p6vaOd7YOsXOeS4-K2TPW=P2jhjtrNEvRZd64g@mail.gmail.com>
Date: Tue, 25 Nov 2025 22:26:32 +0400
From: Chris Li <chrisl@...nel.org>
To: Nhat Pham <nphamcs@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Kairui Song <kasong@...cent.com>,
Kemeng Shi <shikemeng@...weicloud.com>, Baoquan He <bhe@...hat.com>,
Barry Song <baohua@...nel.org>, Johannes Weiner <hannes@...xchg.org>,
Yosry Ahmed <yosry.ahmed@...ux.dev>, Chengming Zhou <chengming.zhou@...ux.dev>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, pratmal@...gle.com, sweettea@...gle.com,
gthelen@...gle.com, weixugc@...gle.com
Subject: Re: [PATCH RFC] mm: ghost swapfile support for zswap
On Mon, Nov 24, 2025 at 5:47 PM Nhat Pham <nphamcs@...il.com> wrote:
>
> On Fri, Nov 21, 2025 at 5:52 PM Chris Li <chrisl@...nel.org> wrote:
> >
> > On Fri, Nov 21, 2025 at 2:19 AM Nhat Pham <nphamcs@...il.com> wrote:
> > >
> > > On Fri, Nov 21, 2025 at 9:32 AM Chris Li <chrisl@...nel.org> wrote:
> > > >
> > > > The current zswap requires a backing swapfile. A swap slot used
> > > > by zswap cannot be used by the swapfile itself, which wastes
> > > > swapfile space.
> > > >
> > > > A ghost swapfile is a swapfile that contains only the swapfile
> > > > header, for use by zswap. The swapfile header indicates the size of
> > > > the swapfile. There is no swap data section in a ghost swapfile,
> > > > therefore no swapfile space is wasted. As such, any write to a ghost
> > > > swapfile will fail. To prevent accidental reads or writes of a ghost
> > > > swapfile, the bdev of swap_info_struct is set to NULL. A ghost
> > > > swapfile also sets the SSD flag, because there is no rotational disk
> > > > access when using zswap.
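
(To make the on-disk format concrete: a ghost swapfile could be as small
as a single page holding the standard swap header, with last_page
advertising the full size. The sketch below is illustrative only; it uses
the stock header layout from include/linux/swap.h, does not show whatever
ghost-marking convention the patch settles on, and a stock kernel would
simply truncate the advertised size back down to the real file size.)

/*
 * Illustrative only: write a one-page file containing nothing but a
 * standard swap header that claims NR_PAGES of swap space.
 * Field offsets follow union swap_header in include/linux/swap.h.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096
#define NR_PAGES  (1024 * 1024)		/* advertise 4 GiB of swap */

int main(int argc, char **argv)
{
	uint32_t page[PAGE_SIZE / sizeof(uint32_t)] = { 0 };
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <ghost-swapfile>\n", argv[0]);
		return 1;
	}

	page[256] = 1;			/* version (offset 1024, after bootbits) */
	page[257] = NR_PAGES - 1;	/* last_page: the size the header claims */
	page[258] = 0;			/* nr_badpages */

	/* magic occupies the last 10 bytes of the first page */
	memcpy((char *)page + PAGE_SIZE - 10, "SWAPSPACE2", 10);

	fd = open(argv[1], O_CREAT | O_TRUNC | O_WRONLY, 0600);
	if (fd < 0 || write(fd, page, sizeof(page)) != sizeof(page)) {
		perror(argv[1]);
		return 1;
	}
	close(fd);
	return 0;
}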
> > >
> > > Would this also affect the swap slot allocation algorithm?
> > >
> > > >
> > > > Zswap writeback is disabled if all swapfiles in the system are
> > > > ghost swapfiles.
> > >
> > > I don't like this design:
> > >
> > > 1. Statically sizing the compression tier will be an operational
> > > nightmare for users that have to support a variety of (and
> > > increasingly larger) host types. This is one of the primary
> > > motivations of the virtual swap line of work. We need to move towards
> > > a more dynamic architecture for zswap, not the other way around, in
> > > order to reduce both the (human) operational overhead AND the actual
> > > space overhead (i.e. only allocate (z)swap metadata on demand).
> >
> > Let's do it one step at a time.
>
> I'm happy with landing these patches one step at a time. But from my
> POV (and admittedly limited imagination), it's a bit of a dead end.
>
> The only architecture, IMO, that satisfies:
>
> 1. Dynamic overhead of (z)swap metadata.
>
> 2. Decouple swap backends, i.e. no pre-reservation of lower-tier space
> (which is what zswap does right now).
>
> 3. Backend transfer without page table walks.
>
> is swap virtualization.
>
> If you want to present an alternative vision, you don't have to
> implement it right away, but you have to at least explain to me how to
> achieve all three of these.
From 1, 2, 3 to SV as the only solution is a big jump. How many
possibilities have you explored to conclude that no other solution can
satisfy your 1, 2, 3?
I just replied to Rik's email with a high-level sketch of my design. It
should satisfy these requirements and can serve as one counterexample:
an alternative design.
>
> >
> > > 2. This digs us into the hole of supporting a special infrastructure
> > > for non-writeback cases. Now every future change to zswap's
> > > architecture has to take this into account. It's not easy to turn
> > > this design into something that can support writeback - you're stuck
> > > with either doing an expensive page table walk to update the PTEs, or
> > > shoving the virtual swap layer inside zswap. Ugly.
> >
> > What are you talking about? This patch does not have any page table
> > work. You are opposing something in your imagination. Please show me
> > the code in which I do expensive PTE walks.
>
> Please read my response again. I did not say you did any PTE walk in this patch.
>
> What I meant was, if you want to make this the general architecture
> for zswap and not some niche infrastructure for a specialized use case,
> you need to be able to support backend transfer, i.e. zswap writeback
> (zswap -> disk swap, and perhaps in the future the other direction).
> This will be very expensive with this design.
I can't say I agree with you. It seems you have made a lot of
assumptions in your reasoning.
> > > 3. And what does this even buy us? Just create a fake in-memory-only
> > > swapfile (heck, you can use zram), disable writeback (which you can
> > > do at both the cgroup and host level), and call it a day.
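
(For reference, the per-cgroup knob being referred to here is
memory.zswap.writeback. A minimal sketch of flipping it off, assuming
cgroup v2 is mounted at /sys/fs/cgroup and a cgroup named "example"
already exists; in practice a plain echo does the same thing:)

/* Sketch: disable zswap writeback for one cgroup.  Assumes cgroup v2 at
 * /sys/fs/cgroup and an existing cgroup named "example". */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/example/memory.zswap.writeback", "w");

	if (!f) {
		perror("memory.zswap.writeback");
		return 1;
	}
	/* 0: zswap entries charged to this cgroup are never written back */
	fputs("0\n", f);
	return fclose(f) ? 1 : 0;
}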
> >
> > Well, this gives users a choice if they don't care about writeback.
> > They can use zswap with a ghost swapfile now without actually wasting
> > disk space.
> >
> > It also does not stop zswap from using writeback with a normal SSD. If
> > you want writeback, you can still use a non-ghost swapfile as usual.
> >
> > It is a simple enough patch to provide value right now. It also fits
> > into the swap.tiers long-term roadmap of having a separate tier for
> > memory-based swapfiles. I believe that is a cleaner picture than the
> > current zswap, which acts as a cache but also gets its hands deep into
> > the swap stack and slows down other swap tiers.
> >
> > > Nacked-by: Nhat Pham <nphamcs@...il.com>
> >
> > I hear you. If you don't want zswap to have anything to do with the
> > memory-based swap tier in the swap.tiers design, I respect your
> > choice.
>
> Where does this even come from?
>
> I can't speak for Johannes or Yosry, but personally I'm ambivalent
> with respect to swap.tiers. My only objection in the past was that
> there was no use case at the time, but there seems to be one now. I
> won't stand in the way of swap.tiers landing, or of zswap's
> integration into it.
>
> From my POV, swap.tiers solves a problem completely orthogonal to what
> I'm trying to solve, namely the three points listed above. It's about
> defining the swap hierarchy, either at initial placement time or
> during offloading from one backend to another, whereas I'm trying to
> figure out the mechanistic side of it (how to transfer a page from one
> backend to another without page table walking). These two are
> independent, if not synergistic.
I think our goals overlap; we just take different approaches with
different performance characteristics.
I have asked a few times in this thread: how big is the per-swap-slot
memory overhead that VS (virtual swap) introduces?
That is something that I care about a lot.
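
To make the question concrete, here is a back-of-envelope sketch. The
bytes-per-slot values are hypothetical placeholders, not measurements of
the virtual swap series; the real number is exactly what I am asking for:

/*
 * Back-of-envelope only: how per-swap-slot metadata scales with swap
 * size.  The bytes-per-slot values below are hypothetical placeholders,
 * not measurements of any particular series.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long long swap_bytes = 1ULL << 40;	/* 1 TiB of swap space */
	const unsigned long long page_size  = 4096;
	const unsigned long long nr_slots   = swap_bytes / page_size;	/* ~256M slots */

	for (unsigned long long per_slot = 8; per_slot <= 64; per_slot *= 2)
		printf("%3llu bytes/slot -> %5llu MiB of metadata\n",
		       per_slot, (nr_slots * per_slot) >> 20);
	return 0;
}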
Chris