Message-ID: <CAKEwX=NFrWyFkyd5XhXEb_qYtWBk4yPUMPPJN0qHPAvzPUq_Dg@mail.gmail.com>
Date: Sun, 1 Jun 2025 14:08:22 -0700
From: Nhat Pham <nphamcs@...il.com>
To: YoungJun Park <youngjun.park@....com>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org, hannes@...xchg.org,
hughd@...gle.com, yosry.ahmed@...ux.dev, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev,
len.brown@...el.com, chengming.zhou@...ux.dev, kasong@...cent.com,
chrisl@...nel.org, huang.ying.caritas@...il.com, ryan.roberts@....com,
viro@...iv.linux.org.uk, baohua@...nel.org, osalvador@...e.de,
lorenzo.stoakes@...cle.com, christophe.leroy@...roup.eu, pavel@...nel.org,
kernel-team@...a.com, linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
linux-pm@...r.kernel.org, peterx@...hat.com, gunho.lee@....com,
taejoon.song@....com, iamjoonsoo.kim@....com
Subject: Re: [RFC PATCH v2 00/18] Virtual Swap Space
On Sun, Jun 1, 2025 at 5:56 AM YoungJun Park <youngjun.park@....com> wrote:
>
> On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@....com> wrote:
> > >
> > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > Changelog:
> > > > * v2:
> > > > * Use a single atomic type (swap_refs) for reference counting
> > > > purposes. This brings the size of the swap descriptor from 64 bytes
> > > > down to 48 bytes (25% reduction). Suggested by Yosry Ahmed.
> > > > * Zeromap bitmap is removed in the virtual swap implementation.
> > > > This saves one bit per physical swapfile slot.
> > > > * Rearrange the patches and the code change to make things more
> > > > reviewable. Suggested by Johannes Weiner.
> > > > * Update the cover letter a bit.
> > >
> > > Hi Nhat,
> > >
> > > Thank you for sharing this patch series.
> > > I’ve read through it with great interest.
> > >
> > > I’m part of a kernel team working on features related to multi-tier swapping,
> > > and this patch set appears quite relevant
> > > to our ongoing discussions and early-stage implementation.
> >
> > May I ask - what's the use case you're thinking of here? Remote swapping?
> >
>
> Yes, that's correct.
> Our usage scenario includes remote swap,
> and we're experimenting with assigning swap tiers per cgroup
> in order to improve performance in specific scenarios on our target devices.
Hmm, that can be a start. Right now, we essentially have only 2 swap
tiers, so memory.(z)swap.max and memory.zswap.writeback are usually
sufficient to describe the tiering interface. But if you have an
alternative use case in mind, feel free to send an RFC to explore
this!
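
For reference, here is a minimal userspace sketch of how those two
knobs already express the two tiers today. The cgroup path
/sys/fs/cgroup/example and the limit values below are made up for
illustration; they are not part of this series.

/*
 * Hypothetical example: configure the two existing "tiers" for one
 * cgroup via the cgroup v2 memory controller knobs.
 */
#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        if (fputs(val, f) == EOF) {
                perror(path);
                fclose(f);
                return -1;
        }
        return fclose(f);
}

int main(void)
{
        /* Tier 1: allow up to 512M of compressed memory in zswap. */
        write_knob("/sys/fs/cgroup/example/memory.zswap.max", "536870912");
        /* Allow writeback from zswap to the backing swapfile... */
        write_knob("/sys/fs/cgroup/example/memory.zswap.writeback", "1");
        /* ...and cap how much swap space overall this cgroup may use. */
        write_knob("/sys/fs/cgroup/example/memory.swap.max", "2147483648");
        return 0;
}

Setting memory.zswap.writeback to 0 instead pins the cgroup's swapped
pages in the zswap tier, which is roughly all the per-cgroup tier
control the current interface gives us.
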
>
> We’ve explored several approaches and PoCs around this,
> and in the process of evaluating
> whether our direction could eventually be aligned
> with the upstream kernel,
> I came across your patchset and wanted to ask whether
> similar efforts have been discussed or attempted before.
I think it is occasionally touched upon in discussions, but AFAICS
there has not really been an actual upstream patch to add such an
interface.
>
> > >
> > > I had a couple of questions regarding the future direction.
> > >
> > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > > transferring (promotion/demotion) of pages across tiers (see [8] and
> > > > [9]). Similar to swapoff, with the old design we would need to
> > > > perform the expensive page table walk.
> > >
> > > Based on the discussion in [5], it seems there was some exploration
> > > around enabling per-cgroup selection of multiple tiers.
> > > Do you envision the current design evolving in a similar direction
> > > to those past discussions, or is there a different direction you're aiming for?
> >
> > IIRC, that past design focused on the interface aspect of the problem,
> > but never actually touched the mechanism to implement a multi-tier
> > swapping solution.
> >
> > The simple reason is that it's impossible, or at least highly inefficient,
> > to do it in the current design, i.e. without virtualizing swap. Storing
>
> As you pointed out, there are certainly inefficiencies
> in supporting this use case with the current design,
> but if there is a valid use case,
> I believe there’s room for it to be supported in the current model
> —possibly in a less optimized form—
> until a virtual swap device becomes available
> and provides a more efficient solution.
> What do you think?
Which less optimized form are you thinking of?
>
> > the physical swap location in PTEs means that changing the swap
> > backend requires a full page table walk to update all the PTEs that
> > refer to the old physical swap location. So you have to pick your
> > poison - either:
> > 1. Pick your backend at swap out time, and never change it. You might
> > not have sufficient information to decide at that time. It prevents
> > you from adapting to the change in workload dynamics and working set -
> > the access frequency of pages might change, so their physical location
> > should change accordingly.
> >
> > 2. Reserve the space in every tier, and associate them with the same
> > handle. This is kinda what zswap is doing. It is space inefficient, and
> > creates a lot of operational issues in production.
> >
> > 3. Bite the bullet and perform the page table walk. This is what
> > swapoff is doing, basically. Raise your hands if you're excited about
> > a full page table walk every time you want to evict a page from zswap
> > to disk swap. Booo.
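
To make the above concrete, here is a deliberately simplified,
userspace-only model of what storing the physical location directly
in the PTE implies. All types and names below are invented for
illustration; this is not kernel code.

/*
 * Toy model: the "PTE" encodes the physical swap location (backend id
 * plus slot offset), the way a swap entry does today.
 */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define NR_PTES 8

struct toy_pte {
        uint8_t  backend;       /* which swapfile/tier holds the data */
        uint32_t offset;        /* slot within that backend */
};

static struct toy_pte page_tables[NR_PTES];     /* stand-in for all PTEs */

/*
 * Moving one swapped-out page from one backend to another means every
 * PTE that encodes (old_be, old_off) has to be found and rewritten -
 * i.e. a full page table walk, which is option 3 above.
 */
static void migrate_slot(uint8_t old_be, uint32_t old_off,
                         uint8_t new_be, uint32_t new_off)
{
        for (size_t i = 0; i < NR_PTES; i++) {
                if (page_tables[i].backend == old_be &&
                    page_tables[i].offset == old_off) {
                        page_tables[i].backend = new_be;
                        page_tables[i].offset = new_off;
                }
        }
}

int main(void)
{
        page_tables[3] = (struct toy_pte){ .backend = 0, .offset = 42 };
        migrate_slot(0, 42, 1, 7);      /* e.g. demote from zswap to disk */
        printf("pte 3 -> backend %u, offset %u\n",
               (unsigned)page_tables[3].backend,
               (unsigned)page_tables[3].offset);
        return 0;
}
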
> >
> > This new design will give us an efficient way to perform tier transfer
> > - you need to figure out how to obtain the right to perform the
> > transfer (for now, through the swap cache - but you can perhaps
> > envision some sort of locks), and then you can simply make the change
> > at the virtual layer.
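
For contrast, here is an equally simplified sketch of the indirection;
again, every name is invented and this is not the actual swap
descriptor from the series. The PTE holds a stable virtual slot id,
and the tier transfer only touches that slot's descriptor:

/*
 * Toy model: PTEs point at a virtual slot; a per-slot descriptor
 * records where the data currently lives.
 */
#include <stdio.h>
#include <stdint.h>

#define NR_VIRTUAL_SLOTS 1024

struct toy_swap_desc {
        uint8_t  backend;       /* zswap, local swapfile, remote swap, ... */
        uint32_t offset;        /* slot within that backend */
};

static struct toy_swap_desc descs[NR_VIRTUAL_SLOTS];

static void transfer_tier(uint32_t virt_slot, uint8_t new_be, uint32_t new_off)
{
        struct toy_swap_desc *d = &descs[virt_slot];

        /*
         * In the real design we would first obtain exclusive access
         * (today via the swap cache, or some future per-slot lock) and
         * copy the data to the new backend; only then is the
         * descriptor updated. PTEs keep pointing at virt_slot and
         * never need to change.
         */
        d->backend = new_be;
        d->offset = new_off;
}

int main(void)
{
        descs[42] = (struct toy_swap_desc){ .backend = 0, .offset = 1234 };
        transfer_tier(42, 1, 7);        /* e.g. demote from zswap to disk */
        printf("virtual slot 42 -> backend %u, offset %u\n",
               (unsigned)descs[42].backend, (unsigned)descs[42].offset);
        return 0;
}

Promotion/demotion thus becomes a data copy plus one descriptor
update, with no page table walk.
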
> >
>
> One idea that comes to mind is whether the backend swap tier for
> a page could be lazily adjusted at runtime—either reactively
> or via an explicit interface—before the tier changes.
> Alternatively, if it's preferable to leave pages untouched
> when the tier configuration changes at runtime,
> perhaps we could consider making this behavior configurable as well.
>
I don't quite understand - could you expand on this?