Message-ID: <aDxN6oz86TD5H4IL@yjaykim-PowerEdge-T330>
Date: Sun, 1 Jun 2025 21:56:10 +0900
From: YoungJun Park <youngjun.park@....com>
To: Nhat Pham <nphamcs@...il.com>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org, hannes@...xchg.org,
hughd@...gle.com, yosry.ahmed@...ux.dev, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
muchun.song@...ux.dev, len.brown@...el.com,
chengming.zhou@...ux.dev, kasong@...cent.com, chrisl@...nel.org,
huang.ying.caritas@...il.com, ryan.roberts@....com,
viro@...iv.linux.org.uk, baohua@...nel.org, osalvador@...e.de,
lorenzo.stoakes@...cle.com, christophe.leroy@...roup.eu,
pavel@...nel.org, kernel-team@...a.com,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
linux-pm@...r.kernel.org, peterx@...hat.com, gunho.lee@....com,
taejoon.song@....com, iamjoonsoo.kim@....com
Subject: Re: [RFC PATCH v2 00/18] Virtual Swap Space
On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@....com> wrote:
> >
> > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > Changelog:
> > > * v2:
> > > * Use a single atomic type (swap_refs) for reference counting
> > > purpose. This brings the size of the swap descriptor from 64 bytes
> > > down to 48 bytes (25% reduction). Suggested by Yosry Ahmed.
> > > * Zeromap bitmap is removed in the virtual swap implementation.
> > > This saves one bit per physical swapfile slot.
> > > * Rearrange the patches and the code change to make things more
> > > reviewable. Suggested by Johannes Weiner.
> > > * Update the cover letter a bit.
> >
> > Hi Nhat,
> >
> > Thank you for sharing this patch series.
> > I’ve read through it with great interest.
> >
> > I’m part of a kernel team working on features related to multi-tier swapping,
> > and this patch set appears quite relevant
> > to our ongoing discussions and early-stage implementation.
>
> May I ask - what's the use case you're thinking of here? Remote swapping?
>
Yes, that's correct.
Our usage scenario includes remote swap, and we're experimenting
with assigning swap tiers per cgroup in order to improve
performance in specific scenarios on our target devices.
We've explored several approaches and PoCs around this, and while
evaluating whether our direction could eventually be aligned with
the upstream kernel, I came across your patchset and wanted to ask
whether similar efforts have been discussed or attempted before.
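
To make our experiment concrete, here is a purely illustrative
userspace model of the per-cgroup tier policy we are prototyping.
The tier names, the policy structure, and the selection rule are
all hypothetical; none of this reflects a posted interface:

#include <stdio.h>

enum swap_tier { TIER_ZSWAP, TIER_SSD, TIER_REMOTE, NR_TIERS };

static const char * const tier_name[NR_TIERS] = {
	"zswap", "ssd", "remote"
};

/* Per-cgroup policy: an ordered list of allowed tiers. */
struct cgroup_swap_policy {
	enum swap_tier order[NR_TIERS];
	int nr;
};

/* Pick the first preferred tier that still has free slots. */
static int pick_tier(const struct cgroup_swap_policy *p,
		     const long free_slots[NR_TIERS])
{
	for (int i = 0; i < p->nr; i++)
		if (free_slots[p->order[i]] > 0)
			return p->order[i];
	return -1;	/* no usable tier: fail or fall back globally */
}

int main(void)
{
	/* A latency-sensitive cgroup prefers zswap, then remote. */
	struct cgroup_swap_policy fg = { { TIER_ZSWAP, TIER_REMOTE }, 2 };
	long free_slots[NR_TIERS] = { 0, 1000, 5000 };	/* zswap full */

	int t = pick_tier(&fg, free_slots);
	printf("chosen tier: %s\n", t >= 0 ? tier_name[t] : "none");
	return 0;
}
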
> >
> > I had a couple of questions regarding the future direction.
> >
> > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > transferring (promotion/demotion) of pages across tiers (see [8] and
> > > [9]). Similar to swapoff, with the old design we would need to
> > > perform the expensive page table walk.
> >
> > Based on the discussion in [5], it seems there was some exploration
> > around enabling per-cgroup selection of multiple tiers.
> > Do you envision the current design evolving in a similar direction
> > to those past discussions, or is there a different direction you're aiming for?
>
> IIRC, that past design focused on the interface aspect of the problem,
> but never actually touched the mechanism to implement a multi-tier
> swapping solution.
>
> The simple reason is it's impossible, or at least highly inefficient
> to do it in the current design, i.e., without virtualizing swap. Storing
As you pointed out, there are certainly inefficiencies in
supporting this use case with the current design. Still, if the
use case is valid, I believe there is room to support it in the
current model, possibly in a less optimized form, until a virtual
swap device becomes available and provides a more efficient
solution. What do you think about this?
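
To make sure I follow the cost you describe below, here is my toy
userspace model of it. The bit layout is invented; the kernel's
real swp_entry_t encoding is per-architecture and different:

#include <stdio.h>

typedef unsigned long swp_entry;	/* toy stand-in for swp_entry_t */

#define OFFSET_BITS 58

static swp_entry make_entry(unsigned int type, unsigned long offset)
{
	return ((swp_entry)type << OFFSET_BITS) | offset;
}

static unsigned int entry_type(swp_entry e)
{
	return e >> OFFSET_BITS;
}

static unsigned long entry_offset(swp_entry e)
{
	return e & ((1UL << OFFSET_BITS) - 1);
}

int main(void)
{
	/* Several PTEs (possibly across processes) reference the
	 * same swapped-out page by its physical location. */
	swp_entry ptes[3];
	for (int i = 0; i < 3; i++)
		ptes[i] = make_entry(0 /* e.g. zswap */, 42);

	/* Demoting the page to another backend changes its physical
	 * location, so every referencing PTE must be rewritten -
	 * in the kernel, a full page table walk. */
	for (int i = 0; i < 3; i++)
		ptes[i] = make_entry(1 /* e.g. disk */, entry_offset(ptes[i]));

	printf("pte[0]: type=%u offset=%lu\n",
	       entry_type(ptes[0]), entry_offset(ptes[0]));
	return 0;
}
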
> the physical swap location in PTEs means that changing the swap
> backend requires a full page table walk to update all the PTEs that
> refer to the old physical swap location. So you have to pick your
> poison - either:
> 1. Pick your backend at swap out time, and never change it. You might
> not have sufficient information to decide at that time. It prevents
> you from adapting to the change in workload dynamics and working set -
> the access frequency of pages might change, so their physical location
> should change accordingly.
>
> 2. Reserve the space in every tier, and associate them with the same
> handle. This is kinda what zswap is doing. It is space inefficient,
> and it creates a lot of operational issues in production.
>
> 3. Bite the bullet and perform the page table walk. This is what
> swapoff is doing, basically. Raise your hands if you're excited about
> a full page table walk every time you want to evict a page from zswap
> to disk swap. Booo.
>
> This new design will give us an efficient way to perform tier transfer
> - you need to figure out how to obtain the right to perform the
> transfer (for now, through the swap cache - but you can perhaps
> envision some sort of locks), and then you can simply make the change
> at the virtual layer.
>
One idea that comes to mind is whether the backend swap tier of a
page could be lazily adjusted at runtime, either reactively or via
an explicit interface, when the tier configuration changes.
Alternatively, if it's preferable to leave already-swapped pages
untouched when the tier configuration changes at runtime, perhaps
we could consider making this behavior configurable as well.
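
If I understand the mechanism correctly, the benefit could be
sketched like this. This is a toy model only; these are not the
patchset's actual structures:

#include <stdio.h>

struct phys_slot {
	int tier;		/* e.g. 0 = zswap, 1 = disk */
	unsigned long offset;	/* slot within that tier */
};

/* Toy virtual swap table: PTEs hold a stable virtual slot id,
 * and only this table maps it to a physical location. */
static struct phys_slot vswap_table[1024];

/* Tier transfer: with exclusive ownership of the slot (via the
 * swap cache, as you describe, or some lock), rewrite one table
 * entry. No page table walk; PTEs keep the virtual slot id. */
static void transfer_tier(unsigned long vslot, int tier,
			  unsigned long offset)
{
	vswap_table[vslot].tier = tier;
	vswap_table[vslot].offset = offset;
}

int main(void)
{
	unsigned long vslot = 7;	/* what every PTE references */

	vswap_table[vslot] = (struct phys_slot){ .tier = 0, .offset = 42 };
	transfer_tier(vslot, 1, 99);	/* demote zswap -> disk */

	printf("vslot %lu -> tier %d, offset %lu\n",
	       vslot, vswap_table[vslot].tier, vswap_table[vslot].offset);
	return 0;
}
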
> >
> > > This idea is very similar to Kairui's work to optimize the (physical)
> > > swap allocator. He is currently also working on a swap redesign (see
> > > [11]) - perhaps we can combine the two efforts to take advantage of
> > > the swap allocator's efficiency for virtual swap.
> >
> > I noticed that your patch series appears to be aligned with Kairui's work.
> > It seems like the overall architecture may be headed toward introducing
> > a virtual swap device layer.
> > I'm curious whether there has already been any concrete discussion
> > around this abstraction, especially regarding how it might be layered
> > over multiple physical swap devices.
> >
> > From a naive perspective, I imagine that while today’s swap devices
> > are in a 1:1 mapping with physical devices,
> > this virtual layer could introduce a 1:N relationship —
> > one virtual swap device mapped to multiple physical ones.
> > Would this virtual device behave as a new swappable block device
> > exposed via `swapon`, or is the plan to abstract it differently?
>
> That was one of the ideas I was thinking of. Problem is this is a very
> special "device", and I'm not entirely sure opting in through swapon
> like that won't cause issues. Imagine the following scenario:
>
> 1. We swap on a normal swapfile.
>
> 2. Users swap things with the swapfile.
>
> 3. Sysadmin then swapons a virtual swap device.
>
> It will be quite nightmarish to manage things - we need to be extra
> vigilant in handling a physical swap slot, for example, since it can back a
> PTE or a virtual swap slot. Also, swapoff becomes less efficient
> again. And the physical swap allocator, even with the swap table
> change, doesn't quite work out of the box for virtual swap yet (see
> [1]).
>
> I think it's better to just keep it separate, for now, and adopt
> elements from Kairui's work to make virtual swap allocation more
> efficient. Not a hill I will die on, though.
>
> [1]: https://lore.kernel.org/linux-mm/CAKEwX=MmD___ukRrx=hLo7d_m1J_uG_Ke+us7RQgFUV2OSg38w@mail.gmail.com/
>
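
If I read the mixed-mode concern correctly, the bookkeeping
ambiguity could be restated with a toy sketch. All types and
fields here are hypothetical, just to restate the issue:

#include <stdio.h>

enum slot_owner { OWNER_PTE, OWNER_VSWAP };

/* Toy per-physical-slot metadata: once a legacy swapfile and a
 * virtual swap device coexist, each slot must record which kind
 * of owner references it, and freeing depends on that. */
struct phys_slot_meta {
	enum slot_owner owner;
	unsigned long owner_id;	/* swp entry value or virtual slot id */
};

static void free_slot(const struct phys_slot_meta *s)
{
	if (s->owner == OWNER_PTE)
		/* legacy path: only a swapoff-style page table walk
		 * can find and clear the referencing PTEs */
		printf("slot owner %lu: page table walk needed\n",
		       s->owner_id);
	else
		/* virtual path: clear one indirection entry */
		printf("slot owner %lu: virtual table update only\n",
		       s->owner_id);
}

int main(void)
{
	struct phys_slot_meta a = { OWNER_PTE, 42 };
	struct phys_slot_meta b = { OWNER_VSWAP, 7 };

	free_slot(&a);
	free_slot(&b);
	return 0;
}
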
I also appreciate your thoughts on keeping the virtual and
physical swap paths separate for now.
Thanks for sharing your perspective; it was helpful for
understanding the design direction.
Best regards,
YoungJun Park