Message-ID: <aD29WJLms0CJVh8A@yjaykim-PowerEdge-T330>
Date: Tue, 3 Jun 2025 00:03:52 +0900
From: YoungJun Park <youngjun.park@....com>
To: Nhat Pham <nphamcs@...il.com>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org, hannes@...xchg.org,
	hughd@...gle.com, yosry.ahmed@...ux.dev, mhocko@...nel.org,
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
	muchun.song@...ux.dev, len.brown@...el.com,
	chengming.zhou@...ux.dev, kasong@...cent.com, chrisl@...nel.org,
	huang.ying.caritas@...il.com, ryan.roberts@....com,
	viro@...iv.linux.org.uk, baohua@...nel.org, osalvador@...e.de,
	lorenzo.stoakes@...cle.com, christophe.leroy@...roup.eu,
	pavel@...nel.org, kernel-team@...a.com,
	linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
	linux-pm@...r.kernel.org, peterx@...hat.com, gunho.lee@....com,
	taejoon.song@....com, iamjoonsoo.kim@....com
Subject: Re: [RFC PATCH v2 00/18] Virtual Swap Space

On Sun, Jun 01, 2025 at 02:08:22PM -0700, Nhat Pham wrote:
> On Sun, Jun 1, 2025 at 5:56 AM YoungJun Park <youngjun.park@....com> wrote:
> >
> > On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > > On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@....com> wrote:
> > > >
> > > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > > Changelog:
> > > > > * v2:
> > > > >       * Use a single atomic type (swap_refs) for reference counting
> > > > >         purpose. This brings the size of the swap descriptor from 64
> > > > >         bytes down to 48 bytes (25% reduction). Suggested by Yosry Ahmed.
> > > > >       * Zeromap bitmap is removed in the virtual swap implementation.
> > > > >         This saves one bit per physical swapfile slot.
> > > > >       * Rearrange the patches and the code change to make things more
> > > > >         reviewable. Suggested by Johannes Weiner.
> > > > >       * Update the cover letter a bit.
> > > >
> > > > Hi Nhat,
> > > >
> > > > Thank you for sharing this patch series.
> > > > I’ve read through it with great interest.
> > > >
> > > > I’m part of a kernel team working on features related to multi-tier swapping,
> > > > and this patch set appears quite relevant
> > > > to our ongoing discussions and early-stage implementation.
> > >
> > > May I ask - what's the use case you're thinking of here? Remote swapping?
> > >
> >
> > Yes, that's correct.
> > Our usage scenario includes remote swap, and we're experimenting with
> > assigning swap tiers per cgroup to improve performance in specific
> > scenarios on our target devices.
> 
> Hmm, that can be a start. Right now, we have only 2 swap tiers
> essentially, so memory.(z)swap.max and memory.zswap.writeback is
> usually sufficient to describe the tiering interface. But if you have
> an alternative use case in mind, feel free to send an RFC to explore
> this!
>

Yes, sounds good.
I've organized the details of our swap tiering approach, including the
specific use case we are trying to solve. The approach builds on the
existing priority mechanism in the swap subsystem, and I'll share it as
an RFC shortly.
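
To give a rough feel for the direction ahead of the RFC, here is a tiny
user-space sketch, not kernel code, of how a per-cgroup restriction could
sit on top of the existing swap priority ordering. All names below
(swap_dev, cgroup_swap_policy, pick_swap_dev) are made up for illustration
and say nothing about the eventual interface:

#include <stdio.h>
#include <stdbool.h>

struct swap_dev {
	const char *name;
	int prio;		/* higher value = preferred, as with swapon -p */
	bool full;
};

struct cgroup_swap_policy {
	int min_prio;		/* lowest-priority tier this cgroup may use */
};

/* Walk devices in priority order and return the best one the cgroup allows. */
static const struct swap_dev *pick_swap_dev(const struct swap_dev *devs,
					    int ndevs,
					    const struct cgroup_swap_policy *pol)
{
	const struct swap_dev *best = NULL;
	int i;

	for (i = 0; i < ndevs; i++) {
		if (devs[i].full || devs[i].prio < pol->min_prio)
			continue;
		if (!best || devs[i].prio > best->prio)
			best = &devs[i];
	}
	return best;
}

int main(void)
{
	struct swap_dev devs[] = {
		{ "zswap",       100, true  },	/* compressed pool is full */
		{ "local-ssd",    50, true  },	/* local swapfile is full  */
		{ "remote-swap",  10, false },	/* remote tier has room    */
	};
	struct cgroup_swap_policy latency_sensitive = { .min_prio = 50 };
	struct cgroup_swap_policy background        = { .min_prio = 0  };
	const struct swap_dev *d;

	d = pick_swap_dev(devs, 3, &latency_sensitive);
	printf("latency-sensitive cgroup -> %s\n", d ? d->name : "no allowed tier");
	d = pick_swap_dev(devs, 3, &background);
	printf("background cgroup        -> %s\n", d ? d->name : "no allowed tier");
	return 0;
}

The point is only that allocation still walks devices in priority order,
but each cgroup may be limited to a subset of that order; here the
latency-sensitive cgroup refuses to fall back to the remote tier when the
faster tiers are full.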
 
> >
> > We’ve explored several approaches and PoCs around this,
> > and in the process of evaluating
> > whether our direction could eventually be aligned
> > with the upstream kernel,
> > I came across your patchset and wanted to ask whether
> > similar efforts have been discussed or attempted before.
> 
> I think it is occasionally touched upon in discussion, but AFAICS
> there has not really been an actual upstream patch to add such an
> interface.
> 
> >
> > > >
> > > > I had a couple of questions regarding the future direction.
> > > >
> > > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > > >   transferring (promotion/demotion) of pages across tiers (see [8] and
> > > > >   [9]). Similar to swapoff, with the old design we would need to
> > > > >   perform the expensive page table walk.
> > > >
> > > > Based on the discussion in [5], it seems there was some exploration
> > > > around enabling per-cgroup selection of multiple tiers.
> > > > Do you envision the current design evolving in a similar direction
> > > > to those past discussions, or is there a different direction you're aiming for?
> > >
> > > IIRC, that past design focused on the interface aspect of the problem,
> > > but never actually touched the mechanism to implement a multi-tier
> > > swapping solution.
> > >
> > > The simple reason is it's impossible, or at least highly inefficient
> > > to do it in the current design, i.e. without virtualizing swap. Storing
> >
> > As you pointed out, there are certainly inefficiencies in supporting
> > this use case with the current design. But if there is a valid use
> > case, I believe there is room to support it in the current model,
> > possibly in a less optimized form, until a virtual swap device
> > becomes available and provides a more efficient solution.
> > What do you think?
> 
> Which less optimized form are you thinking of?
>

I was only noting that the current swap design would be less optimized
regardless of the form of tiering applied, not that my approach itself
is less optimized. That may have come across differently than I
intended, so please disregard that assumption; I think it is more
appropriate to evaluate this based on the RFC I plan to share soon.
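
(As an aside, for readers following along: the cost asymmetry described
in the quoted explanation below can be sketched with a toy model. This is
purely illustrative user-space code, not kernel code, and all names are
invented.)

#include <stdio.h>

#define NPTES 8	/* pretend 8 page table entries reference one swapped-out page */

struct location { int dev; long off; };	/* (swap device, slot offset) */

int main(void)
{
	int i, pte_rewrites = 0;

	/* Current scheme: each PTE encodes the physical location directly,
	 * so moving the page to another tier means walking and rewriting
	 * every PTE (essentially what swapoff already has to do). */
	struct location ptes[NPTES];
	for (i = 0; i < NPTES; i++)
		ptes[i] = (struct location){ .dev = 0, .off = 42 };
	for (i = 0; i < NPTES; i++) {
		if (ptes[i].dev == 0 && ptes[i].off == 42) {
			ptes[i] = (struct location){ .dev = 1, .off = 7 };
			pte_rewrites++;
		}
	}

	/* Virtual scheme: PTEs hold a stable handle; a tier transfer only
	 * updates the single handle-to-location entry. */
	int vptes[NPTES];
	struct location table[1] = { { .dev = 0, .off = 42 } };
	for (i = 0; i < NPTES; i++)
		vptes[i] = 0;			/* handle 0, never rewritten */
	table[0] = (struct location){ .dev = 1, .off = 7 };

	printf("physical scheme: %d PTE rewrites; virtual scheme: 1 update,"
	       " PTEs still hold handle %d -> dev %d\n",
	       pte_rewrites, vptes[0], table[0].dev);
	return 0;
}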
 
> >
> > > the physical swap location in PTEs means that changing the swap
> > > backend requires a full page table walk to update all the PTEs that
> > > refer to the old physical swap location. So you have to pick your
> > > poison - either:
> > > 1. Pick your backend at swap out time, and never change it. You might
> > > not have sufficient information to decide at that time. It prevents
> > > you from adapting to the change in workload dynamics and working set -
> > > the access frequency of pages might change, so their physical location
> > > should change accordingly.
> > >
> > > 2. Reserve the space in every tier, and associate them with the same
> > > handle. This is kinda what zswap is doing. It is space inefficient, and
> > > creates a lot of operational issues in production.
> > >
> > > 3. Bite the bullet and perform the page table walk. This is what
> > > swapoff is doing, basically. Raise your hands if you're excited about
> > > a full page table walk every time you want to evict a page from zswap
> > > to disk swap. Booo.
> > >
> > > This new design will give us an efficient way to perform tier transfer
> > > - you need to figure out how to obtain the right to perform the
> > > transfer (for now, through the swap cache - but you can perhaps
> > > envision some sort of locks), and then you can simply make the change
> > > at the virtual layer.
> > >
> >
> > One idea that comes to mind is whether the backend swap tier for
> > a page could be adjusted lazily at runtime, either reactively or via
> > an explicit interface, rather than immediately when the tier
> > configuration changes. Alternatively, if it's preferable to leave
> > pages untouched when the tier configuration changes at runtime,
> > perhaps we could consider making this behavior configurable as well.
> >
> 
> I don't quite understand - could you expand on this?
>

Regarding your point, my understanding was that you were referring to
an immediate migration once a new swap tier is selected at runtime.
I was asking whether a lazy migration approach, or even skipping
migration altogether, might be worth considering as alternatives.
I only mentioned it because, from the perspective of our use case,
immediate migration is not strictly necessary.
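
As a strawman of what "lazy" could mean here, consider the following toy
model (user-space illustration only; the names reconcile_on_access,
cgroup_policy, etc. are invented and say nothing about any eventual
implementation): the policy change is recorded immediately, but each swap
entry is only reconciled the next time it is touched, e.g. at swap-in.

#include <stdio.h>

enum tier { TIER_ZSWAP, TIER_SSD, TIER_REMOTE };

struct swap_entry {
	enum tier tier;		/* where the page currently lives */
};

struct cgroup_policy {
	enum tier preferred;	/* updated as soon as the knob changes */
};

/* Called the next time the entry is used, not when the policy changes. */
static void reconcile_on_access(struct swap_entry *e,
				const struct cgroup_policy *pol)
{
	if (e->tier != pol->preferred) {
		printf("lazy migrate: tier %d -> %d\n", e->tier, pol->preferred);
		e->tier = pol->preferred;
	}
}

int main(void)
{
	struct swap_entry cold_page = { .tier = TIER_REMOTE };
	struct cgroup_policy pol = { .preferred = TIER_REMOTE };

	/* Admin retargets the cgroup at runtime: nothing is migrated yet. */
	pol.preferred = TIER_SSD;

	/* The entry is only moved when it is next touched (e.g. swap-in). */
	reconcile_on_access(&cold_page, &pol);
	return 0;
}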

Best regards,
YoungJun Park
