Message-ID: <CAMgjq7DGMS5A4t6nOQmwyLy5Px96aoejBkiwFHgy9uMk-F8Y-w@mail.gmail.com>
Date: Mon, 2 Jun 2025 00:14:53 +0800
From: Kairui Song <ryncsn@...il.com>
To: YoungJun Park <youngjun.park@....com>
Cc: Nhat Pham <nphamcs@...il.com>, linux-mm@...ck.org, akpm@...ux-foundation.org, 
	hannes@...xchg.org, hughd@...gle.com, yosry.ahmed@...ux.dev, 
	mhocko@...nel.org, roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, 
	muchun.song@...ux.dev, len.brown@...el.com, chengming.zhou@...ux.dev, 
	chrisl@...nel.org, huang.ying.caritas@...il.com, ryan.roberts@....com, 
	viro@...iv.linux.org.uk, baohua@...nel.org, osalvador@...e.de, 
	lorenzo.stoakes@...cle.com, christophe.leroy@...roup.eu, pavel@...nel.org, 
	kernel-team@...a.com, linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, 
	linux-pm@...r.kernel.org, peterx@...hat.com, gunho.lee@....com, 
	taejoon.song@....com, iamjoonsoo.kim@....com
Subject: Re: [RFC PATCH v2 00/18] Virtual Swap Space

On Sun, Jun 1, 2025 at 8:56 PM YoungJun Park <youngjun.park@....com> wrote:
>
> On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@....com> wrote:
> > >
> > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > Changelog:
> > > > * v2:
> > > >       * Use a single atomic type (swap_refs) for reference counting
> > > >         purposes. This brings the size of the swap descriptor from 64
> > > >         bytes down to 48 bytes (25% reduction). Suggested by Yosry Ahmed.
> > > >       * Zeromap bitmap is removed in the virtual swap implementation.
> > > >         This saves one bit per physical swapfile slot.
> > > >       * Rearrange the patches and the code change to make things more
> > > >         reviewable. Suggested by Johannes Weiner.
> > > >       * Update the cover letter a bit.
> > >
> > > Hi Nhat,
> > >
> > > Thank you for sharing this patch series.
> > > I’ve read through it with great interest.
> > >
> > > I’m part of a kernel team working on features related to multi-tier swapping,
> > > and this patch set appears quite relevant
> > > to our ongoing discussions and early-stage implementation.
> >
> > May I ask - what's the use case you're thinking of here? Remote swapping?
> >
>
> Yes, that's correct.
> Our usage scenario includes remote swap,
> and we're experimenting with assigning swap tiers per cgroup
> in order to improve performance in specific scenarios on our target device.
>
> We’ve explored several approaches and PoCs around this,
> and in the process of evaluating
> whether our direction could eventually be aligned
> with the upstream kernel,
> I came across your patchset and wanted to ask whether
> similar efforts have been discussed or attempted before.
>
> > >
> > > I had a couple of questions regarding the future direction.
> > >
> > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > >   transferring (promotion/demotion) of pages across tiers (see [8] and
> > > >   [9]). Similar to swapoff, with the old design we would need to
> > > >   perform the expensive page table walk.
> > >
> > > Based on the discussion in [5], it seems there was some exploration
> > > around enabling per-cgroup selection of multiple tiers.
> > > Do you envision the current design evolving in a similar direction
> > > to those past discussions, or is there a different direction you're aiming for?
> >
> > IIRC, that past design focused on the interface aspect of the problem,
> > but never actually touched the mechanism to implement a multi-tier
> > swapping solution.
> >
> > The simple reason is that it's impossible, or at least highly inefficient,
> > to do it in the current design, i.e. without virtualizing swap. Storing
>
> As you pointed out, there are certainly inefficiencies
> in supporting this use case with the current design,
> but if there is a valid use case,
> I believe there's room for it to be supported in the current model,
> possibly in a less optimized form,
> until a virtual swap device becomes available
> and provides a more efficient solution.
> What do you think?

Hi All,

I'd like to share some info from my side. We currently have an
internal solution for multi-tier swap, implemented on top of ZRAM and
its writeback support: four compression levels plus multiple
block-layer levels. The ZRAM table serves a similar role to the swap
table in the "swap table series" or the virtual layer here.

We hacked the BIO layer to make ZRAM cgroup-aware, so it even
supports per-cgroup priority and per-cgroup writeback control, and it
has worked perfectly fine in production.

The interface looks something like this:
/sys/fs/cgroup/cg1/zram.prio:           [1-4]
/sys/fs/cgroup/cg1/zram.writeback_prio: [1-4]
/sys/fs/cgroup/cg1/zram.writeback_size: [0-4K]
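
Underneath these are just ordinary cgroup files. Roughly how such
knobs can be wired up (a sketch only: the handler names and where the
value is stored are made up, not our actual code):

#include <linux/cgroup.h>
#include <linux/errno.h>

/* Illustrative only: the real hack keeps these values in its own
 * per-cgroup state. */
static u64 zcg_prio_read(struct cgroup_subsys_state *css,
			 struct cftype *cft)
{
	return 1;	/* placeholder: return the stored priority */
}

static int zcg_prio_write(struct cgroup_subsys_state *css,
			  struct cftype *cft, u64 val)
{
	if (val < 1 || val > 4)
		return -EINVAL;
	/* placeholder: store val in the per-cgroup state */
	return 0;
}

static struct cftype zram_cgroup_files[] = {
	{
		.name = "zram.prio",
		.read_u64 = zcg_prio_read,
		.write_u64 = zcg_prio_write,
	},
	/* zram.writeback_prio and zram.writeback_size are wired up the
	 * same way */
	{ }	/* terminator */
};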

It's really nothing fancy or complex: the four priorities are simply
the four ZRAM compression streams already in upstream, and you can
simply hardcode four *bdev pointers in "struct zram", reuse the
existing bits, then chain the write bio with a new lower-level bio...
Getting the priority info of a cgroup is even simpler once ZRAM is
cgroup-aware.
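
For a rough feel of the write path, it's something like the following
(a simplified sketch, not our actual code: zram_cgroup_prio() and the
wb_bdev[] field are made-up names standing in for the cgroup-aware
lookup and the four hardcoded backing devices):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/mm.h>

struct zram {
	/* ... existing fields ... */
	struct block_device *wb_bdev[4];	/* one backing device per tier */
};

/* made-up helper: would return the zram.prio value (1..4) configured
 * for the cgroup that owns @bio */
static int zram_cgroup_prio(struct bio *bio)
{
	return 1;	/* placeholder */
}

static void zram_chain_writeback(struct zram *zram, struct bio *parent,
				 struct page *page, sector_t sector)
{
	int prio = zram_cgroup_prio(parent);
	struct bio *bio;

	/* pick the backing device for this cgroup's tier */
	bio = bio_alloc(zram->wb_bdev[prio - 1], 1, REQ_OP_WRITE, GFP_NOIO);
	bio->bi_iter.bi_sector = sector;
	__bio_add_page(bio, page, PAGE_SIZE, 0);

	/* chain the lower-level write under the original write bio */
	bio_chain(bio, parent);
	submit_bio(bio);
}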

All interfaces can be adjusted dynamically at any time (e.g. by an
agent), and pages that have already been swapped out won't be
touched. The backing block devices are specified through ZRAM's sysfs
files at swapon time.

It's easy to implement, but not a good idea for upstream at all: it
adds redundant layers, and performance is bad unless it's optimized:
- It breaks SWP_SYNCHRONOUS_IO, causing a huge slowdown, so we removed
the SWP_SYNCHRONOUS_IO handling completely, which actually improved
performance in every aspect (I've been trying to upstream this for a
while);
- ZRAM's block device allocator is just not good (just a bitmap), so
we want to use the SWAP allocator directly (which I'm also trying to
upstream with the swap table series);
- Many other bits and pieces are kind of broken, e.g. bio batching,
busy looping on the ZRAM_WB bit, etc.;
- It lacks support for things like effective migration/compaction;
doable, but it looks horrible.

So I definitely don't like this band-aid solution, but hey, it works.
I'm looking forward to replacing it with native upstream support.
That's one of the motivations behind the swap table series, which I
think would resolve these problems in an elegant and clean way
upstream. Initial tests do show it has much lower overhead and cleans
up SWAP.

But maybe this is kind of similar to the "less optimized form" you
are talking about? As I mentioned, I'm already trying to upstream
some of the nicer parts of it, and I hope to eventually replace it
with a proper upstream solution.

I can try to upstream other parts of it if people are really
interested, but I strongly recommend that we focus on the right
approach instead, rather than waste time on that and spam the
mailing list.

I have no strong preference on how the final upstream interface
should look. But SWAP devices already have priorities, so maybe we
should just make use of those.
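
Just to illustrate that last point (not a concrete proposal; the
helper and the threshold are made up): a per-cgroup tier setting
could be as small as a minimum device priority that the allocator
checks while walking the priority-ordered device list:

#include <linux/swap.h>

/* made-up helper: a per-cgroup "tier" is simply a minimum swap device
 * priority; devices below the threshold are skipped for that cgroup
 * during allocation */
static bool swap_dev_allowed(struct swap_info_struct *si,
			     int cgroup_min_prio)
{
	return si->prio >= cgroup_min_prio;
}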
