Message-ID: <aD3Apu+sz9Na+65B@yjaykim-PowerEdge-T330>
Date: Tue, 3 Jun 2025 00:17:58 +0900
From: YoungJun Park <youngjun.park@....com>
To: Kairui Song <ryncsn@...il.com>
Cc: Nhat Pham <nphamcs@...il.com>, linux-mm@...ck.org,
	akpm@...ux-foundation.org, hannes@...xchg.org, hughd@...gle.com,
	yosry.ahmed@...ux.dev, mhocko@...nel.org, roman.gushchin@...ux.dev,
	shakeel.butt@...ux.dev, muchun.song@...ux.dev, len.brown@...el.com,
	chengming.zhou@...ux.dev, chrisl@...nel.org,
	huang.ying.caritas@...il.com, ryan.roberts@....com,
	viro@...iv.linux.org.uk, baohua@...nel.org, osalvador@...e.de,
	lorenzo.stoakes@...cle.com, christophe.leroy@...roup.eu,
	pavel@...nel.org, kernel-team@...a.com,
	linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
	linux-pm@...r.kernel.org, peterx@...hat.com, gunho.lee@....com,
	taejoon.song@....com, iamjoonsoo.kim@....com
Subject: Re: [RFC PATCH v2 00/18] Virtual Swap Space

On Mon, Jun 02, 2025 at 12:14:53AM +0800, Kairui Song wrote:
> On Sun, Jun 1, 2025 at 8:56 PM YoungJun Park <youngjun.park@....com> wrote:
> >
> > On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > > On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@....com> wrote:
> > > >
> > > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > > Changelog:
> > > > > * v2:
> > > > >       * Use a single atomic type (swap_refs) for reference counting
> > > > >         purpose. This brings the size of the swap descriptor from 64 bytes
> > > > >         down to 48 bytes (25% reduction). Suggested by Yosry Ahmed.
> > > > >       * Zeromap bitmap is removed in the virtual swap implementation.
> > > > >         This saves one bit per physical swapfile slot.
> > > > >       * Rearrange the patches and the code change to make things more
> > > > >         reviewable. Suggested by Johannes Weiner.
> > > > >       * Update the cover letter a bit.
> > > >
> > > > Hi Nhat,
> > > >
> > > > Thank you for sharing this patch series.
> > > > I’ve read through it with great interest.
> > > >
> > > > I’m part of a kernel team working on features related to multi-tier swapping,
> > > > and this patch set appears quite relevant
> > > > to our ongoing discussions and early-stage implementation.
> > >
> > > May I ask - what's the use case you're thinking of here? Remote swapping?
> > >
> >
> > Yes, that's correct.
> > Our usage scenario includes remote swap,
> > and we're experimenting with assigning swap tiers per cgroup
> > in order to improve performance in specific scenarios on our target devices.
> >
> > We’ve explored several approaches and PoCs around this,
> > and in the process of evaluating
> > whether our direction could eventually be aligned
> > with the upstream kernel,
> > I came across your patchset and wanted to ask whether
> > similar efforts have been discussed or attempted before.
> >
> > > >
> > > > I had a couple of questions regarding the future direction.
> > > >
> > > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > > >   transferring (promotion/demotion) of pages across tiers (see [8] and
> > > > >   [9]). Similar to swapoff, with the old design we would need to
> > > > >   perform the expensive page table walk.
> > > >
> > > > Based on the discussion in [5], it seems there was some exploration
> > > > around enabling per-cgroup selection of multiple tiers.
> > > > Do you envision the current design evolving in a similar direction
> > > > to those past discussions, or is there a different direction you're aiming for?
> > >
> > > IIRC, that past design focused on the interface aspect of the problem,
> > > but never actually touched the mechanism to implement a multi-tier
> > > swapping solution.
> > >
> > > The simple reason is it's impossible, or at least highly inefficient
> > > to do it in the current design, i.e. without virtualizing swap. Storing
> >
> > As you pointed out, there are certainly inefficiencies
> > in supporting this use case with the current design,
> > but if there is a valid use case,
> > I believe there's room for it to be supported in the current model
> > (possibly in a less optimized form)
> > until a virtual swap device becomes available
> > and provides a more efficient solution.
> > What do you think?
> 
> Hi All,
> 
> I'd like to share some info from my side. Currently we have an
> internal solution for multi-tier swap, implemented based on ZRAM and
> writeback: four compression levels and multiple block-layer levels. The
> ZRAM table serves a similar role to the swap table in the "swap table
> series" or the virtual layer here.
> 
> We hacked the BIO layer to let ZRAM be cgroup aware, so it even
> supports per-cgroup priority and per-cgroup writeback control, and it
> has worked perfectly fine in production.
> 
> The interface looks something like this:
> /sys/fs/cgroup/cg1/zram.prio: [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_prio [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_size [0 - 4K]
> 
> It's really nothing fancy or complex: the four priorities are simply the
> four ZRAM compression streams that are already upstream, and you can
> simply hardcode four *bdev in "struct zram" and reuse the bits, then
> chain the write bio with a new underlying bio... Getting the priority
> info of a cgroup is even simpler once ZRAM is cgroup aware.
> 
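Just to check my own understanding of the cgroup-aware lookup you
describe, I imagine it is roughly something like the sketch below; the
per-cgroup field and the helper name are my guesses for illustration,
not your actual code.

/*
 * Illustrative sketch only: how a cgroup-aware ZRAM might pick one of
 * the four compression streams. "zram_prio" on struct blkcg is a
 * hypothetical field backing /sys/fs/cgroup/<cg>/zram.prio.
 */
static int zram_cgroup_prio(struct bio *bio)
{
	struct blkcg_gq *blkg = bio->bi_blkg;	/* set once ZRAM is cgroup aware */

	if (!blkg)
		return 0;			/* fall back to the default stream */

	/* map the cgroup's [1-4] setting to a stream index [0-3] */
	return READ_ONCE(blkg->blkcg->zram_prio) - 1;
}
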
> All interfaces can be adjusted dynamically at any time (e.g. by an
> agent), and already swapped-out pages won't be touched. The block
> devices are specified in ZRAM's sysfs files during swapon.
> 
> It's easy to implement, but not a good idea for upstream at all:
> redundant layers, and performance is bad (if not optimized):
> - it breaks SYNCHRONIZE_IO, causing a huge slowdown, so we removed the
> SYNCHRONIZE_IO completely, which actually improved the performance in
> every aspect (I've been trying to upstream this for a while);
> - ZRAM's block device allocator is just not good (just a bitmap) so we
> want to use the SWAP allocator directly (which I'm also trying to
> upstream with the swap table series);
> - And many other bits and pieces, like bio batching, are kind of broken,
> busy looping due to the ZRAM_WB bit, etc...
> - It lacks support for things like effective migration/compaction;
> doable, but it looks horrible.
> 

That's interesting. We've explored a similar idea as well,
although not by attaching it to ZRAM.
Instead, our concept involved creating a separate block device
that performs the tiering itself, used as follows
(a rough sketch follows the list):

1. Prepare a block device that can manage multiple backend block devices.
2. Perform swapon on this block device.
3. Within the block device, use cgroup awareness to carry out tiered
   swap operations across the prepared backend devices.

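To make step 3 above a bit more concrete, the routing we had in mind
looks roughly like the sketch below. struct tier_dev, MAX_TIERS and
tier_of_blkcg() are made-up names for illustration; only bio_set_dev()
and submit_bio() are the usual block-layer APIs.

#define MAX_TIERS	4	/* arbitrary number of backend tiers */

struct tier_dev {
	struct block_device *backend[MAX_TIERS];	/* per-tier backends */
};

/*
 * Sketch of the stacked device's submit path: look at which cgroup the
 * bio belongs to, pick the backend configured for that cgroup's tier,
 * and re-submit the bio to it. tier_of_blkcg() is a hypothetical helper
 * that maps the bio's cgroup to a tier index.
 */
static void swap_tier_submit_bio(struct bio *bio)
{
	struct tier_dev *td = bio->bi_bdev->bd_disk->private_data;
	int tier = tier_of_blkcg(bio->bi_blkg);

	bio_set_dev(bio, td->backend[tier]);	/* redirect to the chosen backend */
	submit_bio(bio);
}
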
However, we ended up deferring this approach, keeping it only as a
secondary option, mainly due to the following concerns:

1. Allocating physical slots but then managing them internally as
   logical slots felt inefficient.
2. Embedding cgroup awareness within a block device seemed like a
   layering violation.

> So I definitely don't like this band-aid solution, but hey, it works.
> I'm looking forward to replacing it with native upstream support.
> That's one of the motivations behind the swap table series, which
> I think would resolve these problems in an elegant and clean way
> upstream. The initial tests do show it has much lower overhead
> and cleans up SWAP.
> But maybe this is kind of similar to the "less optimized form" you
> are talking about? As I mentioned, I'm already trying to upstream
> some of the nicer parts of it, and hope to finally replace it with an
> upstream solution.
> 
> I can try to upstream other parts of it if people are really
> interested, but I strongly recommend that we focus on the
> right approach instead and not waste time on that or spam the
> mailing list.

I am in agreement with the points you’ve made.
 
> I have no special preference on how the final upstream interface
> should look. But currently SWAP devices already have priorities,
> so maybe we should just make use of that.

I have been exploring an interface design 
that leverages the existing swap priority mechanism,
and I believe it would be valuable 
to share this for further discussion and feedback.
As mentioned in my earlier response to Nhat,
I intend to submit this as an RFC to solicit broader input from the community. 

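(For context, the per-device priority I mean here is the one that can
already be set at swapon time. A minimal userspace example, assuming
/dev/zram0 is an existing, not-yet-enabled swap device and the caller
has the necessary privileges:)

#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
	int prio = 10;	/* higher-priority swap devices are used first */
	int flags = SWAP_FLAG_PREFER |
		    ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);

	/* /dev/zram0 is just an example device */
	if (swapon("/dev/zram0", flags) != 0) {
		perror("swapon");
		return 1;
	}
	return 0;
}
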
Best regards,
YoungJun Park
