Message-ID: <CAMgjq7D4gcOih3235DRBEOv4EaaV3YEKc6w2Ab-wTCgb7=sA6w@mail.gmail.com>
Date: Tue, 3 Jun 2025 17:50:01 +0800
From: Kairui Song <ryncsn@...il.com>
To: Nhat Pham <nphamcs@...il.com>
Cc: YoungJun Park <youngjun.park@....com>, linux-mm@...ck.org, akpm@...ux-foundation.org,
hannes@...xchg.org, hughd@...gle.com, yosry.ahmed@...ux.dev,
mhocko@...nel.org, roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
muchun.song@...ux.dev, len.brown@...el.com, chengming.zhou@...ux.dev,
chrisl@...nel.org, huang.ying.caritas@...il.com, ryan.roberts@....com,
viro@...iv.linux.org.uk, baohua@...nel.org, osalvador@...e.de,
lorenzo.stoakes@...cle.com, christophe.leroy@...roup.eu, pavel@...nel.org,
kernel-team@...a.com, linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
linux-pm@...r.kernel.org, peterx@...hat.com, gunho.lee@....com,
taejoon.song@....com, iamjoonsoo.kim@....com
Subject: Re: [RFC PATCH v2 00/18] Virtual Swap Space
On Tue, Jun 3, 2025 at 2:30 AM Nhat Pham <nphamcs@...il.com> wrote:
>
> On Sun, Jun 1, 2025 at 9:15 AM Kairui Song <ryncsn@...il.com> wrote:
> >
> >
> > Hi All,
>
> Thanks for sharing your setup, Kairui! I've always been curious about
> multi-tier compression swapping.
>
> >
> > I'd like to share some info from my side. We currently have an
> > internal solution for multi-tier swap, implemented on top of ZRAM and
> > its writeback support: four compression levels and multiple
> > block-layer levels. The ZRAM table serves a similar role to the swap
> > table in the "swap table series" or the virtual layer here.
> >
> > We hacked the BIO layer to make ZRAM cgroup-aware, so it even
>
> Hmmm this part seems a bit hacky to me too :-?
Yeah, terribly hackish :P
One of the reasons why I'm trying to retire it.
>
> > supports per-cgroup priority and per-cgroup writeback control, and it
> > has worked perfectly fine in production.
> >
> > The interface looks something like this:
> > /sys/fs/cgroup/cg1/zram.prio: [1-4]
> > /sys/fs/cgroup/cg1/zram.writeback_prio: [1-4]
> > /sys/fs/cgroup/cg1/zram.writeback_size: [0-4K]
>
> How do you do aging with multiple tiers like this? Or do you just rely
> on time thresholds, and have userspace invoke writeback in a cron
> job-style fashion?
ZRAM already has a time threshold, and I added another LRU for
swapped-out entries; aging is supposed to be done by userspace agents.
I didn't go into detail here as it's becoming less relevant to the
upstream implementation.
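To give a rough idea of the extra LRU (an illustrative sketch only; the
structure and helper names below are made up, not the actual internal
code): every swapped-out entry is appended to a list in store order, so
the userspace-triggered writeback pass can pop the oldest entries first.

#include <linux/list.h>
#include <linux/spinlock.h>

/* Illustrative only: track swapped-out entries in store order so a
 * userspace-triggered writeback pass can age them oldest-first. */
struct wb_lru {
        spinlock_t lock;
        struct list_head list;          /* oldest entries at the head */
};

struct wb_entry {
        struct list_head lru;           /* linked on wb_lru.list */
        unsigned long index;            /* the zram slot it belongs to */
};

static void wb_lru_add(struct wb_lru *lru, struct wb_entry *e)
{
        spin_lock(&lru->lock);
        list_add_tail(&e->lru, &lru->list);
        spin_unlock(&lru->lock);
}

static struct wb_entry *wb_lru_pop_oldest(struct wb_lru *lru)
{
        struct wb_entry *e;

        spin_lock(&lru->lock);
        e = list_first_entry_or_null(&lru->list, struct wb_entry, lru);
        if (e)
                list_del_init(&e->lru);
        spin_unlock(&lru->lock);
        return e;
}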
> Tbh, I'm surprised that we see a performance win with recompression. I
> understand that different workloads might benefit the most from
> different points in the Pareto frontier of latency-memory saving:
> latency-sensitive workloads might like a fast compression algorithm,
> whereas other workloads might prefer a compression algorithm that
> saves more memory. So a per-cgroup compressor selection can make
> sense.
>
> However, would the overhead of moving a page from one tier to the
> other not eat up all the benefit from the (usually small) extra memory
> savings?
So far we are not re-compressing things, but per-cgroup compression /
writeback levels are indeed useful. Compressed memory gets written back
to the block device, and that's a large gain.
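Conceptually the per-cgroup compression level just selects one of the
existing zram compression streams, something along these lines (a sketch
only, assuming the zram_drv.h context; zram_cgroup_prio() is a made-up
stand-in for however the cgroup info is plumbed down through the bio
layer):

/* Sketch: map a per-cgroup priority onto one of zram's existing
 * compression streams.  comps[] / ZRAM_MAX_COMPS are the multi-stream
 * support already upstream; zram_cgroup_prio() is hypothetical. */
static struct zcomp *zram_pick_comp(struct zram *zram, struct bio *bio)
{
        int prio = zram_cgroup_prio(bio);       /* hypothetical: cg's zram.prio - 1 */

        if (prio < 0 || prio >= ZRAM_MAX_COMPS || !zram->comps[prio])
                prio = ZRAM_PRIMARY_COMP;       /* fall back to the default stream */

        return zram->comps[prio];
}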
> > It's really nothing fancy or complex: the four priorities are simply
> > the four ZRAM compression streams that already exist upstream, and you
> > can simply hardcode four *bdev pointers in "struct zram" and reuse the
> > bits, then chain the write bio with a new lower-level bio... Getting
> > the priority info of a cgroup is even simpler once ZRAM is cgroup-aware.
> >
> > All interfaces can be adjusted dynamically at any time (e.g. by an
> > agent), and already-swapped-out pages won't be touched. The block
> > devices are specified via ZRAM's sysfs files during swapon.
> >
> > It's easy to implement, but not a good idea for upstream at all:
> > redundant layers, and the performance is bad (if not optimized):
> > - it breaks SYNCHRONIZE_IO, causing a huge slowdown, so we removed
> > SYNCHRONIZE_IO completely, which actually improved performance in
> > every aspect (I've been trying to upstream this for a while);
> > - ZRAM's block device allocator is just not good (it's just a bitmap),
> > so we want to use the SWAP allocator directly (which I'm also trying
> > to upstream with the swap table series);
> > - and many other bits and pieces, like bio batching, are kind of broken,
>
> Interesting, is zram doing writeback batching?
Nope, it even has a comment saying "XXX: A single page IO would be
inefficient for write". We managed to chain bios on the initial page
writeback, but it's still not an ideal design.
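The chaining itself is nothing special; roughly something like this
(sketch only, hypothetical helper name, not the real code): allocate a
bio for the backing device and chain it to the upper bio, so completion
propagates instead of waiting synchronously per page.

#include <linux/bio.h>

/* Rough sketch: write the folio to the backing device with a new bio
 * and chain it to the upper bio, so the upper bio only completes once
 * the lower write has finished.  Hypothetical helper. */
static void zram_chain_writeback(struct block_device *backing_bdev,
                                 struct folio *folio, sector_t sector,
                                 struct bio *upper)
{
        struct bio *bio;

        bio = bio_alloc(backing_bdev, 1, REQ_OP_WRITE, GFP_NOIO);
        bio->bi_iter.bi_sector = sector;
        __bio_add_page(bio, &folio->page, folio_size(folio), 0);

        /* completion of @upper now also waits for @bio */
        bio_chain(bio, upper);
        submit_bio(bio);
}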
> > busy-looping on the ZRAM_WB bit, etc...
>
> Hmmm, this sounds like something swap cache can help with. It's the
> approach zswap writeback is taking - concurrent accessors can get the
> page in the swap cache, and OTOH zswap writeback backs off if it
> detects swap cache contention (since the page is probably being
> swapped in, freed, or written back by another thread).
>
> But I'm not sure how zram writeback works...
Yeah, any bit-lock design suffers a similar problem (like
SWAP_HAS_CACHE). I think we should just use the folio lock or the folio
writeback flag in the long term; they work extremely well as generic
infrastructure (which I'm trying to push upstream), and we don't need
any extra locking, minimizing memory / design overhead.
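On the swap-in side that can be as simple as the sketch below, instead
of spinning on a backend-private bit (hypothetical helper name, just to
illustrate the idea):

#include <linux/pagemap.h>

/* Sketch: a concurrent swap-in path can sleep on the generic folio
 * writeback flag and take the folio lock, instead of busy-looping on a
 * backend-private bit like ZRAM_WB.  Hypothetical helper name. */
static void swapin_wait_for_folio(struct folio *folio)
{
        /* sleeps until any in-flight writeback completes */
        folio_wait_writeback(folio);

        /* the folio lock then serializes against other users */
        folio_lock(folio);
}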
> > - Lacking support for things like effective migration/compaction;
> > it's doable but looks horrible.
> >
> > So I definitely don't like this band-aid solution, but hey, it works.
> > I'm looking forward to replacing it with native upstream support.
> > That's one of the motivations behind the swap table series, which
> > I think would resolve these problems in an elegant and clean way
> > upstream. Initial tests do show it has much lower overhead and
> > cleans up SWAP.
> >
> > But maybe this is kind of similar to the "less optimized form" you
> > are talking about? As I mentioned, I'm already trying to upstream
> > some of the nicer parts of it, and hopefully we can finally replace
> > it with an upstream solution.
> >
> > I can try to upstream other parts of it if people are really
> > interested, but I strongly recommend that we focus on the right
> > approach instead, and not waste time on that or spam the mailing
> > list.
>
> I suppose a lot of this is specific to zram, but bits and pieces of it
> sound upstreamable to me :)
>
> We can wait for YoungJun's patches/RFC for further discussion, but perhaps:
>
> 1. A new cgroup interface to select swap backends for a cgroup.
>
> 2. Writeback/fallback order either designated by the above interface,
> or by the priority of the swap backends.
Fully agree, the final interface and feature set definitely need more
discussion and collaboration upstream...
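For the sake of discussion, the cgroup side of (1) could be as small as
a couple of cftype entries, e.g. (a purely hypothetical sketch; the file
name and semantics are completely up for discussion):

#include <linux/cgroup.h>

/* Purely hypothetical sketch of a per-cgroup swap backend knob on the
 * cgroup v2 side; the callbacks are placeholders. */
static u64 swap_tier_read(struct cgroup_subsys_state *css,
                          struct cftype *cft)
{
        /* return the tier currently configured for this cgroup */
        return 0;       /* placeholder */
}

static int swap_tier_write(struct cgroup_subsys_state *css,
                           struct cftype *cft, u64 val)
{
        /* validate and store the requested tier for this cgroup */
        return 0;       /* placeholder */
}

static struct cftype swap_tier_files[] = {
        {
                .name = "swap.tier",
                .flags = CFTYPE_NOT_ON_ROOT,
                .read_u64 = swap_tier_read,
                .write_u64 = swap_tier_write,
        },
        { }     /* terminate */
};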