Message-ID: <CAKEwX=P4Q6jNQAi+H3sMQ73z-F-rG5jz8jj1NeGgUi=Pem_ZTQ@mail.gmail.com>
Date: Mon, 2 Jun 2025 11:29:53 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Kairui Song <ryncsn@...il.com>
Cc: YoungJun Park <youngjun.park@....com>, linux-mm@...ck.org, akpm@...ux-foundation.org,
hannes@...xchg.org, hughd@...gle.com, yosry.ahmed@...ux.dev,
mhocko@...nel.org, roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
muchun.song@...ux.dev, len.brown@...el.com, chengming.zhou@...ux.dev,
chrisl@...nel.org, huang.ying.caritas@...il.com, ryan.roberts@....com,
viro@...iv.linux.org.uk, baohua@...nel.org, osalvador@...e.de,
lorenzo.stoakes@...cle.com, christophe.leroy@...roup.eu, pavel@...nel.org,
kernel-team@...a.com, linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
linux-pm@...r.kernel.org, peterx@...hat.com, gunho.lee@....com,
taejoon.song@....com, iamjoonsoo.kim@....com
Subject: Re: [RFC PATCH v2 00/18] Virtual Swap Space
On Sun, Jun 1, 2025 at 9:15 AM Kairui Song <ryncsn@...il.com> wrote:
>
>
> Hi All,
Thanks for sharing your setup, Kairui! I've always been curious about
multi-tier compression swapping.
>
> I'd like to share some info from my side. Currently we have an
> internal solution for multi-tier swap, implemented based on ZRAM and
> writeback: 4 compression levels and multiple block-layer levels. The
> ZRAM table serves a similar role to the swap table in the "swap table
> series" or the virtual layer here.
>
> We hacked the BIO layer to let ZRAM be Cgroup aware, so it even
Hmmm this part seems a bit hacky to me too :-?
> supports per-cgroup priority, and per-cgroup writeback control, and it
> worked perfectly fine in production.
>
> The interface looks something like this:
> /sys/fs/cgroup/cg1/zram.prio: [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_prio: [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_size: [0 - 4K]
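(Just to check my reading of these knobs - is the idea that a userspace
agent drives them, roughly like the sketch below? Purely illustrative,
written against the files quoted above; the cgroup name is made up and
none of these files exist upstream.)

/*
 * Minimal userspace sketch, assuming the internal patches that create
 * the per-cgroup zram.* files above.  "background.slice" is a made-up
 * cgroup name; this is not code from those patches.
 */
#include <stdio.h>

static int set_knob(const char *cg, const char *knob, int val)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/%s", cg, knob);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", val);
        return fclose(f);
}

int main(void)
{
        /* latency-tolerant cgroup: densest compressor, small writeback budget */
        set_knob("background.slice", "zram.prio", 4);
        set_knob("background.slice", "zram.writeback_prio", 1);    /* semantics per your patches */
        set_knob("background.slice", "zram.writeback_size", 4096); /* within the [0 - 4K] range above */
        return 0;
}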
How do you do aging with multiple tiers like this? Or do you just rely
on time thresholds, and have userspace invoke writeback in a cron-job
style?
Tbh, I'm surprised that we see a performance win with recompression. I
understand that different workloads might benefit the most from
different points on the latency vs. memory-savings Pareto frontier:
latency-sensitive workloads might like a fast compression algorithm,
whereas other workloads might prefer a compression algorithm that
saves more memory. So a per-cgroup compressor selection can make
sense.
However, would the overhead of moving a page from one tier to the
other not eat up all the benefit from the (usually small) extra memory
savings?
>
> It's really nothing fancy or complex: the four priorities are simply
> the four ZRAM compression streams that are already upstream, and you
> can simply hardcode four *bdev in "struct zram" and reuse the bits,
> then chain the write bio with a new underlying bio... Getting the
> priority info of a cgroup is even simpler once ZRAM is cgroup aware.
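Just to make sure I follow, is the tier selection roughly like the
sketch below? (Hypothetical reading of the "four *bdev" idea, not your
internal code; all names are made up.)

/* Hypothetical sketch only - not the internal implementation. */
#define NR_ZRAM_TIERS   4

struct zram_tiers {
        /* would be folded into struct zram */
        struct block_device *bdev[NR_ZRAM_TIERS];
};

static struct block_device *pick_wb_bdev(struct zram_tiers *t, unsigned int prio)
{
        /* prio follows the 1..4 convention of the per-cgroup knob */
        if (prio < 1 || prio > NR_ZRAM_TIERS)
                prio = 1;
        /*
         * The writeback bio would then be allocated against this bdev
         * and chained to the original write with bio_chain().
         */
        return t->bdev[prio - 1];
}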
>
> All interfaces can be adjusted dynamically at any time (e.g. by an
> agent), and already-swapped-out pages won't be touched. The block
> devices are specified in ZRAM's sysfs files during swapon.
>
> It's easy to implement, but not a good idea for upstream at all:
> redundant layers, and performance is bad (if not optimized):
> - it breaks SYNCHRONIZE_IO, causing a huge slowdown, so we removed
> SYNCHRONIZE_IO completely, which actually improved performance in
> every aspect (I've been trying to upstream this for a while);
> - ZRAM's block device allocator is just not good (it's just a bitmap),
> so we want to use the SWAP allocator directly (which I'm also trying
> to upstream with the swap table series);
> - And many other bits and pieces like bio batching are kind of broken,
Interesting, is zram doing writeback batching?
> busy loops due to the ZRAM_WB bit, etc...
Hmmm, this sounds like something the swap cache can help with. It's the
approach zswap writeback takes - concurrent accessors can get the
page from the swap cache, and OTOH zswap writeback backs off if it
detects swap cache contention (since the page is probably being
swapped in, freed, or written back by another thread).
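Roughly, the zswap flow is shaped like the sketch below (paraphrased
and simplified from memory; add_to_swapcache_or_get() is a placeholder
name, not the real mm/ helper):

/*
 * Simplified paraphrase of zswap's writeback back-off, not the literal
 * mm/zswap.c code.
 */
static int writeback_one(swp_entry_t entry)
{
        bool we_added_it;
        struct folio *folio;

        folio = add_to_swapcache_or_get(entry, &we_added_it); /* placeholder */
        if (!folio)
                return -ENOMEM;

        if (!we_added_it) {
                /*
                 * The entry is already in the swap cache: another thread
                 * is swapping it in, freeing it, or writing it back.
                 * Drop our reference and move on - no ZRAM_WB-style
                 * busy loop needed.
                 */
                folio_put(folio);
                return -EEXIST;
        }

        /* We own the swap cache folio: decompress into it and submit the write. */
        return 0;
}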
But I'm not sure how zram writeback works...
> - Lacking support for things like effective migration/compaction,
> doable but looks horrible.
>
> So I definitely don't like this band-aid solution, but hey, it works.
> I'm looking forward to replacing it with native upstream support.
> That's one of the motivations behind the swap table series, which
> I think would resolve the problems in an elegant and clean way
> upstream. The initial tests do show it has a much lower overhead
> and cleans up SWAP.
>
> But maybe this is kind of similar to the "less optimized form" you
> are talking about? As I mentioned, I'm already trying to upstream
> some nice parts of it, and hopefully it can eventually be replaced
> with an upstream solution.
>
> I can try to upstream other parts of it if people are really
> interested, but I strongly recommend that we focus on the
> right approach instead and not waste time on that or spam the
> mailing list.
I suppose a lot of this is specific to zram, but bits and pieces of it
sound upstreamable to me :)
We can wait for YoungJun's patches/RFC for further discussion, but perhaps:
1. A new cgroup interface to select swap backends for a cgroup.
2. Writeback/fallback order either designated by the above interface,
or by the priority of the swap backends.
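For (1), I'm picturing something shaped like the zram knobs quoted
above, e.g. (the name and format below are made up, just to have
something concrete to poke at):
/sys/fs/cgroup/cg1/memory.swap.tiers: ordered list of backends, e.g. "zswap ssd_swap"
and then (2) could largely fall out of the order listed there.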
>
> I have no special preference on how the final upstream interface
> should look. But currently SWAP devices already have priorities,
> so maybe we should just make use of that.