Message-ID: <aYt-havagZ8kUmov@cmpxchg.org>
Date: Tue, 10 Feb 2026 13:52:53 -0500
From: Johannes Weiner <hannes@...xchg.org>
To: Kairui Song <ryncsn@...il.com>
Cc: Nhat Pham <nphamcs@...il.com>, linux-mm@...ck.org,
akpm@...ux-foundation.org, hughd@...gle.com, yosry.ahmed@...ux.dev,
mhocko@...nel.org, roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
muchun.song@...ux.dev, len.brown@...el.com,
chengming.zhou@...ux.dev, chrisl@...nel.org,
huang.ying.caritas@...il.com, ryan.roberts@....com,
shikemeng@...weicloud.com, viro@...iv.linux.org.uk,
baohua@...nel.org, bhe@...hat.com, osalvador@...e.de,
christophe.leroy@...roup.eu, pavel@...nel.org, kernel-team@...a.com,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
linux-pm@...r.kernel.org, peterx@...hat.com, riel@...riel.com,
joshua.hahnjy@...il.com, npache@...hat.com, gourry@...rry.net,
axelrasmussen@...gle.com, yuanchu@...gle.com, weixugc@...gle.com,
rafael@...nel.org, jannh@...gle.com, pfalcato@...e.de,
zhengqi.arch@...edance.com
Subject: Re: [PATCH v3 00/20] Virtual Swap Space
Hello Kairui,
On Wed, Feb 11, 2026 at 01:59:34AM +0800, Kairui Song wrote:
> On Mon, Feb 9, 2026 at 7:57 AM Nhat Pham <nphamcs@...il.com> wrote:
> > * Reducing the size of the swap descriptor from 48 bytes to 24
> > bytes, i.e another 50% reduction in memory overhead from v2.
>
> Honestly if you keep reducing that you might just end up
> reimplementing the swap table format :)
Yeah, it turns out we need the same data points to describe and track
a swapped-out page ;)
> > * Operationally, static provisioning the swapfile for zswap pose
> > significant challenges, because the sysadmin has to prescribe how
> > much swap is needed a priori, for each combination of
> > (memory size x disk space x workload usage). It is even more
> > complicated when we take into account the variance of memory
> > compression, which changes the reclaim dynamics (and as a result,
> > swap space size requirement). The problem is further exacerbated for
> > users who rely on swap utilization (and exhaustion) as an OOM signal.
>
> So I thought about it again, this one seems not to be an issue. In
> most cases, having a 1:1 virtual swap setup is enough, and very soon
> the static overhead will be really trivial. There won't even be any
> fragmentation issue either, since if the physical memory size is
> identical to swap space, then you can always find a matching part. And
> besides, dynamic growth of swap files is actually very doable and
> useful, that will make physical swap files adjustable at runtime, so
> users won't need to waste a swap type id to extend physical swap
> space.
The issue is address space separation. We don't want things inside the
compressed pool to consume disk space; nor do we want entries that
live on disk to take usable space away from the compressed pool.
The regression reports are fair, thanks for highlighting those. And
whether to make this optional is also a fair discussion.
But some of the number comparisons really strike me as apples to
oranges. They seem to miss the core issue this series is trying to
address.
> > * Another motivation is to simplify swapoff, which is both complicated
> > and expensive in the current design, precisely because we are storing
> > an encoding of the backend positional information in the page table,
> > and thus requires a full page table walk to remove these references.
>
> The swapoff here is not really a clean swapoff, minor faults will
> still be triggered afterwards, and metadata is not released. So this
> new swapoff cannot really guarantee the same performance as the old
> swapoff.
That seems very academic to me. The goal is to relinquish disk space,
and these patches make that a lot faster.
Let's put it the other way round: if today we had a fast swapoff read
sequence with lazy minor faults to resolve page tables, would we
accept patches that implement the expensive try_to_unuse() scans and
make them mandatory, considering the worst-case runtime they can cause?
I don't think so. We have this scan because the page table references
are pointing to disk slots, and this is the only way to free them.
> And on the other hand we can already just read everything
> into the swap cache then ignore the page table walk with the older
> design too, that's just not a clean swapoff.
How can you relinquish the disk slot as long as the swp_entry_t is in
circulation?