Message-ID: <aAeuAwlST1sNifBs@Asmaa.>
Date: Tue, 22 Apr 2025 07:56:03 -0700
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Nhat Pham <nphamcs@...il.com>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org, hannes@...xchg.org,
        hughd@...gle.com, mhocko@...nel.org, roman.gushchin@...ux.dev,
        shakeel.butt@...ux.dev, muchun.song@...ux.dev, len.brown@...el.com,
        chengming.zhou@...ux.dev, kasong@...cent.com, chrisl@...nel.org,
        huang.ying.caritas@...il.com, ryan.roberts@....com,
        viro@...iv.linux.org.uk, baohua@...nel.org, osalvador@...e.de,
        lorenzo.stoakes@...cle.com, christophe.leroy@...roup.eu,
        pavel@...nel.org, kernel-team@...a.com,
        linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
        linux-pm@...r.kernel.org
Subject: Re: [RFC PATCH 00/14] Virtual Swap Space

On Mon, Apr 07, 2025 at 04:42:01PM -0700, Nhat Pham wrote:
> This RFC implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner. The same idea (with different
> implementation details) has been floated by Rik van Riel since at least
> 2011 (see [8]).
>
> The code attached to this RFC is purely a prototype. It is not 100%
> merge-ready (see section VI for future work). I do, however, want to show
> people this prototype/RFC, including all the bells and whistles and a
> couple of actual use cases, so that folks can see what the end results
> will look like, and give me early feedback :)
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped-out content, as well as the index into swap data structures,
> such as the swap cache or the swap cgroup mapping. Tying a swap entry
> to its backing slot in this way is performant and efficient when swap
> is purely disk space and swapoff is rare.
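>
> As a quick illustration of this coupling, here is a minimal userspace
> sketch (not the kernel's actual swp_entry_t helpers; the bit widths
> are made up for illustration) of how the value stored in the page
> table entry directly encodes both the backing device and the physical
> slot on it:
>
> #include <assert.h>
> #include <stdio.h>
>
> #define TYPE_BITS 5UL   /* hypothetical width for the device index */
>
> typedef unsigned long swp_entry_model_t;
>
> /* Pack the swap device index and the physical slot offset together. */
> static swp_entry_model_t make_entry(unsigned long type, unsigned long offset)
> {
>         return (offset << TYPE_BITS) | type;
> }
>
> static unsigned long entry_type(swp_entry_model_t e)
> {
>         return e & ((1UL << TYPE_BITS) - 1);
> }
>
> static unsigned long entry_offset(swp_entry_model_t e)
> {
>         return e >> TYPE_BITS;
> }
>
> int main(void)
> {
>         /* Slot 1234 on swap device 2: the PTE value *is* the disk location. */
>         swp_entry_model_t e = make_entry(2, 1234);
>
>         assert(entry_type(e) == 2);
>         assert(entry_offset(e) == 1234);
>         printf("device %lu, slot %lu\n", entry_type(e), entry_offset(e));
>         return 0;
> }
>
> Because the physical location is baked into the entry itself, moving
> the data (e.g. during swapoff) means finding and rewriting every copy
> of the entry in the page tables.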
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is arguably the central shortcoming of
> zswap:
> * In deployments where no disk space can be afforded for swap (such as
> mobile and embedded devices), users cannot adopt zswap, and are forced
> to use zram. This is confusing for users, and creates extra burdens
> for developers, who have to develop and maintain similar features for
> two separate swap backends (writeback, cgroup charging, THP support,
> etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage, and it
> caps the memory-saving potential of these optimizations at the static
> size of the swapfile, especially in high-memory systems that can have
> up to terabytes worth of memory. It also creates significant
> challenges for users who rely on swap utilization as an early OOM
> signal.
>
> Another motivation for a swap redesign is to simplify swapoff, which
> is complicated and expensive in the current design. The tight coupling
> between a swap entry and its backing storage means that swapoff
> requires a full page table walk to update all the page table entries
> that refer to each swap entry, as well as updating all the associated
> swap data structures (swap cache, etc.).
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a dynamically
> allocated virtual swap slot, storing it in page table entries, and
> using it to index into various swap-related data structures. The
> backing storage is decoupled from the virtual swap slot, and the newly
> introduced layer will “resolve” the virtual swap slot to the actual
> storage. This layer also manages other metadata of the swap entry, such
> as its lifetime information (swap count), via a dynamically allocated
> per-swap-entry descriptor:
>
> struct swp_desc {
>         swp_entry_t vswap;      /* the virtual swap slot */
>         union {                 /* backing store, discriminated by @type */
>                 swp_slot_t slot;                 /* physical slot on a swapfile */
>                 struct folio *folio;             /* in-memory page */
>                 struct zswap_entry *zswap_entry; /* compressed copy in zswap */
>         };
>         struct rcu_head rcu;
>
>         rwlock_t lock;          /* protects the backing store information */
>         enum swap_type type;
>
>         atomic_t memcgid;       /* for the swap cgroup mapping */
>
>         atomic_t in_swapcache;
>         struct kref refcnt;
>         atomic_t swap_count;    /* lifetime information (swap count) */
> };
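>
> As a rough sketch of how the new layer resolves a virtual slot to its
> backing store (all function and enum names below are hypothetical,
> and locking/RCU/error handling details are elided; the real code is
> in the attached patches):
>
> /* Hypothetical illustration only, not the actual interface. */
> static void vswap_resolve_sketch(swp_entry_t vswap)
> {
>         /* vswap_lookup() is hypothetical, e.g. an xarray lookup. */
>         struct swp_desc *desc = vswap_lookup(vswap);
>
>         read_lock(&desc->lock);
>         switch (desc->type) {
>         case VSWAP_ZSWAP:       /* compressed copy; no disk slot consumed */
>                 /* ... decompress via desc->zswap_entry ... */
>                 break;
>         case VSWAP_ZERO:        /* zero-filled page; nothing stored at all */
>                 break;
>         case VSWAP_SWAPFILE:    /* physical slot on a backing swapfile */
>                 /* ... submit I/O for desc->slot ... */
>                 break;
>         case VSWAP_FOLIO:       /* still (or again) in memory, e.g. after swapoff */
>                 /* ... reuse desc->folio directly ... */
>                 break;
>         }
>         read_unlock(&desc->lock);
> }
>
> The page table entries only ever hold the virtual handle, so any
> transition between these backends can happen without touching the
> page tables.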

It's exciting to see this proposal materializing :)
I didn't get a chance to look too closely at the code, but I have a few
high-level comments.

Do we need separate refcnt and swap_count? I am aware that there are
cases where we need to hold a reference to prevent the descriptor from
going away, without an extra page table entry referencing the swap
descriptor -- but I am wondering if we can get away with just
incrementing the swap count in these cases too? Would this mess things
up?
>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> simply associate the virtual swap slot with one of the supported
> backends: a zswap entry, a zero-filled swap page, a slot on the
> swapfile, or an in-memory page.
> * Simplify and optimize swapoff: we only have to fault the page in and
> have the virtual swap slot point to the page instead of the on-disk
> physical swap slot. No need to perform any page table walking (see the
> sketch below).
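>
> A hypothetical sketch of what that looks like for one still-referenced
> entry during swapoff (names are illustrative, not taken from the
> patches):
>
> static void vswap_detach_slot(struct swp_desc *desc, struct folio *folio)
> {
>         write_lock(&desc->lock);
>         desc->folio = folio;            /* union member now points at memory */
>         desc->type = VSWAP_FOLIO;       /* hypothetical backend tag */
>         write_unlock(&desc->lock);
>         /*
>          * The page tables still hold the same virtual swap entry; it
>          * simply resolves to the in-memory folio from now on, so no
>          * page table walk is needed.
>          */
> }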
>
> Please see the attached patches for implementation details.
>
> Note that I do not remove the old implementation for now. Users can
> select between the old and the new implementation via the
> CONFIG_VIRTUAL_SWAP build config. This will also allow us to land the
> new design, and iteratively optimize upon it (without having to include
> everything in an even more massive patch series).

I know this is easier, but honestly I'd prefer if we do an incremental
replacement (if possible) rather than introducing a new implementation
and slowly deprecating the old one, which historically doesn't seem to
go well :P

Once the series is organized as Johannes suggested, and we have better
insights into how this will be integrated with Kairui's work, it should
be clearer whether it's possible to incrementally update the current
implementation rather than adding a parallel one.
>
> III. Future Use Cases
>
> Other than decoupling swap backends and optimizing swapoff, this new
> design allows us to implement the following more easily and
> efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
> transfer (promotion/demotion) of pages across tiers (see [8] and
> [9]). Similar to swapoff, the old design would require an expensive
> page table walk here.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
> Huang in [6]).
> * Mixed backing THP swapin (see [7]): once you have pinned down the
> backing stores of a THP, you can dispatch each range of subpages
> to the appropriate swapin handler.
> * Swapping a folio out with discontiguous physical swap slots (see [10]).
>
>
> IV. Potential Issues
>
> Here are a couple of issues I can think of, along with some potential
> solutions:
>
> 1. Space overhead: we need one swap descriptor per swap entry.
> * Note that this overhead is dynamic, i.e., only incurred when we
> actually need to swap a page out.
> * It can be further offset by the reduction of the swap map and the
> elimination of the zeromap bitmap.
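>
> As a rough back-of-the-envelope example (the exact descriptor size
> depends on config and padding): if the descriptor ends up around 64
> bytes, one million swapped-out 4K pages (~4GB of swapped data) cost
> ~64MB of descriptors, on the order of 1.5% of the data being swapped,
> partially offset by the smaller swap map (one byte per physical slot
> today) and the eliminated zeromap (one bit per slot).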
>
> 2. Lock contention: since the virtual swap space is dynamic/unbounded,
> we cannot naively range-partition it anymore. This can increase lock
> contention on swap-related data structures (swap cache, zswap’s xarray,
> etc.).
> * The problem is slightly alleviated by the lockless nature of the new
> reference counting scheme, as well as the per-entry locking for
> backing store information.
> * Johannes suggested that I can implement a dynamic partitioning
> scheme, in which new partitions (along with their associated data
> structures) are allocated on demand. It is one extra layer of
> indirection, but global locking is needed only on partition
> allocation, rather than on each access. All other accesses only take
> local (per-partition) locks, or are completely lockless (such as
> partition lookup).
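>
> The shape of that dynamic partitioning could look roughly like the
> sketch below (data structure, function and constant names here are
> hypothetical):
>
> struct vswap_partition {
>         struct xarray slots;    /* virtual slot -> swp_desc, per-partition lock */
> };
>
> static struct xarray vswap_partitions; /* partition id -> vswap_partition */
>
> static struct swp_desc *vswap_lookup_sketch(swp_entry_t vswap)
> {
>         unsigned long id = swp_offset(vswap) >> PARTITION_SHIFT;  /* hypothetical split */
>         unsigned long idx = swp_offset(vswap) & PARTITION_MASK;
>         struct vswap_partition *part;
>
>         /* Partition lookup is lockless (xa_load is RCU-protected). */
>         part = xa_load(&vswap_partitions, id);
>         if (!part)
>                 return NULL;
>
>         /*
>          * Reads here are lockless as well; only stores take the
>          * partition-local xarray lock. The global lock is needed only
>          * when a brand new partition has to be allocated and inserted
>          * into vswap_partitions.
>          */
>         return xa_load(&part->slots, idx);
> }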
>
>
> V. Benchmarking
>
> As a proof of concept, I ran the prototype through some simple
> benchmarks:
>
> 1. usemem: 16 threads, 2G each, memory.max = 16G
>
> I benchmarked the following usemem command:
>
> time usemem --init-time -w -O -s 10 -n 16 2g
>
> Baseline:
> real: 33.96s
> user: 25.31s
> sys: 341.09s
> average throughput: 111295.45 KB/s
> average free time: 2079258.68 usecs
>
> New Design:
> real: 35.87s
> user: 25.15s
> sys: 373.01s
> average throughput: 106965.46 KB/s
> average free time: 3192465.62 usecs
>
> To root cause this regression, I ran perf on the usemem program, as
> well as on the following stress-ng program:
>
> perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng --pageswap $(nproc) --pageswap-ops 100000
>
> and observed the (predicted) increase in lock contention on swap cache
> accesses. This regression is alleviated if I put together the
> following hack: limit the virtual swap space to a size sufficient for
> the benchmark, range-partition the swap-related data structures (swap
> cache, zswap tree, etc.) based on the limit, and distribute the
> allocation of virtual swap slots among these partitions (on a per-CPU
> basis):
>
> real: 34.94s
> user: 25.28s
> sys: 360.25s
> average throughput: 108181.15 KB/s
> average free time: 2680890.24 usecs
>
> As mentioned above, I will implement proper dynamic swap range
> partitioning in follow-up work.

I thought there would be some improvements with the new design once the
lock contention is gone, due to the colocation of all swap metadata. Do
we know why this isn't the case?

Also, one missing key metric in this cover letter is disk space savings.
It would be useful if you could give a realistic example of how much
disk space is being provisioned and wasted today to effectively use
zswap, and how much this can decrease with this design.
I believe the disk space savings are one of the main motivations, so
let's showcase that :)