Message-ID: <CAKEwX=NQyDqNBoS2kPePZO1iTkt88MgrtEKexxu7uLhaeA6rsQ@mail.gmail.com>
Date: Tue, 22 Apr 2025 10:15:20 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org, hannes@...xchg.org, 
	hughd@...gle.com, mhocko@...nel.org, roman.gushchin@...ux.dev, 
	shakeel.butt@...ux.dev, muchun.song@...ux.dev, len.brown@...el.com, 
	chengming.zhou@...ux.dev, kasong@...cent.com, chrisl@...nel.org, 
	huang.ying.caritas@...il.com, ryan.roberts@....com, viro@...iv.linux.org.uk, 
	baohua@...nel.org, osalvador@...e.de, lorenzo.stoakes@...cle.com, 
	christophe.leroy@...roup.eu, pavel@...nel.org, kernel-team@...a.com, 
	linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, 
	linux-pm@...r.kernel.org
Subject: Re: [RFC PATCH 00/14] Virtual Swap Space

On Tue, Apr 22, 2025 at 8:03 AM Yosry Ahmed <yosry.ahmed@...ux.dev> wrote:
>
> On Mon, Apr 07, 2025 at 04:42:01PM -0700, Nhat Pham wrote:
> It's exciting to see this proposal materializing :)
>
> I didn't get a chance to look too closely at the code, but I have a few
> high-level comments.
>
> Do we need separate refcnt and swap_count? I am aware that there are
> cases where we need to hold a reference to prevent the descriptor from
> going away, without an extra page table entry referencing the swap
> descriptor -- but I am wondering if we can get away with just incrementing
> the swap count in these cases too? Would this mess things up?

Actually, you're right - we might not even need a separate refcnt
field at all :) Here's my original thought process:

1. We need something that keeps the virtual swap slot and its metadata
structure (the swap descriptor) valid while we work with it.

2. In the old design, this is all stored at the swap device, so we
need to obtain a reference to the swap device itself.

3. In the new design, this is no longer possible - the backend might
even change under us! So the refcounting needs to be done at the
virtual swap level.

4. The refcounting needs to be separate from the swap count field,
because certain operations/optimizations check the actual swap count,
and incrementing the swap count willy-nilly like that might
accidentally throw them off - think readahead-induced swap reads, for
example. So I need a separate refcnt field that takes into account 3
sources: PTE references (swap count), the swap cache, and "ephemeral"
(i.e. temporary) references that replace the role of the swap device
reference in the old design.

However, I have thought more about it. I don't think I need to obtain
any ephemeral reference. I do need a refcnting mechanism, but one
atomic field (that stores both the swap count and swap cache pin)
should suffice.

Refcnt + RCU should already guarantee the existence of the swap
descriptor while I work with it. So there won't be any UAF issue, as
long as I am disciplined and check whether the swap descriptor still
exists in the virtual swap implementation, which I am already doing
anyway.
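
Something along these lines - all names here (vswap_tree, struct
swp_desc, swap_refs, vswap_desc_tryget) are placeholders for
illustration, not necessarily what the series ends up using:

#include <linux/atomic.h>
#include <linux/rcupdate.h>
#include <linux/swap.h>
#include <linux/xarray.h>

/* Placeholder descriptor; the real one also carries the backend state. */
struct swp_desc {
	atomic_t swap_refs;	/* swap count + swap cache pin */
	/* ... zswap entry / swapfile slot / in-memory page ... */
};

static DEFINE_XARRAY(vswap_tree);	/* virtual swap slot -> descriptor */

/* Descriptors must be freed via kfree_rcu() for this to be UAF-safe. */
static struct swp_desc *vswap_desc_tryget(swp_entry_t entry)
{
	struct swp_desc *desc;

	rcu_read_lock();
	desc = xa_load(&vswap_tree, entry.val);
	/* Only use the descriptor if its combined refcount is still nonzero. */
	if (desc && !atomic_inc_not_zero(&desc->swap_refs))
		desc = NULL;
	rcu_read_unlock();

	return desc;
}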

This should be safe enough, even in the face of swapoff, because
swapoff also relies on the same reference counting mechanism to free
the virtual swap slot and its descriptor: it swap_free()s the virtual
swap slot as it unmaps it from the page table entry, which decrements
the swap count. So we're all good on this front.

We DO need to obtain a reference to the swap device in certain places
though, if we want to use it down the line for some optimizations
(for example, to look at the swap device flags to check whether it is
a SWP_SYNCHRONOUS_IO device - see do_swap_page()). But this is a
separate matter.
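
For reference, this is roughly the pattern do_swap_page() uses today
(mainline names - get_swap_device()/put_swap_device() and
SWP_SYNCHRONOUS_IO); with virtual swap we'd first have to resolve the
backing device from the swap descriptor:

	struct swap_info_struct *si;
	bool synchronous = false;

	si = get_swap_device(entry);	/* pins the device against swapoff */
	if (si) {
		synchronous = data_race(si->flags & SWP_SYNCHRONOUS_IO);
		/* ... pick the synchronous (swap-cache-bypassing) swapin path ... */
		put_swap_device(si);
	}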

The end result is that I can reduce these 4 fields:

1. swp_entry_t vswap
2. atomic_t in_swapcache
3. atomic_t swap_count
4. struct kref kref

into a single swap_refs field.
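
For example, one possible encoding (purely illustrative - the bit
layout and helper names are made up, and the actual patches may do
this differently): the low bits hold the swap count and a high bit
records the swap cache pin, so readahead and friends still see the
true swap count:

#define VSWAP_CACHE_PIN		(1U << 30)
#define VSWAP_COUNT_MASK	(VSWAP_CACHE_PIN - 1)

static inline unsigned int vswap_swap_count(struct swp_desc *desc)
{
	/* The swap count is unaffected by the swap cache pin bit. */
	return atomic_read(&desc->swap_refs) & VSWAP_COUNT_MASK;
}

static inline bool vswap_in_swapcache(struct swp_desc *desc)
{
	return atomic_read(&desc->swap_refs) & VSWAP_CACHE_PIN;
}

/* Drop one reference of either kind; free when the whole word hits zero. */
static inline void vswap_put(struct swp_desc *desc, unsigned int ref)
{
	if (atomic_sub_and_test(ref, &desc->swap_refs))
		vswap_desc_free(desc);	/* hypothetical; e.g. via kfree_rcu() */
}

Here ref would be 1 for a PTE reference and VSWAP_CACHE_PIN for the
swap cache pin.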


>
> >
> > This design allows us to:
> > * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> >   simply associate the virtual swap slot with one of the supported
> >   backends: a zswap entry, a zero-filled swap page, a slot on the
> >   swapfile, or an in-memory page.
> > * Simplify and optimize swapoff: we only have to fault the page in and
> >   have the virtual swap slot point to the page instead of the on-disk
> >   physical swap slot. No need to perform any page table walking.
> >
> > Please see the attached patches for implementation details.
> >
> > Note that I do not remove the old implementation for now. Users can
> > select between the old and the new implementation via the
> > CONFIG_VIRTUAL_SWAP build config. This will also allow us to land the
> > new design, and iteratively optimize upon it (without having to include
> > everything in an even more massive patch series).
>
> I know this is easier, but honestly I'd prefer if we do an incremental
> replacement (if possible) rather than introducing a new implementation
> and slowly deprecating the old one, which historically doesn't seem to
> go well :P

I know, I know :P

>
> Once the series is organized as Johannes suggested, and we have better
> insights into how this will be integrated with Kairui's work, it should
> be clearer whether it's possible to incrementally update the current
> implementation rather than add a parallel implementation.

Will take a look at Kairui's work when it's available :)

>
> >
> > III. Future Use Cases
> >
> > Other than decoupling swap backends and optimizing swapoff, this new
> > design allows us to implement the following more easily and
> > efficiently:
> >
> > * Multi-tier swapping (as mentioned in [5]), with transparent
> >   transferring (promotion/demotion) of pages across tiers (see [8] and
> >   [9]). Similar to swapoff, with the old design we would need to
> >   perform the expensive page table walk.
> > * Swapfile compaction to alleviate fragmentation (as proposed by Ying
> >   Huang in [6]).
> > * Mixed backing THP swapin (see [7]): Once you have pinned down the
> >   backing store of THPs, you can dispatch each range of subpages
> >   to the appropriate swapin handler.
> > * Swapping a folio out with discontiguous physical swap slots (see [10])
> >
> >
> > IV. Potential Issues
> >
> > Here are a couple of issues I can think of, along with some potential
> > solutions:
> >
> > 1. Space overhead: we need one swap descriptor per swap entry.
> > * Note that this overhead is dynamic, i.e. only incurred when we actually
> >   need to swap a page out.
> > * It can be further offset by the reduction of swap map and the
> >   elimination of zeromapped bitmap.
> >
> > 2. Lock contention: since the virtual swap space is dynamic/unbounded,
> > we cannot naively range partition it anymore. This can increase lock
> > contention on swap-related data structures (swap cache, zswap’s xarray,
> > etc.).
> > * The problem is slightly alleviated by the lockless nature of the new
> >   reference counting scheme, as well as the per-entry locking for
> >   backing store information.
> > * Johannes suggested that I can implement a dynamic partition scheme, in
> >   which new partitions (along with associated data structures) are
> >   allocated on demand. It is one extra layer of indirection, but global
> >   locking will only be done on partition allocation, rather than on
> >   each access. All other accesses only take local (per-partition)
> >   locks, or are completely lockless (such as partition lookup).
> >
> >
> > V. Benchmarking
> >
> > As a proof of concept, I ran the prototype through some simple
> > benchmarks:
> >
> > 1. usemem: 16 threads, 2G each, memory.max = 16G
> >
> > I benchmarked the following usemem commands:
> >
> > time usemem --init-time -w -O -s 10 -n 16 2g
> >
> > Baseline:
> > real: 33.96s
> > user: 25.31s
> > sys: 341.09s
> > average throughput: 111295.45 KB/s
> > average free time: 2079258.68 usecs
> >
> > New Design:
> > real: 35.87s
> > user: 25.15s
> > sys: 373.01s
> > average throughput: 106965.46 KB/s
> > average free time: 3192465.62 usecs
> >
> > To root cause this regression, I ran perf on the usemem program, as
> > well as on the following stress-ng program:
> >
> > perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng  --pageswap $(nproc) --pageswap-ops 100000
> >
> > and observed the (predicted) increase in lock contention on swap cache
> > accesses. This regression is alleviated if I put together the
> > following hack: limit the virtual swap space to a sufficient size for
> > the benchmark, range partition the swap-related data structures (swap
> > cache, zswap tree, etc.) based on the limit, and distribute the
> > allocation of virtual swap slots among these partitions (on a per-CPU
> > basis):
> >
> > real: 34.94s
> > user: 25.28s
> > sys: 360.25s
> > average throughput: 108181.15 KB/s
> > average free time: 2680890.24 usecs
> >
> > As mentioned above, I will implement proper dynamic swap range
> > partitioning in a follow up work.
>
> I thought there would be some improvements with the new design once the
> lock contention is gone, due to the colocation of all swap metadata. Do
> we know why this isn't the case?

The lock contention is reduced on access, but increased in the
allocation and free steps (because we now have to go through a global
lock, due to the loss of swap space partitioning).

Virtual swap allocation optimization will be the next step, or it can
be done concurrently, if we can figure out a way to make Kairui's work
compatible with this.
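
To make the allocation side concrete, the static-partition hack from
the benchmark section looks roughly like this (hypothetical names and
layout - the real follow-up would allocate partitions dynamically,
taking a global lock only when a new partition is needed):

struct vswap_partition {
	spinlock_t lock;
	unsigned long next;	/* next free virtual slot in this range */
	unsigned long end;
};

static struct vswap_partition *vswap_partitions;	/* one per CPU */

static unsigned long vswap_alloc_slot(void)
{
	struct vswap_partition *part;
	unsigned long slot = 0;	/* 0 == allocation failed in this sketch */

	part = &vswap_partitions[get_cpu()];
	spin_lock(&part->lock);
	if (part->next < part->end)
		slot = part->next++;
	spin_unlock(&part->lock);
	put_cpu();

	return slot;
}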

>
> Also, one missing key metric in this cover letter is disk space savings.
> It would be useful if you can give a realistic example about how much
> disk space is being provisioned and wasted today to effectively use
> zswap, and how much this can decrease with this design.
>
> I believe the disk space savings are one of the main motivations so
> let's showcase that :)

Will do - I was more concerned about regressions, so I wanted to
throw this out there right away to get ideas/feedback.

>
