Message-ID: <CAKEwX=OvuVPJzQsSQm8F+zsRgJFnbMmW2JMJbGebp=U8+jMRYA@mail.gmail.com>
Date: Sun, 8 Feb 2026 14:51:59 -0800
From: Nhat Pham <nphamcs@...il.com>
To: linux-mm@...ck.org
Cc: akpm@...ux-foundation.org, hannes@...xchg.org, hughd@...gle.com,
yosry.ahmed@...ux.dev, mhocko@...nel.org, roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev, muchun.song@...ux.dev, len.brown@...el.com,
chengming.zhou@...ux.dev, kasong@...cent.com, chrisl@...nel.org,
huang.ying.caritas@...il.com, ryan.roberts@....com, shikemeng@...weicloud.com,
viro@...iv.linux.org.uk, baohua@...nel.org, bhe@...hat.com, osalvador@...e.de,
lorenzo.stoakes@...cle.com, christophe.leroy@...roup.eu, pavel@...nel.org,
kernel-team@...a.com, linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
linux-pm@...r.kernel.org, peterx@...hat.com, riel@...riel.com,
joshua.hahnjy@...il.com, npache@...hat.com, gourry@...rry.net,
axelrasmussen@...gle.com, yuanchu@...gle.com, weixugc@...gle.com,
rafael@...nel.org, jannh@...gle.com, pfalcato@...e.de,
zhengqi.arch@...edance.com
Subject: Re: [PATCH v3 00/20] Virtual Swap Space
On Sun, Feb 8, 2026 at 1:58 PM Nhat Pham <nphamcs@...il.com> wrote:
>
> Changelog:
> * RFC v2 -> v3:
> * Implement a cluster-based allocation algorithm for virtual swap
> slots, inspired by Kairui Song and Chris Li's implementation, as
> well as Johannes Weiner's suggestions. This eliminates the lock
> contention issues on the virtual swap layer.
> * Re-use swap table for the reverse mapping.
> * Remove CONFIG_VIRTUAL_SWAP.
> * Reduce the size of the swap descriptor from 48 bytes to 24
> bytes, i.e. another 50% reduction in memory overhead from v2.
> * Remove swap cache and zswap tree and use the swap descriptor
> for this.
> * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
> (one for allocated slots, and one for bad slots).
> * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
> * Update cover letter to include new benchmark results and discussion
> on overhead in various cases.
> * RFC v1 -> RFC v2:
> * Use a single atomic type (swap_refs) for reference counting
> purpose. This brings the size of the swap descriptor from 64 B
> down to 48 B (25% reduction). Suggested by Yosry Ahmed.
> * Zeromap bitmap is removed in the virtual swap implementation.
> This saves one bit per physical swapfile slot.
> * Rearrange the patches and the code change to make things more
> reviewable. Suggested by Johannes Weiner.
> * Update the cover letter a bit.
>
> This patch series implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner. The same idea (with different
> implementation details) has been floated by Rik van Riel since at least
> 2011 (see [8]).
>
> This patch series is based on 6.19. There are a couple more
> swap-related changes in the mm-stable branch that I would need to
> coordinate with, but I would like to send this out as an update, to show
> that the lock contention issues that plagued earlier versions have been
> resolved and performance on the kernel build benchmark is now on-par with
> baseline. Furthermore, memory overhead has been substantially reduced
> compared to the last RFC version.
>
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> disk space, and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is arguably the central shortcoming of
> zswap:
> * In deployments where no disk space can be afforded for swap (such as
> mobile and embedded devices), users cannot adopt zswap, and are forced
> to use zram. This is confusing for users, and creates extra burdens
> for developers, having to develop and maintain similar features for
> two separate swap backends (writeback, cgroup charging, THP support,
> etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
> we have swapfiles on the order of tens to hundreds of GBs, which are
> mostly unused and only exist to enable zswap usage and the zero-filled
> page swap optimization.
> * Tying zswap (and more generally, other in-memory swap backends) to
> the current physical swapfile infrastructure makes zswap implicitly
> statically sized. This does not make sense, as unlike disk swap, in
> which we consume a limited resource (disk space or swapfile space) to
> save another resource (memory), zswap consumes the same resource it is
> saving (memory). The more we zswap, the more memory we have available,
> not less. We are not rationing a limited resource when we limit
> the size of the zswap pool, but rather we are capping the resource
> (memory) saving potential of zswap. Under memory pressure, using
> more zswap is almost always better than the alternative (disk IOs, or
> even worse, OOMs), and dynamically sizing the zswap pool on demand
> allows the system to flexibly respond to these precarious scenarios.
> * Operationally, statically provisioning the swapfile for zswap poses
> significant challenges, because the sysadmin has to prescribe how
> much swap is needed a priori, for each combination of
> (memory size x disk space x workload usage). It is even more
> complicated when we take into account the variance of memory
> compression, which changes the reclaim dynamics (and as a result,
> swap space size requirement). The problem is further exacerbated for
> users who rely on swap utilization (and exhaustion) as an OOM signal.
>
> All of these factors make it very difficult to configure the swapfile
> for zswap: too small of a swapfile and we risk preventable OOMs and
> limit the memory saving potentials of zswap; too big of a swapfile
> and we waste disk space and memory due to swap metadata overhead.
> This dilemma becomes more drastic in high memory systems, which can
> have up to TBs worth of memory.
>
> Past attempts to decouple disk and compressed swap backends, namely the
> ghost swapfile approach (see [13]), as well as the alternative
> compressed swap backend zram, have mainly focused on eliminating the
> disk space usage of compressed backends. We want a solution that not
> only tackles that same problem, but also achieves dynamic sizing of the
> swap space to maximize the memory saving potentials while reducing
> operational and static memory overhead.
>
> Finally, any swap redesign should support efficient backend transfer,
> i.e. without having to perform the expensive page table walk to
> update all the PTEs that refer to the swap entry:
> * The main motivation for this requirement is zswap writeback. To quote
> Johannes (from [14]): "Combining compression with disk swap is
> extremely powerful, because it dramatically reduces the worst aspects
> of both: it reduces the memory footprint of compression by shedding
> the coldest data to disk; it reduces the IO latencies and flash wear
> of disk swap through the writeback cache. In practice, this reduces
> *average event rates of the entire reclaim/paging/IO stack*."
> * Another motivation is to simplify swapoff, which is both complicated
> and expensive in the current design, precisely because we are storing
> an encoding of the backend positional information in the page table,
> and thus require a full page table walk to remove these references.
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a dynamically
> allocated virtual swap slot, storing it in page table entries, and
> using it to index into various swap-related data structures. The
> backing storage is decoupled from the virtual swap slot, and the newly
> introduced layer will “resolve” the virtual swap slot to the actual
> storage. This layer also manages other metadata of the swap entry, such
> as its lifetime information (swap count), via a dynamically allocated,
> per-swap-entry descriptor:
>
> struct swp_desc {
>         union {
>                 swp_slot_t           slot;            /*     0     8 */
>                 struct zswap_entry * zswap_entry;     /*     0     8 */
>         };                                            /*     0     8 */
>         union {
>                 struct folio *       swap_cache;      /*     8     8 */
>                 void *               shadow;          /*     8     8 */
>         };                                            /*     8     8 */
>         unsigned int                 swap_count;      /*    16     4 */
>         unsigned short               memcgid:16;      /*    20: 0  2 */
>         bool                         in_swapcache:1;  /*    22: 0  1 */
>
>         /* Bitfield combined with previous fields */
>
>         enum swap_type               type:2;          /*    20:17  4 */
>
>         /* size: 24, cachelines: 1, members: 6 */
>         /* bit_padding: 13 bits */
>         /* last cacheline: 24 bytes */
> };
>
> (output from pahole).
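
A minimal userspace sketch of the descriptor layout, to make the 24-byte
figure easy to check outside the kernel. The swp_slot_t definition, the
forward-declared types, and the enum swap_type value names are hypothetical
stand-ins (the pahole output does not show them), and the size check assumes
an LP64 target built with GCC or Clang, where enum bit-fields are unsigned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins so the sketch compiles in userspace. */
typedef struct { unsigned long val; } swp_slot_t;
struct zswap_entry;
struct folio;
enum swap_type { VSWAP_SWAPFILE, VSWAP_ZERO, VSWAP_ZSWAP, VSWAP_FOLIO };

/* Field order and widths copied from the pahole output above. */
struct swp_desc {
	union {
		swp_slot_t slot;
		struct zswap_entry *zswap_entry;
	};
	union {
		struct folio *swap_cache;
		void *shadow;
	};
	unsigned int swap_count;
	unsigned short memcgid:16;
	bool in_swapcache:1;
	enum swap_type type:2;
};

/* Compile-time check of the "size: 24" line (LP64, GCC/Clang assumed). */
static_assert(sizeof(struct swp_desc) == 24, "descriptor should be 24 bytes");

static inline size_t swp_desc_size(void)
{
	return sizeof(struct swp_desc);
}

static inline size_t swp_desc_count_offset(void)
{
	return offsetof(struct swp_desc, swap_count);
}

/* Store a backend type in the 2-bit field and read it back. */
static inline enum swap_type swp_desc_type_roundtrip(enum swap_type t)
{
	struct swp_desc d;

	memset(&d, 0, sizeof(d));
	d.type = t;
	return d.type;
}
```

The two unions are what keep the descriptor small: a slot and a zswap entry
are never needed at the same time, and neither are a swap cache folio and a
workingset shadow, so each pair shares one word.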
>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> simply associate the virtual swap slot with one of the supported
> backends: a zswap entry, a zero-filled swap page, a slot on the
> swapfile, or an in-memory page.
> * Simplify and optimize swapoff: we only have to fault the page in and
> have the virtual swap slot point to the page instead of the on-disk
> physical swap slot. No need to perform any page table walking.
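
To illustrate why no page table walk is needed, here is a toy model of the
indirection. It is only a sketch: struct vdesc, vtable, vswap_resolve() and
vswap_retarget_to_folio() are names invented for this example and are not the
functions in the series. The point is that the value stored in the page table
is the virtual slot, so changing the backend only touches the descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical backend tags, one per backend listed above. */
enum backend {
	BACKEND_NONE,
	BACKEND_SWAPFILE,
	BACKEND_ZSWAP,
	BACKEND_ZERO,
	BACKEND_FOLIO,
};

/* Toy descriptor: a tag plus whatever the current backend needs. */
struct vdesc {
	enum backend type;
	union {
		unsigned long phys_slot;	/* BACKEND_SWAPFILE */
		void *zswap_entry;		/* BACKEND_ZSWAP */
		void *folio;			/* BACKEND_FOLIO */
	} u;
};

#define NR_VSLOTS 16
static struct vdesc vtable[NR_VSLOTS];

/* "Resolve" a virtual slot (the value a PTE would hold) to its descriptor. */
static inline struct vdesc *vswap_resolve(unsigned long vslot)
{
	assert(vslot < NR_VSLOTS);
	return &vtable[vslot];
}

/*
 * Swapoff-style transfer: after the page has been read back in, point the
 * virtual slot at the in-memory page. Every PTE that stores this virtual
 * slot stays valid because the slot number itself does not change.
 */
static inline bool vswap_retarget_to_folio(unsigned long vslot, void *folio)
{
	struct vdesc *d = vswap_resolve(vslot);

	if (d->type != BACKEND_SWAPFILE)
		return false;
	d->type = BACKEND_FOLIO;
	d->u.folio = folio;
	return true;
}
```

The same shape would apply to zswap writeback in the other direction: the
descriptor moves from a zswap entry to a swapfile slot while the page tables
are left alone.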
>
> The size of the virtual swap descriptor is 24 bytes. Note that this is
> not all "new" overhead, as the swap descriptor will replace:
> * the swap_cgroup arrays (one per swap type) in the old design, which
> is a massive source of static memory overhead. With the new design,
> it is only allocated for used clusters.
> * the swap tables, which hold the swap cache and workingset shadows.
> * the zeromap bitmap, which is a bitmap of physical swap slots to
> indicate whether the swapped out page is zero-filled or not.
> * a huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> one for allocated slots, and one for bad slots, representing 3 possible
> states of a slot on the swapfile: allocated, free, and bad.
> * the zswap tree.
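
As a rough illustration of how two bitmaps can encode the three slot states
(allocated, free, bad), here is a small userspace sketch. The struct, the
helper names, and the fixed NR_SLOTS are assumptions made for the example;
the series uses the kernel's own bitmap helpers and per-swapfile sizing.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS	128
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS	((NR_SLOTS + BITS_PER_WORD - 1) / BITS_PER_WORD)

enum slot_state { SLOT_FREE, SLOT_ALLOCATED, SLOT_BAD };

/* Two bits of metadata per physical slot instead of one byte. */
struct slot_maps {
	unsigned long allocated[NR_WORDS];
	unsigned long bad[NR_WORDS];
};

static inline bool map_test(const unsigned long *map, size_t nr)
{
	assert(nr < NR_SLOTS);
	return (map[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

static inline void map_set(unsigned long *map, size_t nr)
{
	assert(nr < NR_SLOTS);
	map[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

static inline void map_clear(unsigned long *map, size_t nr)
{
	assert(nr < NR_SLOTS);
	map[nr / BITS_PER_WORD] &= ~(1UL << (nr % BITS_PER_WORD));
}

/* A bad slot is never usable, so the bad bit takes precedence. */
static inline enum slot_state slot_get_state(const struct slot_maps *m, size_t nr)
{
	if (map_test(m->bad, nr))
		return SLOT_BAD;
	return map_test(m->allocated, nr) ? SLOT_ALLOCATED : SLOT_FREE;
}

/* Claim a slot; fails if it is bad or already allocated. */
static inline bool slot_alloc(struct slot_maps *m, size_t nr)
{
	if (slot_get_state(m, nr) != SLOT_FREE)
		return false;
	map_set(m->allocated, nr);
	return true;
}

static inline void slot_release(struct slot_maps *m, size_t nr)
{
	map_clear(m->allocated, nr);
}
```

The reference count that the old swap_map byte carried is not needed here,
since the swap count now lives in the descriptor.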
>
> So, in terms of additional memory overhead:
> * For zswap entries, the added memory overhead is rather minimal. The
> new indirection pointer neatly replaces the existing zswap tree.
> We really only incur less than one word of overhead for swap count
> blow up (since we no longer use swap continuation) and the swap type.
> * For physical swap entries, the new design will impose fewer than 3 words
> of memory overhead. However, as noted above, this overhead is only for
> actively used swap entries, whereas in the current design the overhead is
> static (including the swap cgroup array for example).
>
> The primary victim of this overhead will be zram users. However, as
> zswap now no longer takes up disk space, zram users can consider
> switching to zswap (which, as a bonus, has a lot of useful features
> out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> LRU-ordering writeback, etc.).
>
> For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> 8,388,608 swap entries), and we use zswap.
>
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 0.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 48.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 96.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 121.00 MB
> * Vswap total overhead: 144.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 153.00 MB
> * Vswap total overhead: 193.00 MB
>
> So even in the worst case scenario for virtual swap, i.e. when we
> somehow have an oracle to correctly size the swapfile for zswap
> pool to 32 GB, the added overhead is only 40 MB, which is a mere
> 0.12% of the total swapfile :)
>
> In practice, the overhead will be closer to the 50-75% usage case, as
> systems tend to leave swap headroom for pathological events or sudden
> spikes in memory requirements. The added overhead in these cases is
> practically negligible. And in deployments where swapfiles for zswap
> were previously sparsely used, switching over to virtual swap will
> actually reduce memory overhead.
>
> Doing the same math for the disk swap, which is the worst case for
> virtual swap in terms of swap backends:
>
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 2.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 41.00 MB
> * Vswap total overhead: 66.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 130.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 73.00 MB
> * Vswap total overhead: 194.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 259.00 MB
>
> The added overhead is 170 MB, which is 0.5% of the total swapfile size,
> again in the worst case when we have a sizing oracle.
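
The two tables above can be reproduced with a simple linear model: a static
cost plus a per-entry cost. The constants in this sketch are fitted to the
numbers in the tables and are an assumption of the example; the cover letter
does not state them, and it does not say which data structures each constant
corresponds to. MB is taken to mean MiB and pages are assumed to be 4 KiB, so
that a 32 GB swapfile has 8,388,608 slots.

```c
#include <assert.h>
#include <stdint.h>

#define MIB		(1024.0 * 1024.0)
#define PAGE_BYTES	4096.0
#define SWAPFILE_SLOTS	8388608ULL	/* 32 GB of 4 KiB slots */

/* overhead = static_mib + entries * bytes_per_entry */
struct cost_model {
	double static_mib;	/* paid even at 0% usage */
	double bytes_per_entry;	/* paid per in-use swap entry */
};

/* Fitted to the tables above; assumed, not taken from the patches. */
static const struct cost_model old_zswap   = { 25.0, 16.0 };
static const struct cost_model vswap_zswap = {  0.0, 24.125 };
static const struct cost_model old_disk    = { 25.0,  8.0 };
static const struct cost_model vswap_disk  = {  2.0, 32.125 };

static inline double overhead_mib(const struct cost_model *m, uint64_t entries)
{
	assert(entries <= SWAPFILE_SLOTS);
	return m->static_mib + (double)entries * m->bytes_per_entry / MIB;
}

/* Extra overhead of the new design, as a percentage of the swapfile size. */
static inline double added_pct(const struct cost_model *old_model,
			       const struct cost_model *new_model,
			       uint64_t entries)
{
	double swapfile_mib = (double)SWAPFILE_SLOTS * PAGE_BYTES / MIB;
	double delta = overhead_mib(new_model, entries) -
		       overhead_mib(old_model, entries);

	return 100.0 * delta / swapfile_mib;
}
```

Under this model the zswap lines cross at roughly 38% usage (25 MB of static
cost divided by about 8 extra bytes per entry), which is consistent with
virtual swap being cheaper at 25% usage and more expensive at 50%.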
>
> Please see the attached patches for more implementation details.
>
>
> III. Usage and Benchmarking
>
> This patch series introduces no new syscalls or userspace API. Existing
> userspace setups will work as-is, except we no longer have to create a
> swapfile or set memory.swap.max if we want to use zswap, as zswap is no
> longer tied to physical swap. The zswap pool will be automatically and
> dynamically sized based on memory usage and reclaim dynamics.
>
> To measure the performance of the new implementation, I have run the
> following benchmarks:
>
> 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
>
> Using zswap as the backend:
>
> Baseline:
> real: mean: 185.2s, stdev: 0.93s
> sys: mean: 683.7s, stdev: 33.77s
>
> Vswap:
> real: mean: 184.88s, stdev: 0.57s
> sys: mean: 675.14s, stdev: 32.8s
>
> We actually see a slight improvement in systime (by 1.5%) :) This is
> likely because we no longer have to perform swap charging for zswap
> entries, and the virtual swap allocator is simpler than that of physical
> swap.
>
> Using SSD swap as the backend:
>
> Baseline:
> real: mean: 200.3s, stdev: 2.33s
> sys: mean: 489.88s, stdev: 9.62s
>
> Vswap:
> real: mean: 201.47s, stdev: 2.98s
> sys: mean: 487.36s, stdev: 5.53s
>
> The performance is neck and neck.
>
>
> IV. Future Use Cases
>
> While the patch series focuses on two applications (decoupling swap
> backends and swapoff optimization/simplification), this new,
> future-proof design also allows us to implement new swap features more
> easily and efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
> transferring (promotion/demotion) of pages across tiers (see [8] and
> [9]). Similar to swapoff, with the old design we would need to
> perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
> Huang in [6]).
> * Mixed backing THP swapin (see [7]): Once you have pinned down the
> backing store of THPs, then you can dispatch each range of subpages
> to appropriate backend swapin handler.
> * Swapping a folio out with discontiguous physical swap slots
> (see [10]).
> * Zswap writeback optimization: The current architecture pre-reserves
> physical swap space for pages when they enter the zswap pool, giving
> the kernel no flexibility at writeback time. With the virtual swap
> implementation, the backends are decoupled, and physical swap space
> is allocated on-demand at writeback time, at which point we can make
> much smarter decisions: we can batch multiple zswap writeback
> operations into a single IO request, allocating contiguous physical
> swap slots for that request. We can even perform compressed writeback
> (i.e. writing these pages without decompressing them) (see [12]).
>
>
> V. References
>
> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> [2]: https://lwn.net/Articles/932077/
> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
> [10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
> [11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
> [12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
> [13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
> [14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
>
> Nhat Pham (20):
> mm/swap: decouple swap cache from physical swap infrastructure
> swap: rearrange the swap header file
> mm: swap: add an abstract API for locking out swapoff
> zswap: add new helpers for zswap entry operations
> mm/swap: add a new function to check if a swap entry is in swap
> cached.
> mm: swap: add a separate type for physical swap slots
> mm: create scaffolds for the new virtual swap implementation
> zswap: prepare zswap for swap virtualization
> mm: swap: allocate a virtual swap slot for each swapped out page
> swap: move swap cache to virtual swap descriptor
> zswap: move zswap entry management to the virtual swap descriptor
> swap: implement the swap_cgroup API using virtual swap
> swap: manage swap entry lifecycle at the virtual swap layer
> mm: swap: decouple virtual swap slot from backing store
> zswap: do not start zswap shrinker if there is no physical swap slots
> swap: do not unnecesarily pin readahead swap entries
> swapfile: remove zeromap bitmap
> memcg: swap: only charge physical swap slots
> swap: simplify swapoff using virtual swap
> swapfile: replace the swap map with bitmaps
>
> Documentation/mm/swap-table.rst | 69 --
> MAINTAINERS | 2 +
> include/linux/cpuhotplug.h | 1 +
> include/linux/mm_types.h | 16 +
> include/linux/shmem_fs.h | 7 +-
> include/linux/swap.h | 135 ++-
> include/linux/swap_cgroup.h | 13 -
> include/linux/swapops.h | 25 +
> include/linux/zswap.h | 17 +-
> kernel/power/swap.c | 6 +-
> mm/Makefile | 5 +-
> mm/huge_memory.c | 11 +-
> mm/internal.h | 12 +-
> mm/memcontrol-v1.c | 6 +
> mm/memcontrol.c | 142 ++-
> mm/memory.c | 101 +-
> mm/migrate.c | 13 +-
> mm/mincore.c | 15 +-
> mm/page_io.c | 83 +-
> mm/shmem.c | 215 +---
> mm/swap.h | 157 +--
> mm/swap_cgroup.c | 172 ---
> mm/swap_state.c | 306 +----
> mm/swap_table.h | 78 +-
> mm/swapfile.c | 1518 ++++-------------------
> mm/userfaultfd.c | 18 +-
> mm/vmscan.c | 28 +-
> mm/vswap.c | 2025 +++++++++++++++++++++++++++++++
> mm/zswap.c | 142 +--
> 29 files changed, 2853 insertions(+), 2485 deletions(-)
> delete mode 100644 Documentation/mm/swap-table.rst
> delete mode 100644 mm/swap_cgroup.c
> create mode 100644 mm/vswap.c
>
>
> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
> --
> 2.47.3
Weirdly, it seems like the cover letter (and only the cover letter) is
not being delivered...
I'm trying to figure out what's going on :( My apologies for the
inconvenience...