Message-ID: <CAKEwX=MEunn2zhPi1Rmof2V1ShWud03X2-vFAgFTRHLRw2R8PQ@mail.gmail.com>
Date: Fri, 17 Jan 2025 09:47:00 +0700
From: Nhat Pham <nphamcs@...il.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: lsf-pc@...ts.linux-foundation.org, akpm@...ux-foundation.org, 
	hannes@...xchg.org, ryncsn@...il.com, chengming.zhou@...ux.dev, 
	chrisl@...nel.org, linux-mm@...ck.org, kernel-team@...a.com, 
	linux-kernel@...r.kernel.org, shakeel.butt@...ux.dev, hch@...radead.org, 
	hughd@...gle.com, 21cnbao@...il.com, usamaarif642@...il.com
Subject: Re: [LSF/MM/BPF TOPIC] Virtual Swap Space

On Fri, Jan 17, 2025 at 1:48 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>
> On Thu, Jan 16, 2025 at 1:22 AM Nhat Pham <nphamcs@...il.com> wrote:
> >
> > My apologies if I missed any interested party in the cc list -
> > hopefully the mailing list cc's suffice :)
> >
> > I'd like to (re-)propose the topic of swap abstraction layer for the
> > conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> > (see [1], [2], [3]).
> >
> > (AFAICT, the same idea has been floated by Rik van Riel since at
> > least 2011 - see [8]).
> >
> > I have a working(-ish) prototype, which will hopefully be
> > submission-ready soon. For now, I'd like to give the motivation/context
> > for the topic, as well as some high-level design:
>
> I would obviously be interested in attending this, albeit virtually if
> possible. Just sharing some random thoughts below from my cold cache.

Your inputs are always appreciated :)

>
> >
> > I. Motivation
> >
> > Currently, when an anon page is swapped out, a slot in a backing swap
> > device is allocated and stored in the page table entries that refer to
> > the original page. This slot is also used as the "key" to find the
> > swapped out content, as well as the index to swap data structures, such
> > as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> > backing slot in this way is performant and efficient when swap is
> > purely disk space, and swapoff is rare.
> >
> > However, the advent of many swap optimizations has exposed major
> > drawbacks of this design. The first problem is that we occupy a physical
> > slot in the swap space, even for pages that are NEVER expected to hit
> > the disk: pages compressed and stored in the zswap pool, zero-filled
> > pages, or pages rejected by both of these optimizations when zswap
> > writeback is disabled. This is arguably the central shortcoming of
> > zswap:
> > * In deployments where no disk space can be afforded for swap (such as
> >   mobile and embedded devices), users cannot adopt zswap, and are forced
> >   to use zram. This is confusing for users, and creates an extra burden
> >   for developers, who have to develop and maintain similar features for
> >   two separate swap backends (writeback, cgroup charging, THP support,
> >   etc.). For instance, see the discussion in [4].
> > * Resource-wise, it is hugely wasteful in terms of disk usage, and
> >   limits the memory saving potential of these optimizations to the
> >   static size of the swapfile, especially on high-memory systems that
> >   can have up to terabytes worth of memory. It also creates significant
> >   challenges for users who rely on swap utilization as an early OOM
> >   signal.
> >
> > Another motivation for a swap redesign is to simplify swapoff, which
> > is complicated and expensive in the current design. Tight coupling
> > between a swap entry and its backing storage means that swapoff
> > requires a full page table walk to update all the page table entries
> > that refer to each swap entry on the device, as well as updates to all
> > the associated swap data structures (swap cache, etc.).
> >
> >
> > II. High Level Design Overview
> >
> > To fix the aforementioned issues, we need an abstraction that separates
> > a swap entry from its physical backing storage. IOW, we need to
> > “virtualize” the swap space: swap clients will work with a virtual swap
> > slot (that is dynamically allocated on-demand), storing it in page
> > table entries, and using it to index into various swap-related data
> > structures.
> >
> > The backing storage is decoupled from this slot, and the newly
> > introduced layer will “resolve” the ID to the actual storage, as well
> > as cooperate with the swap cache to handle all the required
> > synchronization. This layer also manages other metadata of the swap
> > entry, such as its lifetime information (swap count), via a dynamically
> > allocated per-entry swap descriptor:
>
> Do you plan to allocate one per-folio or per-page? I suppose it's
> per-page based on the design, but I am wondering if you explored
> having it per-folio. To make that work we'd need to support splitting a
> swp_desc, and figuring out which slot or zswap_entry corresponds to a
> given page in a folio.

Per-page, for now. Per-folio requires allocating these swp_descs on
huge page splitting etc., which is more complex.

And yeah, we'd need to chain these zswap_entry's somehow. Certainly
not impossible, but it means more overhead and more complexity :)
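Purely for illustration, a per-folio variant might look something like
this (made-up names, and not part of the prototype):

	/* One descriptor covering a whole folio: the zswap arm of the
	 * union becomes an array of per-subpage compressed objects, and
	 * splitting the folio means splitting this as well. */
	struct swp_desc_folio {
		swp_entry_t vswap;             /* first virtual slot of the range */
		unsigned int nr_pages;         /* subpages covered */
		struct zswap_entry *entries[]; /* per-subpage zswap entries */
	};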

>
> >
> > struct swp_desc {
> >         swp_entry_t vswap;      /* the virtual swap slot (the "key") */
> >         union {                 /* current backing, selected by type */
> >                 swp_slot_t slot;                 /* physical slot on a swapfile */
> >                 struct folio *folio;             /* page (back) in memory */
> >                 struct zswap_entry *zswap_entry; /* compressed copy in zswap */
> >         };
> >         struct rcu_head rcu;
> >
> >         rwlock_t lock;          /* protects the backing info above */
> >         enum swap_type type;    /* which backend currently backs us */
> >
> > #ifdef CONFIG_MEMCG
> >         atomic_t memcgid;       /* owning memcg, for the swap cgroup mapping */
> > #endif
> >
> >         atomic_t in_swapcache;  /* single bit: SWAP_HAS_CACHE replacement */
> >         struct kref refcnt;     /* references to this descriptor */
> >         atomic_t swap_count;    /* references from page table entries */
> > };
>
> That seems a bit large. I am assuming this is for the purpose of the
> prototype and we can reduce its size eventually, right?

Yup. I copied and pasted this from the prototype. Originally I
squeezed all the state (in_swapcache and the swap type) into an
integer-typed "flags" field plus a separate swap count field, and
protected them all with a single rwlock. That got really
ugly/confusing, so for the sake of the prototype I separated them all
out into their own fields, and played with atomics to see if it's
possible to do things locklessly. So far so good (i.e., no crashes
yet), but the final form is TBD :) Maybe we can discuss in closer
detail once I send out this prototype as an RFC?

(I will say, though, that it looks cleaner when all these fields are
separated, so there's a tradeoff in that sense too.)

>
> Particularly, I remember looking into merging the swap_count and
> refcnt, and I am not sure what in_swapcache is (is this a bit? Why
> can't we use a bit from swap_count?).

Yup. That's a single bit - it's a (partial) replacement for
SWAP_HAS_CACHE state in the existing swap map.

No particular reason why we can't squeeze it into the swap count other
than clarity :) It does make working with swap count values a bit
annoying (with the cache bit in the low bit, bumping the count means
adding 2 instead of 1, etc.).
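To make that concrete, here's a minimal sketch of the packed layout
(macro and helper names made up for illustration):

	/* Bit 0 = "has swap cache"; the count lives in the upper bits,
	 * so taking one swap reference adds 2 to the packed word. */
	#define VSWAP_HAS_CACHE 1U
	#define VSWAP_COUNT_ONE 2U

	static inline unsigned int vswap_count(unsigned int packed)
	{
		return packed >> 1;
	}

	static inline bool vswap_in_swapcache(unsigned int packed)
	{
		return packed & VSWAP_HAS_CACHE;
	}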

>
> I also think we can shove the swap_type in the low bits of the
> pointers (with some finesse for swp_slot_t), and the locking could be
> made less granular (I remember exploring going completely lockless,
> but I don't remember how that turned out).

Ah nice, I did not think about that. There are 4 types, so we need at
least 2 bits for the type. Should be doable, but we'd need to
double-check that the physical (i.e., on-swapfile) swap slot handle
leaves those low bits free.
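Roughly along these lines, assuming the pointed-to objects are at
least 4-byte aligned (all names are placeholders):

	/* Steal the two low bits of the backing word for the type. Fine
	 * for folio/zswap_entry pointers; whether swp_slot_t leaves the
	 * low bits free is the part that needs double-checking. */
	#define VSWAP_TYPE_MASK 3UL

	static inline unsigned int vswap_type(unsigned long backing)
	{
		return backing & VSWAP_TYPE_MASK;
	}

	static inline void *vswap_backing_ptr(unsigned long backing)
	{
		return (void *)(backing & ~VSWAP_TYPE_MASK);
	}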

>
> >
> >
> > This design allows us to:
> > * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> >   simply associate the swap ID with one of the supported backends: a
> >   zswap entry, a zero-filled swap page, a slot on the swapfile, or a
> >   page in memory.
> > * Simplify and optimize swapoff: we only have to fault the page in and
> >   have the swap ID point to the page instead of the on-disk swap slot.
> >   No need to perform any page table walking :)
>
> It also allows us to delete the complex swap count continuation code.

Yep. FWIW, in the swap continuation case the complexity at least buys
a space optimization, whereas in the swapoff case we're limited by the
architecture and cannot really do better complexity- or
efficiency-wise, so I decided to highlight the swapoff simplification
first. But you're right, we can choose not to keep swap count
continuation in the new design (that's what I'm doing in the
prototype, at least).
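To make the swapoff contrast concrete, under the new design it could
boil down to roughly the following loop (a rough sketch; none of these
helpers exist under these names):

	/* Walk the descriptors backed by the device being disabled,
	 * fault each one's content into memory, and repoint the
	 * descriptor at the folio. PTEs still hold the virtual slot,
	 * so no page table walk is needed. */
	for_each_desc_on_device(desc, si) {
		struct folio *folio = vswap_read_folio(desc);

		write_lock(&desc->lock);
		desc->folio = folio;
		desc->type = VSWAP_FOLIO;
		write_unlock(&desc->lock);
	}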
