Message-ID: <CAJD7tkY+prnDUW3GRAgrVOPp27rCUtuEz8jCnr=cEkXndvqKCw@mail.gmail.com>
Date: Fri, 17 Jan 2025 08:51:41 -0800
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Nhat Pham <nphamcs@...il.com>
Cc: lsf-pc@...ts.linux-foundation.org, akpm@...ux-foundation.org,
hannes@...xchg.org, ryncsn@...il.com, chengming.zhou@...ux.dev,
chrisl@...nel.org, linux-mm@...ck.org, kernel-team@...a.com,
linux-kernel@...r.kernel.org, shakeel.butt@...ux.dev, hch@...radead.org,
hughd@...gle.com, 21cnbao@...il.com, usamaarif642@...il.com
Subject: Re: [LSF/MM/BPF TOPIC] Virtual Swap Space
On Thu, Jan 16, 2025 at 6:47 PM Nhat Pham <nphamcs@...il.com> wrote:
>
> On Fri, Jan 17, 2025 at 1:48 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
> >
> > On Thu, Jan 16, 2025 at 1:22 AM Nhat Pham <nphamcs@...il.com> wrote:
> > >
> > > My apologies if I missed any interested party in the cc list -
> > > hopefully the mailing list cc's suffice :)
> > >
> > > I'd like to (re-)propose the topic of swap abstraction layer for the
> > > conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> > > (see [1], [2], [3]).
> > >
> > > (AFAICT, the same idea has been floated by Rik van Riel since at
> > > least 2011 - see [8]).
> > >
> > > I have a working(-ish) prototype, which hopefully will be
> > > submission-ready soon. For now, I'd like to give the motivation/context
> > > for the topic, as well as some high level design:
> >
> > I would obviously be interested in attending this, albeit virtually if
> > possible. Just sharing some random thoughts below from my cold cache.
>
> Your inputs are always appreciated :)
>
> >
> > >
> > > I. Motivation
> > >
> > > Currently, when an anon page is swapped out, a slot in a backing swap
> > > device is allocated and stored in the page table entries that refer to
> > > the original page. This slot is also used as the "key" to find the
> > > swapped out content, as well as the index into swap data structures, such
> > > as the swap cache or the swap cgroup mapping. Tying a swap entry to its
> > > backing slot in this way is performant and efficient when swap is just
> > > disk space and swapoff is rare.
> > >
> > > However, the advent of many swap optimizations has exposed major
> > > drawbacks of this design. The first problem is that we occupy a physical
> > > slot in the swap space, even for pages that are NEVER expected to hit
> > > the disk: pages compressed and stored in the zswap pool, zero-filled
> > > pages, or pages rejected by both of these optimizations when zswap
> > > writeback is disabled. This is arguably the central shortcoming of
> > > zswap:
> > > * In deployments where no disk space can be afforded for swap (such as
> > > mobile and embedded devices), users cannot adopt zswap, and are forced
> > > to use zram. This is confusing for users, and creates an extra burden
> > > for developers, who have to develop and maintain similar features for
> > > two separate swap backends (writeback, cgroup charging, THP support,
> > > etc.). For instance, see the discussion in [4].
> > > * Resource-wise, it is hugely wasteful in terms of disk usage, and
> > > limits the memory saving potential of these optimizations by the
> > > static size of the swapfile, especially in high memory systems that
> > > can have up to terabytes worth of memory. It also creates significant
> > > challenges for users who rely on swap utilization as an early OOM
> > > signal.
> > >
> > > Another motivation for a swap redesign is to simplify swapoff, which
> > > is complicated and expensive in the current design. Tight coupling
> > > between a swap entry and its backing storage means that swapoff requires
> > > a full page table walk to update all the page table entries that refer to
> > > this swap entry, as well as updating all the associated swap data
> > > structures (swap cache, etc.).
> > >
> > >
> > > II. High Level Design Overview
> > >
> > > To fix the aforementioned issues, we need an abstraction that separates
> > > a swap entry from its physical backing storage. IOW, we need to
> > > “virtualize” the swap space: swap clients will work with a virtual swap
> > > slot (that is dynamically allocated on-demand), storing it in page
> > > table entries, and using it to index into various swap-related data
> > > structures.
> > >
> > > The backing storage is decoupled from this slot, and the newly
> > > introduced layer will “resolve” the ID to the actual storage, as well
> > > as cooperate with the swap cache to handle all the required
> > > synchronization. This layer also manages other metadata of the swap
> > > entry, such as its lifetime information (swap count), via a dynamically
> > > allocated per-entry swap descriptor:
> >
> > Do you plan to allocate one per-folio or per-page? I suppose it's
> > per-page based on the design, but I am wondering if you explored
> > having it per-folio. To make it work we'd need to support splitting a
> > swp_desc, and figure out which slot or zswap_entry corresponds to a
> > certain page in a folio.
>
> Per-page, for now. Per-folio requires allocating these swp_descs on
> huge page splitting etc., which is more complex.
We'd also need to allocate them during swapin. If a folio is swapped
out as a 16K chunk with a single swp_desc, and we then try to swap in
one 4K page in the middle, we may need to split the swp_desc in two.
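
Something like the below, just to illustrate what I mean -- a very rough
sketch with completely made-up names (vswap_split, swp_desc_cache,
swp_slot_advance, the SWP_TYPE_* values), not anything from the prototype:

/*
 * Illustrative only: carve out a new descriptor for the tail pages of a
 * hypothetical per-folio swp_desc, e.g. when a single 4K page in the
 * middle of a 16K swapped-out chunk is faulted back in.
 */
static struct swp_desc *vswap_split(struct swp_desc *desc,
                                    unsigned int offset)
{
        struct swp_desc *tail = kmem_cache_zalloc(swp_desc_cache, GFP_KERNEL);

        if (!tail)
                return ERR_PTR(-ENOMEM);

        write_lock(&desc->lock);
        tail->type = desc->type;
        if (desc->type == SWP_TYPE_SLOT)
                /* physical slots are contiguous, so this split is cheap */
                tail->slot = swp_slot_advance(desc->slot, offset);
        else if (desc->type == SWP_TYPE_ZSWAP)
                /* the chained zswap_entry's would have to be split here too */
                tail->zswap_entry = NULL;
        write_unlock(&desc->lock);

        return tail;
}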
>
> And yeah, we need to chain these zswap_entry's somehow. Not impossible
> certainly, but more overhead and more complexity :)
>
> >
> > >
> > > struct swp_desc {
> > >         swp_entry_t vswap;
> > >         union {
> > >                 swp_slot_t slot;
> > >                 struct folio *folio;
> > >                 struct zswap_entry *zswap_entry;
> > >         };
> > >         struct rcu_head rcu;
> > >
> > >         rwlock_t lock;
> > >         enum swap_type type;
> > >
> > > #ifdef CONFIG_MEMCG
> > >         atomic_t memcgid;
> > > #endif
> > >
> > >         atomic_t in_swapcache;
> > >         struct kref refcnt;
> > >         atomic_t swap_count;
> > > };
> >
> > That seems a bit large. I am assuming this is for the purpose of the
> > prototype and we can reduce its size eventually, right?
>
> Yup. I copied and pasted this from the prototype. Originally I
> squeezed all the state (in_swapcache and the swap type) into an
> integer-type "flag" field + 1 separate swap count field, and protected
> them all with a single rw lock. That got really ugly/confusing, so
> for the sake of the prototype I just separated them all out into their
> own fields, and played with atomicity to see if it's possible to do
> things locklessly. So far so good (i.e. no crashes yet), but the final
> form is TBD :) Maybe we can discuss in closer detail once I send out
> this prototype as an RFC?
Yeah, I just had some passing comments.
>
> (I will say though it looks cleaner when all these fields are
> separated. So it's going to be a tradeoff in that sense too).
It's a tradeoff but I think we should be able to hide a lot of the
complexity behind neat helpers. It's not pretty but I think the memory
overhead is an important factor here.
>
> >
> > In particular, I remember looking into merging the swap_count and
> > refcnt, and I am not sure what in_swapcache is (is this a bit? Why
> > can't we use a bit from swap_count?).
>
> Yup. That's a single bit - it's a (partial) replacement for
> SWAP_HAS_CACHE state in the existing swap map.
>
> No particular reason why we can't squeeze it into swap counts other
> than clarity :) It's going to be a bit annoying working with swap
> count values (swap count increment is now * 2 instead of ++ etc.).
Nothing a nice helper cannot hide :)
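
Something along these lines perhaps (made-up names, and assuming the cache
bit sits in the low bit of swap_count, so each real reference is worth 2):

/* Illustrative helpers only -- not from the prototype. */
#define VSWAP_HAS_CACHE         1
#define VSWAP_COUNT_UNIT        2

static inline int vswap_count(struct swp_desc *desc)
{
        return atomic_read(&desc->swap_count) >> 1;
}

static inline bool vswap_has_cache(struct swp_desc *desc)
{
        return atomic_read(&desc->swap_count) & VSWAP_HAS_CACHE;
}

static inline void vswap_count_inc(struct swp_desc *desc)
{
        atomic_add(VSWAP_COUNT_UNIT, &desc->swap_count);
}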