Message-ID: <20250514201729.48420-1-ryncsn@gmail.com>
Date: Thu, 15 May 2025 04:17:00 +0800
From: Kairui Song <ryncsn@...il.com>
To: linux-mm@...ck.org
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
Hugh Dickins <hughd@...gle.com>,
Chris Li <chrisl@...nel.org>,
David Hildenbrand <david@...hat.com>,
Yosry Ahmed <yosryahmed@...gle.com>,
"Huang, Ying" <ying.huang@...ux.alibaba.com>,
Nhat Pham <nphamcs@...il.com>,
Johannes Weiner <hannes@...xchg.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
Baoquan He <bhe@...hat.com>,
Barry Song <baohua@...nel.org>,
Kalesh Singh <kaleshsingh@...gle.com>,
Kemeng Shi <shikemeng@...weicloud.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Ryan Roberts <ryan.roberts@....com>,
linux-kernel@...r.kernel.org,
Kairui Song <kasong@...cent.com>
Subject: [PATCH 00/28] mm, swap: introduce swap table
From: Kairui Song <kasong@...cent.com>
This is the series that implements the Swap Table idea proposed in the
LSF/MM/BPF topic "Integrate swap cache, swap maps with swap allocator"
about one month ago [1].
With this series, the swap subsystem sees a ~20-30% performance gain,
from basic sequential swap to heavy workloads, for both 4K and mTHP
folios. Idle memory usage is already much lower, and average memory
consumption stays the same or will drop even further (with follow-up
work). This also enables many more future optimizations through
better-defined swap operations.
This series is stable and mergeable on both mm-unstable and mm-stable.
It's a long series, so it might be challenging to review, but it has
been holding up well under many stress tests.
You can also find the latest branch here:
https://github.com/ryncsn/linux/tree/kasong/devel/swap-table
With the swap table, a table entry will be the fundamental and the only
needed data structure for the swap cache, swap map, and swap cgroup map.
This reduces memory usage, improves performance, and provides more
flexibility and better abstraction.
/*
* Swap table entry type and bit layouts:
*
* NULL: | ------------ 0 -------------|
* Shadow: | SWAP_COUNT |---- SHADOW_VAL ---|1|
* Folio: | SWAP_COUNT |------ PFN -------|10|
* Pointer: |----------- Pointer ----------|100|
*/
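For illustration, a minimal standalone sketch of how such entries could
be decoded, assuming one machine word per slot (the type and helper
names below are hypothetical, not this series' actual API):

#include <stdbool.h>

typedef unsigned long swp_te_t;

/* NULL: all bits zero. */
static inline bool swp_te_is_null(swp_te_t te)
{
	return te == 0;
}

/* Shadow: lowest bit is 1; SHADOW_VAL and SWAP_COUNT sit above it. */
static inline bool swp_te_is_shadow(swp_te_t te)
{
	return te & 0x1;
}

/* Folio: lowest bits are 10; PFN and SWAP_COUNT sit above them. */
static inline bool swp_te_is_folio(swp_te_t te)
{
	return (te & 0x3) == 0x2;
}

/* Pointer: lowest bits are 100; the pointer value sits above them. */
static inline bool swp_te_is_pointer(swp_te_t te)
{
	return (te & 0x7) == 0x4;
}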
This series contains many cleanups and refactors made necessary by
historical issues in the SWAP subsystem, e.g. the fuzzy workflow and
definition of how swap entries are handled, and a lot of corner issues
as mentioned in the LSF/MM/BPF talk.
There could be a temporary increase of complexity or memory consumption
in the middle of this series, but the end result is much simplified and
sanitized. These patches depend on each other due to the current
complex swap design, which is why this is one long series.
This series cleans up most of the issues and improves the situation in
the following order:
- Simplification and optimizations (Patch 1 - 3)
- Tidy up swap info and cache lookup (Patch 4 - 6)
- Introduce the basic swap table infrastructure (Patch 7 - 8)
- Remove swap cache bypassing with SWP_SYNCHRONOUS_IO, enabling mTHP
  for more workloads (Patch 9 - 14).
- Simplify swap-in synchronization with the swap cache, eliminating
  long-tail latency issues and improving performance; swap-in can be
  synchronized with the folio lock now (Patch 15 - 16, see the sketch
  after this list).
- Make most swap operations folio based. We can now use folio-based
  helpers that ensure the swap entries are stable under the folio lock,
  which also makes more optimizations and sanity checks doable.
  (Patch 17 - 18)
- Remove SWAP_HAS_CACHE. (Patch 19 - 22)
- Completely rework swap counting using the swap table, and remove
  COUNT_CONTINUED (Patch 23 - 27).
- Dynamic reclaim and allocation for the swap table (Patch 28)
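As a rough sketch of the folio-lock-based synchronization mentioned
above (Patch 15 - 16): a racing swap-in can serialize on the folio lock
instead of spinning on SWAP_HAS_CACHE. The helper names and signatures
below are assumptions for illustration, not the exact code from this
series:

static struct folio *swapin_sync_sketch(swp_entry_t entry)
{
	struct folio *folio;

	/* Look up the swap cache; may return NULL if not cached. */
	folio = swap_cache_get_folio(entry);
	if (!folio)
		return NULL;	/* caller allocates and reads the page */

	folio_lock(folio);
	/*
	 * With the folio locked, its swap entries are stable; recheck
	 * that the folio still backs @entry before using it.
	 */
	if (!folio_contains_swap(folio, entry)) {
		folio_unlock(folio);
		folio_put(folio);
		return NULL;
	}
	return folio;	/* returned locked, entry is stable */
}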
And the performance is looking great too:
vm-scalability usemem shows a great improvement:
Test using: usemem --init-time -O -y -x -n 31 1G (1G memcg, pmem as swap)
               Before:          After:
System time:   217.39s          161.59s        (-25.67%)
Throughput:    3933.58 MB/s     4975.55 MB/s   (+26.48%)
(Similar results with random usemem -R)
Build kernel with defconfig on tmpfs with ZRAM:
The results below show a test matrix using different memory cgroup
limits and job numbers.
make -j<NR>| Total Sys Time (seconds) | Total Time (seconds)
(NR / Mem )| (Before / After / Delta) | (Before / After / Delta)
With 4k pages only:
6 / 192M | 5327 / 3915 / -26.5% | 1427 / 1141 / -20.0%
12 / 256M | 5373 / 4009 / -25.3% | 743 / 606 / -18.4%
24 / 384M | 6149 / 4523 / -26.4% | 438 / 353 / -19.4%
48 / 768M | 7285 / 4521 / -37.9% | 251 / 190 / -24.3%
With 64k mTHP:
24 / 512M | 4399 / 3328 / -24.3% | 345 / 289 / -16.2%
48 / 1G | 5072 / 3406 / -32.8% | 187 / 150 / -19.7%
Memory usage is also reduced. Although this series hasn't removed the
swap cgroup array yet, the peak usage per swap entry is already reduced
from 12 bytes to 10 bytes. And since the swap table is dynamically
allocated, idle memory usage will be reduced by a lot.
Some other highlights and notes:
1. This series introduces a set of helpers, "folio_alloc_swap",
   "folio_dup_swap", "folio_put_swap" and "folio_free_swap*", to make
   most swap operations folio based; this should bring a clean boundary
   between swap and the rest of mm (a usage sketch follows after this
   list). It also splits hibernation swap entry allocation out of the
   ordinary swap operations.
2. This series enables mTHP swap-in and readahead skipping for more
   workloads by removing the swap cache bypassing path:
   We currently do mTHP swap-in and readahead bypass only for
   SWP_SYNCHRONOUS_IO devices, and only when the swap count of all
   related entries equals one. This makes no sense: readahead and mTHP
   behaviour should have nothing to do with the swap count. It's only a
   defect of the current design, which couples them with swap cache
   bypassing. This series removes that limitation while showing a major
   performance improvement, which should also reduce mTHP
   fragmentation.
3. By removing the old swap cache design, all swap cache is now
   protected by fine-grained cluster locks. This also removes the
   cluster shuffle algorithm, which should improve performance for SWAP
   on HDD too (fixing [4]), and gets rid of the multiple swap
   address_space instances design.
4. I dropped some doable future optimizations for now. E.g. the
   folio-based helpers will be an essential part of dropping the swap
   cgroup control map, which will improve performance and reduce memory
   usage even more; that can be done later. More folio-batched
   operations could also be built on this. So this series is not in its
   best possible shape, but it already looks good enough.
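For highlight 1 above, a rough usage sketch of the folio-based helpers
over the swap-out/swap-in life cycle (the signatures and return
conventions here are assumptions, not the final API):

static int swap_out_sketch(struct folio *folio)
{
	/*
	 * Allocate entries for the whole folio and insert it into the
	 * swap cache; with the swap table this is a single step.
	 */
	if (folio_alloc_swap(folio))
		return -ENOMEM;

	/* Each PTE converted to a swap entry takes one swap count. */
	folio_dup_swap(folio);
	return 0;
}

static void swap_in_finish_sketch(struct folio *folio)
{
	/* Drop the count taken at unmap time once mapped back in... */
	folio_put_swap(folio);

	/* ...and free the entries entirely if no one else holds them. */
	folio_free_swap(folio);
}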
Future work items:
1. More tests, and maybe some of the patches need to be split into
smaller ones or need a few preparation series.
2. Integrate with Nhat Pham's Virtual swap space [2]. While this
   series improves performance and provides a sanitized workflow for
   SWAP, nothing changed feature wise. The swap table idea is supposed
   to be able to handle things like a virtual device in a cleaner way,
   with both lower overhead and better flexibility; more work is needed
   to figure out a way to implement it.
3. Some helpers from this series could be very helpful for future
   work. E.g. with the folio-based swap helpers, locking a folio now
   stabilizes its swap entries, which could also be used to stabilize
   the underlying swap device's entries if a virtual device design is
   implemented, hence simplifying the locking design. More entry types
   could also be added for things like the zero map or shmem.
4. The unified swap-in path already enables mTHP swap-in for entries
   with swap count > 1. This also makes unifying the readahead of
   shmem / anon doable now (as demonstrated a year ago [3]; that work
   conflicted with the standalone mTHP swap-in path, but now it's
   unified). We could also implement a readahead-based mTHP swap-in on
   top of this. This needs more discussion.
Link: https://lore.kernel.org/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com [1]
Link: https://lore.kernel.org/lkml/20250407234223.1059191-1-nphamcs@gmail.com/ [2]
Link: https://lore.kernel.org/all/20240129175423.1987-1-ryncsn@gmail.com/ [3]
Link: https://lore.kernel.org/linux-mm/202504241621.f27743ec-lkp@intel.com/ [4]
Kairui Song (27):
mm, swap: don't scan every fragment cluster
mm, swap: consolidate the helper for mincore
mm, swap: split readahead update out of swap cache lookup
mm, swap: sanitize swap cache lookup convention
mm, swap: rearrange swap cluster definition and helpers
mm, swap: tidy up swap device and cluster info helpers
mm, swap: use swap table for the swap cache and switch API
mm/swap: rename __read_swap_cache_async to __swapin_cache_alloc
mm, swap: add a swap helper for bypassing only read ahead
mm, swap: clean up and consolidate helper for mTHP swapin check
mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO
mm/shmem, swap: avoid redundant Xarray lookup during swapin
mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
mm, swap: split locked entry freeing into a standalone helper
mm, swap: use swap cache as the swap in synchronize layer
mm, swap: sanitize swap entry management workflow
mm, swap: rename and introduce folio_free_swap_cache
mm, swap: clean up and improve swap entries batch freeing
mm, swap: check swap table directly for checking cache
mm, swap: add folio to swap cache directly on allocation
mm, swap: drop the SWAP_HAS_CACHE flag
mm, swap: remove no longer needed _swap_info_get
mm, swap: implement helpers for reserving data in swap table
mm/workingset: leave highest 8 bits empty for anon shadow
mm, swap: minor clean up for swapon
mm, swap: use swap table to track swap count
mm, swap: implement dynamic allocation of swap table
Nhat Pham (1):
mm/shmem, swap: remove SWAP_MAP_SHMEM
arch/s390/mm/pgtable.c | 2 +-
include/linux/swap.h | 119 +--
kernel/power/swap.c | 8 +-
mm/filemap.c | 20 +-
mm/huge_memory.c | 20 +-
mm/madvise.c | 2 +-
mm/memory-failure.c | 2 +-
mm/memory.c | 384 ++++-----
mm/migrate.c | 28 +-
mm/mincore.c | 49 +-
mm/page_io.c | 12 +-
mm/rmap.c | 7 +-
mm/shmem.c | 204 ++---
mm/swap.h | 316 ++++++--
mm/swap_state.c | 646 ++++++++-------
mm/swap_table.h | 231 ++++++
mm/swapfile.c | 1708 +++++++++++++++++-----------------------
mm/userfaultfd.c | 9 +-
mm/vmscan.c | 22 +-
mm/workingset.c | 39 +-
mm/zswap.c | 13 +-
21 files changed, 1981 insertions(+), 1860 deletions(-)
create mode 100644 mm/swap_table.h
--
2.49.0