Message-ID: <20250514201729.48420-1-ryncsn@gmail.com>
Date: Thu, 15 May 2025 04:17:00 +0800
From: Kairui Song <ryncsn@...il.com>
To: linux-mm@...ck.org
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
Hugh Dickins <hughd@...gle.com>,
Chris Li <chrisl@...nel.org>,
David Hildenbrand <david@...hat.com>,
Yosry Ahmed <yosryahmed@...gle.com>,
"Huang, Ying" <ying.huang@...ux.alibaba.com>,
Nhat Pham <nphamcs@...il.com>,
Johannes Weiner <hannes@...xchg.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
Baoquan He <bhe@...hat.com>,
Barry Song <baohua@...nel.org>,
Kalesh Singh <kaleshsingh@...gle.com>,
Kemeng Shi <shikemeng@...weicloud.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Ryan Roberts <ryan.roberts@....com>,
linux-kernel@...r.kernel.org,
Kairui Song <kasong@...cent.com>
Subject: [PATCH 00/28] mm, swap: introduce swap table
From: Kairui Song <kasong@...cent.com>
This is the series that implements the Swap Table idea proposed in the
LSF/MM/BPF topic "Integrate swap cache, swap maps with swap allocator"
about one month ago [1].
With this series, the swap subsystem sees a ~20-30% performance gain,
from basic sequential swap to heavy workloads, for both 4K and mTHP
folios. Idle memory usage is already much lower, and average memory
consumption stays the same or will drop even further (with follow-up
work). This also enables many more future optimizations through
better-defined swap operations.
This series is stable and mergeable on both mm-unstable and mm-stable.
It's a long series, so it might be challenging to review, but it has
been holding up well under many stress tests.
You can also find the latest branch here:
https://github.com/ryncsn/linux/tree/kasong/devel/swap-table
With the swap table, a table entry will be the fundamental and the only
needed data structure for the swap cache, swap map, and swap cgroup map.
This reduces memory usage, improves performance, and provides more
flexibility and better abstraction.
/*
* Swap table entry type and bit layouts:
*
* NULL: | ------------ 0 -------------|
* Shadow: | SWAP_COUNT |---- SHADOW_VAL ---|1|
* Folio: | SWAP_COUNT |------ PFN -------|10|
* Pointer: |----------- Pointer ----------|100|
*/
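For illustration, a minimal standalone sketch of how such entries could
be decoded, assuming one machine word per slot (the type and helper
names below are hypothetical, not this series' actual API):

#include <stdbool.h>

typedef unsigned long swp_te_t;

/* NULL: all bits zero. */
static inline bool swp_te_is_null(swp_te_t te)
{
	return te == 0;
}

/* Shadow: lowest bit is 1; SHADOW_VAL and SWAP_COUNT sit above it. */
static inline bool swp_te_is_shadow(swp_te_t te)
{
	return te & 0x1;
}

/* Folio: lowest bits are 10; PFN and SWAP_COUNT sit above them. */
static inline bool swp_te_is_folio(swp_te_t te)
{
	return (te & 0x3) == 0x2;
}

/* Pointer: lowest bits are 100; the pointer value sits above them. */
static inline bool swp_te_is_pointer(swp_te_t te)
{
	return (te & 0x7) == 0x4;
}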
This series contains many cleanups and refactors made necessary by
historical issues in the SWAP subsystem, e.g. the fuzzy workflow and
definition of how swap entries are handled, and a lot of corner issues
as mentioned in the LSF/MM/BPF talk.
There could be a temporary increase of complexity or memory consumption
in the middle of this series, but the end result is much simplified and
sanitized. These patches depend on each other due to the current
complex swap design, which is why this is one long series.
This series cleans up most of the issues and improves the situation in
the following order:
- Simplification and optimizations (Patch 1 - 3)
- Tidy up swap info and cache lookup (Patch 4 - 6)
- Introduce the basic swap table infrastructure (Patch 7 - 8)
- Remove swap cache bypassing with SWP_SYNCHRONOUS_IO, enabling mTHP
  for more workloads (Patch 9 - 14).
- Simplify swap-in synchronization with the swap cache, eliminating
  long-tail latency issues and improving performance; swap-in can be
  synchronized with the folio lock now (Patch 15 - 16, see the sketch
  after this list).
- Make most swap operations folio based. We can now use folio-based
  helpers that ensure the swap entries are stable under the folio lock,
  which also makes more optimizations and sanity checks doable.
  (Patch 17 - 18)
- Remove SWAP_HAS_CACHE. (Patch 19 - 22)
- Completely rework swap counting using the swap table, and remove
  COUNT_CONTINUED (Patch 23 - 27).
- Dynamic reclaim and allocation for the swap table (Patch 28)
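As a rough sketch of the folio-lock-based synchronization mentioned
above (Patch 15 - 16): a racing swap-in can serialize on the folio lock
instead of spinning on SWAP_HAS_CACHE. The helper names and signatures
below are assumptions for illustration, not the exact code from this
series:

static struct folio *swapin_sync_sketch(swp_entry_t entry)
{
	struct folio *folio;

	/* Look up the swap cache; may return NULL if not cached. */
	folio = swap_cache_get_folio(entry);
	if (!folio)
		return NULL;	/* caller allocates and reads the page */

	folio_lock(folio);
	/*
	 * With the folio locked, its swap entries are stable; recheck
	 * that the folio still backs @entry before using it.
	 */
	if (!folio_contains_swap(folio, entry)) {
		folio_unlock(folio);
		folio_put(folio);
		return NULL;
	}
	return folio;	/* returned locked, entry is stable */
}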
And the performance is looking great too:
vm-scalability usemem shows a great improvement:
Test using: usemem --init-time -O -y -x -n 31 1G (1G memcg, pmem as swap)
               Before:          After:
System time:   217.39s          161.59s        (-25.67%)
Throughput:    3933.58 MB/s     4975.55 MB/s   (+26.48%)
(Similar results with random usemem -R)
Build kernel with defconfig on tmpfs with ZRAM:
The results below show a test matrix using different memory cgroup
limits and job numbers.
make -j<NR>| Total Sys Time (seconds) | Total Time (seconds)
(NR / Mem )| (Before / After / Delta) | (Before / After / Delta)
With 4k pages only:
6 / 192M | 5327 / 3915 / -26.5% | 1427 / 1141 / -20.0%
12 / 256M | 5373 / 4009 / -25.3% | 743 / 606 / -18.4%
24 / 384M | 6149 / 4523 / -26.4% | 438 / 353 / -19.4%
48 / 768M | 7285 / 4521 / -37.9% | 251 / 190 / -24.3%
With 64k mTHP:
24 / 512M | 4399 / 3328 / -24.3% | 345 / 289 / -16.2%
48 / 1G | 5072 / 3406 / -32.8% | 187 / 150 / -19.7%
Memory usage is also reduced. Although this series hasn't removed the
swap cgroup array yet, the peak usage per swap entry is already reduced
from 12 bytes to 10 bytes. And since the swap table is dynamically
allocated, idle memory usage will be reduced by a lot.
Some other highlights and notes:
1. This series introduces a set of helpers, "folio_alloc_swap",
   "folio_dup_swap", "folio_put_swap" and "folio_free_swap*", to make
   most swap operations folio based; this should bring a clean boundary
   between swap and the rest of mm (a usage sketch follows after this
   list). It also splits hibernation swap entry allocation out of the
   ordinary swap operations.
2. This series enables mTHP swap-in and readahead skipping for more
   workloads by removing the swap cache bypassing path:
   We currently do mTHP swap-in and readahead bypass only for
   SWP_SYNCHRONOUS_IO devices, and only when the swap count of all
   related entries equals one. This makes no sense: readahead and mTHP
   behaviour should have nothing to do with the swap count. It's only a
   defect of the current design, which couples them with swap cache
   bypassing. This series removes that limitation while showing a major
   performance improvement, which should also reduce mTHP
   fragmentation.
3. By removing the old swap cache design, all swap cache is now
   protected by fine-grained cluster locks. This also removes the
   cluster shuffle algorithm, which should improve performance for SWAP
   on HDD too (fixing [4]), and gets rid of the multiple swap
   address_space instances design.
4. I dropped some doable future optimizations for now. E.g. the
   folio-based helpers will be an essential part of dropping the swap
   cgroup control map, which will improve performance and reduce memory
   usage even more; that can be done later. More folio-batched
   operations could also be built on this. So this series is not in its
   best possible shape, but it already looks good enough.
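For highlight 1 above, a rough usage sketch of the folio-based helpers
over the swap-out/swap-in life cycle (the signatures and return
conventions here are assumptions, not the final API):

static int swap_out_sketch(struct folio *folio)
{
	/*
	 * Allocate entries for the whole folio and insert it into the
	 * swap cache; with the swap table this is a single step.
	 */
	if (folio_alloc_swap(folio))
		return -ENOMEM;

	/* Each PTE converted to a swap entry takes one swap count. */
	folio_dup_swap(folio);
	return 0;
}

static void swap_in_finish_sketch(struct folio *folio)
{
	/* Drop the count taken at unmap time once mapped back in... */
	folio_put_swap(folio);

	/* ...and free the entries entirely if no one else holds them. */
	folio_free_swap(folio);
}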
Future work items:
1. More tests, and maybe some of the patches need to be split into
smaller ones or need a few preparation series.
2. Integrate with Nhat Pham's Virtual swap space [2]. While this
   series improves performance and provides a sanitized workflow for
   SWAP, nothing changed feature wise. The swap table idea is supposed
   to be able to handle things like a virtual device in a cleaner way,
   with both lower overhead and better flexibility; more work is needed
   to figure out a way to implement it.
3. Some helpers from this series could be very helpful for future
   work. E.g. with the folio-based swap helpers, locking a folio now
   stabilizes its swap entries, which could also be used to stabilize
   the underlying swap device's entries if a virtual device design is
   implemented, hence simplifying the locking design. More entry types
   could also be added for things like the zero map or shmem.
4. The unified swap-in path already enables mTHP swap-in for entries
   with swap count > 1. This also makes unifying the readahead of
   shmem / anon doable now (as demonstrated a year ago [3]; that work
   conflicted with the standalone mTHP swap-in path, but now it's
   unified). We could also implement a readahead-based mTHP swap-in on
   top of this. This needs more discussion.
Link: https://lore.kernel.org/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com [1]
Link: https://lore.kernel.org/lkml/20250407234223.1059191-1-nphamcs@gmail.com/ [2]
Link: https://lore.kernel.org/all/20240129175423.1987-1-ryncsn@gmail.com/ [3]
Link: https://lore.kernel.org/linux-mm/202504241621.f27743ec-lkp@intel.com/ [4]
Kairui Song (27):
mm, swap: don't scan every fragment cluster
mm, swap: consolidate the helper for mincore
mm, swap: split readahead update out of swap cache lookup
mm, swap: sanitize swap cache lookup convention
mm, swap: rearrange swap cluster definition and helpers
mm, swap: tidy up swap device and cluster info helpers
mm, swap: use swap table for the swap cache and switch API
mm/swap: rename __read_swap_cache_async to __swapin_cache_alloc
mm, swap: add a swap helper for bypassing only read ahead
mm, swap: clean up and consolidate helper for mTHP swapin check
mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO
mm/shmem, swap: avoid redundant Xarray lookup during swapin
mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
mm, swap: split locked entry freeing into a standalone helper
mm, swap: use swap cache as the swap in synchronize layer
mm, swap: sanitize swap entry management workflow
mm, swap: rename and introduce folio_free_swap_cache
mm, swap: clean up and improve swap entries batch freeing
mm, swap: check swap table directly for checking cache
mm, swap: add folio to swap cache directly on allocation
mm, swap: drop the SWAP_HAS_CACHE flag
mm, swap: remove no longer needed _swap_info_get
mm, swap: implement helpers for reserving data in swap table
mm/workingset: leave highest 8 bits empty for anon shadow
mm, swap: minor clean up for swapon
mm, swap: use swap table to track swap count
mm, swap: implement dynamic allocation of swap table
Nhat Pham (1):
mm/shmem, swap: remove SWAP_MAP_SHMEM
arch/s390/mm/pgtable.c | 2 +-
include/linux/swap.h | 119 +--
kernel/power/swap.c | 8 +-
mm/filemap.c | 20 +-
mm/huge_memory.c | 20 +-
mm/madvise.c | 2 +-
mm/memory-failure.c | 2 +-
mm/memory.c | 384 ++++-----
mm/migrate.c | 28 +-
mm/mincore.c | 49 +-
mm/page_io.c | 12 +-
mm/rmap.c | 7 +-
mm/shmem.c | 204 ++---
mm/swap.h | 316 ++++++--
mm/swap_state.c | 646 ++++++++-------
mm/swap_table.h | 231 ++++++
mm/swapfile.c | 1708 +++++++++++++++++-----------------------
mm/userfaultfd.c | 9 +-
mm/vmscan.c | 22 +-
mm/workingset.c | 39 +-
mm/zswap.c | 13 +-
21 files changed, 1981 insertions(+), 1860 deletions(-)
create mode 100644 mm/swap_table.h
--
2.49.0