Message-Id: <20260131125454.3187546-1-youngjun.park@lge.com>
Date: Sat, 31 Jan 2026 21:54:49 +0900
From: Youngjun Park <youngjun.park@....com>
To: akpm@...ux-foundation.org
Cc: chrisl@...nel.org,
kasong@...cent.com,
hannes@...xchg.org,
mhocko@...nel.org,
roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev,
muchun.song@...ux.dev,
shikemeng@...weicloud.com,
nphamcs@...il.com,
bhe@...hat.com,
baohua@...nel.org,
cgroups@...r.kernel.org,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
gunho.lee@....com,
youngjun.park@....com,
taejoon.song@....com
Subject: [RFC PATCH v3 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
This is the third version of the RFC for the "Swap Tiers" concept,
incorporating LPC 2025 feedback and subsequent bug fixes.
Previous approach: https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
RFC v2: https://lore.kernel.org/linux-mm/20260126065242.1221862-1-youngjun.park@lge.com/
RFC v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
v3 fixes bugs found during testing and adds clarifications to
improve patch reviewability.
Overview (Recap)
================
Swap Tiers enable cgroup-based swap device assignment by grouping swap
devices into named tiers. This allows faster devices (e.g., SSD) to be
dedicated to latency-sensitive workloads while slower devices (e.g., HDD,
network) serve background tasks. The concept was suggested by Chris Li.
Key Changes after LPC 2025 (RFC v1)
===================================
The most significant change in v2 was adopting strict cgroup hierarchy
semantics based on LPC 2025 feedback.
v1 allowed children to explicitly select tiers ("+tier") regardless of
parent configuration, violating standard cgroup principles.
v2 enforces proper hierarchy: a child's configuration is always a subset
of its parent's. All tiers are enabled by default; use "-tier" to exclude one.
Example:
  Global: SSD, HDD, NET
  Parent: -HDD → uses SSD, NET
  Child:  -SSD → uses NET (intersection)
  If SSD is deleted: Child uses NET (exclusions reset)
  If NEW is added:   All cgroups use it by default
This ensures children cannot access resources denied by ancestors,
matching standard cgroup behavior.
For the detailed rationale, see the v2 RFC and the LPC presentation.
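The hierarchy rule above can be modelled in a few lines. This is only a sketch of the semantics as described in this cover letter, written in Python for brevity; the function names and the use of name sets are assumptions for illustration and do not reflect the kernel implementation, which tracks a tier_mask.

```python
# Illustrative model only: function and variable names are hypothetical
# and do not come from the patches.

def effective_tiers(global_tiers, exclusions_root_to_leaf):
    """Return the set of tiers a cgroup may use.

    global_tiers: tier names currently defined system-wide.
    exclusions_root_to_leaf: one set of "-tier" exclusions per cgroup,
    ordered from the topmost ancestor down to the cgroup itself.
    """
    allowed = set(global_tiers)      # default: every tier is enabled
    for excluded in exclusions_root_to_leaf:
        allowed -= excluded          # each level can only narrow the set
    return allowed


def delete_tier(global_tiers, exclusion_sets, tier):
    """Remove a tier globally and drop any exclusion that referred to it."""
    global_tiers.discard(tier)
    for excluded in exclusion_sets:
        excluded.discard(tier)


global_tiers = {"SSD", "HDD", "NET"}
parent_excl = {"HDD"}
child_excl = {"SSD"}

parent_view = effective_tiers(global_tiers, [parent_excl])             # SSD, NET
child_view = effective_tiers(global_tiers, [parent_excl, child_excl])  # NET

# Deleting SSD drops the child's now-meaningless exclusion.
delete_tier(global_tiers, [parent_excl, child_excl], "SSD")
child_after_delete = effective_tiers(global_tiers, [parent_excl, child_excl])

# A newly added tier is visible everywhere because nobody excludes it yet.
global_tiers.add("NEW")
child_after_add = effective_tiers(global_tiers, [parent_excl, child_excl])
```

Storing exclusions rather than selections is what makes a new tier usable by every cgroup by default in this model: nothing has to be updated when a tier is added.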
Changes in RFC v3
=================
- Fixed swap_alloc_fast() tier eligibility check
- Fixed tier_mask restoration on error paths
- Fixed priority -1 tier deletion bug
- Fixed !CONFIG_MEMCG build failures
- Improved commit messages
- Fixed improper error handling
- Fixed coding style violations
- Fixed tier deletion propagation to cgroups
Changes in RFC v2
=================
- Strict cgroup hierarchy compliance (LPC 2025 feedback)
- Percpu swap device cache to preserve fastpath performance (Kairui Song, Baoquan He)
- Simplified tier structure (Chris Li)
- Removed explicit "+" selection; default is all tiers, use "-" to exclude (Chris Li)
- Removed CONFIG_SWAP_TIER; now base kernel feature (Chris Li)
- Effective tier calculation moved to configuration time (swap.tiers write)
- Mixed operation support for "+" and "-" in /sys/kernel/mm/swap/tiers (Chris Li)
- Commit reorganization for clarity (Chris Li)
- Added tier priority modification support
- Added documentation for swap tiers concept and usage (Chris Li)
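As a rough illustration of what moving the effective tier calculation to configuration time can mean, the sketch below recomputes and caches a per-cgroup mask whenever exclusions are written, so that a lookup on the allocation path is a single bit test. It is a Python model under stated assumptions: the class, method names, and bit assignments are hypothetical and are not the interface or data structures used by the patches.

```python
# Hypothetical model: tier bits, class, and method names are assumptions.
SSD, HDD, NET = 1 << 0, 1 << 1, 1 << 2
ALL_TIERS = SSD | HDD | NET


class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.excluded = 0  # tiers this cgroup itself excludes
        # Cached result, initially inherited unchanged from the parent.
        self.effective = parent.effective if parent else ALL_TIERS
        if parent:
            parent.children.append(self)

    def write_exclusions(self, excluded):
        """Configuration-time path: store the mask, then refresh the subtree."""
        self.excluded = excluded
        self._recompute()

    def _recompute(self):
        inherited = self.parent.effective if self.parent else ALL_TIERS
        self.effective = inherited & ~self.excluded
        for child in self.children:
            child._recompute()  # descendants can never exceed their ancestors

    def may_use(self, tier_bit):
        """Allocation-time path: one test against the cached mask."""
        return bool(self.effective & tier_bit)


root = Cgroup()
parent = Cgroup(root)
child = Cgroup(parent)

child.write_exclusions(SSD)
parent.write_exclusions(HDD)  # propagates down to the child as well
```

The trade-off in this model is that writes walk the subtree while reads stay constant-time, which suits a setting where configuration changes are rare and allocations are frequent.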
Real-world Results
==================
App preloading on our internal platform, using NBD as a separate tier.
(This is our first real-world use case. We plan to refine and expand this usage.)
Without a separate swap tier:
- The default flash swap cannot be selectively avoided, so flash wear and
  lifespan issues cannot be reduced.
- NBD cannot be selectively assigned to the specific apps that need it.
Result (cold launch vs. preloaded):
- Streaming App A: 13.17s → 4.18s (68% faster)
- Streaming App B: 5.60s → 1.12s (80% faster)
- E-commerce App C: 10.25s → 2.00s (80% faster)
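The percentages above are reductions in launch time relative to the cold launch. A small Python check of that arithmetic, using only the timings listed here (the helper name is made up for illustration):

```python
def launch_time_reduction(cold_s, preloaded_s):
    """Percentage by which the preloaded launch is shorter than the cold one."""
    return (cold_s - preloaded_s) / cold_s * 100


# (cold launch, preloaded launch) in seconds, as reported above
results = {
    "Streaming App A": (13.17, 4.18),
    "Streaming App B": (5.60, 1.12),
    "E-commerce App C": (10.25, 2.00),
}

for app, (cold, preloaded) in results.items():
    reduction = launch_time_reduction(cold, preloaded)
    ratio = cold / preloaded
    print(f"{app}: {reduction:.0f}% shorter launch, {ratio:.2f}x ratio")
```

Read as a ratio, the same numbers mean the preloaded launch is roughly 3x to 5x quicker than the cold launch.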
Performance validation against baseline (no tiers configured) shows
negligible overhead (<1%) in kernel build and vm-scalability benchmarks.
Detailed results are in the v2 cover letter.
Any feedback welcome.
Youngjun Park
Youngjun Park (5):
  mm: swap: introduce swap tier infrastructure
  mm: swap: associate swap devices with tiers
  mm: memcontrol: add interface for swap tier selection
  mm, swap: change back to use each swap device's percpu cluster
  mm, swap: introduce percpu swap device cache to avoid fragmentation
 Documentation/admin-guide/cgroup-v2.rst |  27 ++
 Documentation/mm/swap-tier.rst          | 109 ++++++
 MAINTAINERS                             |   2 +
 include/linux/memcontrol.h              |   3 +-
 include/linux/swap.h                    |  17 +-
 mm/Makefile                             |   2 +-
 mm/memcontrol.c                         |  85 +++++
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  72 ++++
 mm/swap_tier.c                          | 469 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  84 +++++
 mm/swapfile.c                           | 133 +++----
 12 files changed, 938 insertions(+), 69 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h
base-commit: 5a3704ed2dce0b54a7f038b765bb752b87ee8cc2
--
2.34.1