Message-Id: <20260131125454.3187546-1-youngjun.park@lge.com>
Date: Sat, 31 Jan 2026 21:54:49 +0900
From: Youngjun Park <youngjun.park@....com>
To: akpm@...ux-foundation.org
Cc: chrisl@...nel.org,
	kasong@...cent.com,
	hannes@...xchg.org,
	mhocko@...nel.org,
	roman.gushchin@...ux.dev,
	shakeel.butt@...ux.dev,
	muchun.song@...ux.dev,
	shikemeng@...weicloud.com,
	nphamcs@...il.com,
	bhe@...hat.com,
	baohua@...nel.org,
	cgroups@...r.kernel.org,
	linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	gunho.lee@....com,
	youngjun.park@....com,
	taejoon.song@....com
Subject: [RFC PATCH v3 0/5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control

This is the third version of the RFC for the "Swap Tiers" concept,
incorporating LPC 2025 feedback and subsequent bug fixes.

Previous approach: https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
RFC v2: https://lore.kernel.org/linux-mm/20260126065242.1221862-1-youngjun.park@lge.com/
RFC v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/

v3 fixes bugs found during testing and adds clarifications to make the
patches easier to review.

Overview (Recap)
================
Swap Tiers enable cgroup-based swap device assignment by grouping swap
devices into named tiers. This allows faster devices (e.g., SSD) to be
dedicated to latency-sensitive workloads while slower devices (e.g., HDD,
network) serve background tasks. The concept was suggested by Chris Li.

Key Changes after LPC 2025 (RFC v1)
===================================
The most significant change in v2 was adopting strict cgroup hierarchy
semantics based on LPC 2025 feedback. 

v1 allowed children to explicitly select tiers ("+tier") regardless of
parent configuration, violating standard cgroup principles.

v2 enforces proper hierarchy: a child's effective tier set is always a
subset of its parent's. By default all tiers are enabled; use "-tier" to
exclude one.

Example:
  Global: SSD, HDD, NET
  Parent: -HDD → uses SSD, NET
  Child: -SSD → uses NET (intersection)

  If SSD is deleted: Child uses NET (exclusions reset)
  If NEW is added: All cgroups use it by default

This ensures children cannot access resources denied by ancestors,
matching standard cgroup behavior.
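
The subset rule above can be illustrated with a small model. This is only a
sketch in Python, not the kernel implementation: the `Cgroup` class, its
`excluded` field, and `effective_tiers()` are hypothetical names chosen for
illustration. It also recomputes the effective set on demand, whereas the
series computes it at configuration time (on swap.tiers write).

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Cgroup:
    """Hypothetical model of a cgroup that stores only its own exclusions."""
    name: str
    parent: Optional["Cgroup"] = None
    excluded: Set[str] = field(default_factory=set)  # tiers dropped via "-tier"

    def effective_tiers(self, global_tiers: Set[str]) -> Set[str]:
        # Start from the parent's effective set (all tiers at the root),
        # then remove this cgroup's own exclusions. The set can only shrink
        # on the way down, so a child never regains a tier an ancestor denied.
        if self.parent is not None:
            base = self.parent.effective_tiers(global_tiers)
        else:
            base = set(global_tiers)
        return base - self.excluded


global_tiers = {"SSD", "HDD", "NET"}
parent = Cgroup("parent", excluded={"HDD"})
child = Cgroup("child", parent=parent, excluded={"SSD"})

print(sorted(parent.effective_tiers(global_tiers)))  # ['NET', 'SSD']
print(sorted(child.effective_tiers(global_tiers)))   # ['NET']
```

Because only exclusions are stored, a newly added tier is visible to every
cgroup without further configuration, which matches the "NEW" case in the
example above.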

For the detailed rationale, see the v2 RFC and the LPC presentation.

Changes in RFC v3
=================
- Fixed swap_alloc_fast() tier eligibility check
- Fixed tier_mask restoration on error paths  
- Fixed priority -1 tier deletion bug
- Fixed !CONFIG_MEMCG build failures
- Improved commit messages
- Fixed improper error handling
- Fixed coding style violations
- Fixed tier deletion propagation to cgroups

Changes in RFC v2
=================
- Strict cgroup hierarchy compliance (LPC 2025 feedback)
- Percpu swap device cache to preserve fastpath performance (Kairui Song, Baoquan He)
- Simplified tier structure (Chris Li)
- Removed explicit "+" selection; default is all tiers, use "-" to exclude (Chris Li)
- Removed CONFIG_SWAP_TIER; now base kernel feature (Chris Li)
- Effective tier calculation moved to configuration time (swap.tiers write)
- Mixed operation support for "+" and "-" in /sys/kernel/mm/swap/tiers (Chris Li)
- Commit reorganization for clarity (Chris Li)
- Added tier priority modification support
- Added documentation for swap tiers concept and usage (Chris Li)

Real-world Results
==================
Our first real-world use case is app preloading on our internal platform,
with NBD configured as a separate tier. (We plan to refine and expand this
usage.)

Without a separate swap tier:
- The default flash swap cannot be selectively avoided, so flash wear and
  the resulting lifespan issues cannot be reduced.
- NBD cannot be selectively assigned to the specific apps that need it.

Result (launch time, cold launch vs. preloaded):
- Streaming App A: 13.17s → 4.18s (68% reduction)
- Streaming App B: 5.60s → 1.12s (80% reduction)
- E-commerce App C: 10.25s → 2.00s (80% reduction)
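
For reference, the percentages above correspond to the relative reduction
in launch time, (cold - preloaded) / cold. A short Python check, using only
the timings listed above (the helper name `reduction_pct` is made up for
this illustration):

```python
def reduction_pct(cold_s: float, preloaded_s: float) -> int:
    """Relative reduction in launch time, rounded to a whole percent."""
    return round((cold_s - preloaded_s) / cold_s * 100)


# (cold launch, preloaded) in seconds, copied from the results above.
launches = {
    "Streaming App A": (13.17, 4.18),
    "Streaming App B": (5.60, 1.12),
    "E-commerce App C": (10.25, 2.00),
}

for app, (cold, preloaded) in launches.items():
    print(f"{app}: {reduction_pct(cold, preloaded)}% shorter launch time")
```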

Performance validation against baseline (no tiers configured) shows
negligible overhead (<1%) in kernel build and vm-scalability benchmarks.
Detailed results are in the v2 cover letter.

Any feedback welcome.
Youngjun Park

Youngjun Park (5):
  mm: swap: introduce swap tier infrastructure
  mm: swap: associate swap devices with tiers
  mm: memcontrol: add interface for swap tier selection
  mm, swap: change back to use each swap device's percpu cluster
  mm, swap: introduce percpu swap device cache to avoid fragmentation

 Documentation/admin-guide/cgroup-v2.rst |  27 ++
 Documentation/mm/swap-tier.rst          | 109 ++++++
 MAINTAINERS                             |   2 +
 include/linux/memcontrol.h              |   3 +-
 include/linux/swap.h                    |  17 +-
 mm/Makefile                             |   2 +-
 mm/memcontrol.c                         |  85 +++++
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  72 ++++
 mm/swap_tier.c                          | 469 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  84 +++++
 mm/swapfile.c                           | 133 +++----
 12 files changed, 938 insertions(+), 69 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

base-commit: 5a3704ed2dce0b54a7f038b765bb752b87ee8cc2
-- 
2.34.1
