Message-Id: <20250716202006.3640584-1-youngjun.park@lge.com>
Date: Thu, 17 Jul 2025 05:20:02 +0900
From: Youngjun Park <youngjun.park@....com>
To: akpm@...ux-foundation.org,
hannes@...xchg.org
Cc: mhocko@...nel.org,
roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev,
muchun.song@...ux.dev,
shikemeng@...weicloud.com,
kasong@...cent.com,
nphamcs@...il.com,
bhe@...hat.com,
baohua@...nel.org,
chrisl@...nel.org,
cgroups@...r.kernel.org,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
gunho.lee@....com,
iamjoonsoo.kim@....com,
taejoon.song@....com,
Youngjun Park <youngjun.park@....com>
Subject: [PATCH 0/4] mm/swap, memcg: Support per-cgroup swap device priorities
This patchset introduces a mechanism to assign swap device priorities
per cgroup.
It is an evolution of a previously submitted RFC [1], with revised
semantics, interfaces, and implementation based on community feedback.
======================================================================
I. MOTIVATION
======================================================================
The core requirement was to improve application responsiveness and loading
times, especially for latency-critical applications, without increasing RAM
or storage hardware resources.
Device constraints:
- Linux-based embedded platform
- Limited system RAM
- Small local swap
- No option to expand RAM or local swap
To mitigate this, we explored utilizing idle RAM and storage from nearby
devices as remote swap space. To maximize its effectiveness, we needed
per-cgroup control over swap device selection:
- Assign faster local swap devices to latency-critical apps
- Assign remote swap devices to background apps
However, the current kernel swap infrastructure does not support per-cgroup
swap device assignment.
======================================================================
II. EVALUATED ALTERNATIVES
======================================================================
**II-1. Per-cgroup Dedicated Swap Devices**
- Proposed upstream [2]
- Difficult to maintain consistent global vs per-cgroup swap state
- Hard to reconcile with memory.max and swap.max semantics
**II-2. Multi-backend Swap Device with Cgroup-aware Routing**
- Breaks layering abstraction (block device cgroup awareness)
- Swap devices treated as physical storage
- Related ideas discussed in [3]
**II-3. Per-cgroup Swap Enable/Disable with Usage Control**
- Could expand swap.max via zswap writeback [4]
- Cannot express flexible device orderings
- Less expressive than per-device priorities
**Conclusion:** Per-cgroup swap priority configuration is the most natural and
least invasive extension to existing kernel mechanisms.
======================================================================
III. DESIGN OVERVIEW
======================================================================
**III-1. Per-Cgroup Swap Priority**
Semantics:
- Configure swap priorities per device via the `memory.swap.priority` interface.
- If a value is specified, it overrides the global priority for that cgroup.
- Priority semantics follow the global swap behavior:
  - Higher numeric values are preferred
  - Devices with equal priority are used round-robin
  - Negative priorities follow NUMA-aware fallback [5]
- If no value is given, the global swap priority is used.
- Default settings influence swap device propagation on swapon/swapoff events.
- At `swapon`, these settings determine whether and how newly added devices
are included for the cgroup.
Each cgroup exposes a readable and writable file:
memory.swap.priority
This file accepts one `<id> <priority>` pair per line, where `<id>` is the
numeric ID of a swap device as shown in `/proc/swaps`:
Filename      Type       Size  Used  Priority  Id
/dev/sda2     partition  ...   ...   20        1
/dev/sdb2     partition  ...   ...   -2        2
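For scripting, the numeric id can be read back from `/proc/swaps`; a minimal
sketch, assuming the `Id` column remains the last field as in the layout above:

  # Print the swap device id assigned to /dev/sdb2 (illustrative only).
  awk '$1 == "/dev/sdb2" { print $NF }' /proc/swaps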
The following defaults can be set:
- `default none`: use the global priority (implicit default)
- `default disabled`: exclude swap devices from use in this cgroup
These defaults determine how new devices are handled at `swapon` time.
Special keywords can also be specified per device:
- `<id> none`: use global priority (clears override)
- `<id> disabled`: exclude the device from this cgroup's swap allocation
Reading this file shows the current configuration. Devices not explicitly set
may still appear if their effective priority differs from the global value due
to NUMA fallback or internal normalization.
**Example**
echo "1 -2" > memory.swap.priority
May result in:
1 -2
2 -3
To revert both devices to global priority:
echo "1 none" > memory.swap.priority
echo "2 none" > memory.swap.priority
To disable device 1 while allowing device 2:
echo "1 disabled" > memory.swap.priority
**III-2. Inheritance**
Inheritance semantics:
- Each cgroup inherits from the **highest** ancestor with a setting
- Intermediate ancestors are ignored
- If no ancestor is configured, the local setting is used
- If the inherited ancestor configuration is removed or absent, the cgroup
falls back to its local setting. If none exists, the global priority is used.
The effective configuration after inheritance is visible via:
memory.swap.priority.effective
If `default disabled` is active, it is shown explicitly.
If `default none` is used, it is applied silently and not shown.
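As a hypothetical session (the cgroup names and the output format of
`memory.swap.priority.effective` are illustrative; paths assume a cgroup v2
hierarchy mounted at /sys/fs/cgroup):

  # Configure the top-level ancestor; descendants inherit from the
  # highest configured ancestor, skipping intermediate levels.
  echo "1 100" > /sys/fs/cgroup/parent/memory.swap.priority

  # A grandchild with no local setting sees the ancestor's configuration.
  cat /sys/fs/cgroup/parent/mid/child/memory.swap.priority.effective
  # -> 1 100

  # Clearing the ancestor's override makes the grandchild fall back to its
  # own local setting, or to the global priorities if it has none.
  echo "1 none" > /sys/fs/cgroup/parent/memory.swap.priority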
======================================================================
IV. TESTING
======================================================================
This patchset was tested on x86_64 under QEMU using `stress-ng` to generate
swap I/O while toggling swap devices and updating `memory.swap.priority`.
The kernel was instrumented with KASAN, lockdep, and other
`CONFIG_DEBUG_*` options to increase debugging coverage and help identify
potential issues under stress.
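For reference, a representative test loop looked roughly like the following;
the cgroup name, memory limit, and stress-ng parameters are illustrative
rather than the exact original configuration:

  mkdir /sys/fs/cgroup/swaptest
  echo 256M > /sys/fs/cgroup/swaptest/memory.max
  echo "1 100" > /sys/fs/cgroup/swaptest/memory.swap.priority
  echo $$ > /sys/fs/cgroup/swaptest/cgroup.procs

  # Overcommit anonymous memory so the workload is pushed to swap, then
  # toggle devices and priorities while it runs.
  stress-ng --vm 4 --vm-bytes 1G --timeout 60s &
  swapoff /dev/sdb2 && swapon /dev/sdb2
  echo "2 disabled" > /sys/fs/cgroup/swaptest/memory.swap.priority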
======================================================================
V. CHANGE HISTORY
======================================================================
== RFC → v1 ==
[1] Changed the interface from a single comma-separated string (`1:10,2:-1`)
to a line-based flat-keyed format, following cgroup v2 interface conventions
where each swap device is configured independently.
- Suggested by: Michal Koutný
[2] Added `memory.swap.priority.effective` to expose the final applied
priority, reflecting cgroup inheritance rules.
[3] Clarified default semantics: `default none`, `default disabled`
- Suggested by: Michal Koutný
[4] Implemented a per-cgroup per-CPU swap device cache and used per-device
shared clusters to avoid scalability issues
- Suggested by: Kairui Song
[5] Exposed swap device id via /proc/swaps for introspection
[6] Introduced swap_cgroup_priority.h to define the main interface and declare
symbols shared with swapfile.c.
[7] Aligned the number of swap_cgroup_priority_pnode instances with nr_swapfiles
to ensure consistency during swap device changes.
[8] Removed the explicit delete interface, now handled implicitly by dynamic tracking.
======================================================================
VI. REFERENCES
======================================================================
[1] RFC: Per-cgroup swap device prioritization
https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d
[2] Cgroup-specific swap devices (2014)
https://lkml.iu.edu/hypermail/linux/kernel/1404.0/02530.html
[3] Swap redirection and zswap writeback discussions
https://lore.kernel.org/linux-mm/CAMgjq7DGMS5A4t6nOQmwyLy5Px96aoejBkiwFHgy9uMk-F8Y-w@mail.gmail.com
[4] Per-cgroup zswap writeback
https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com
[5] Swap NUMA fallback
https://docs.kernel.org/vm/swap_numa.html
---
This feature is marked **EXPERIMENTAL** in Kconfig, as it has not yet undergone
extensive real-world testing. The implementation is functional and reflects
feedback from prior RFC discussions, but further testing and review are welcome.
I’m happy to iterate based on community feedback.
Thanks,
Youngjun Park
Youngjun Park (4):
mm/swap, memcg: Introduce infrastructure for cgroup-based swap
priority
mm: swap: Apply per-cgroup swap priority mechanism to swap layer
mm: memcg: Add swap cgroup priority inheritance mechanism
mm: swap: Per-cgroup per-CPU swap device cache with shared clusters
Documentation/admin-guide/cgroup-v2.rst | 76 ++
MAINTAINERS | 2 +
include/linux/memcontrol.h | 3 +
include/linux/swap.h | 10 +
mm/Kconfig | 14 +
mm/Makefile | 1 +
mm/memcontrol.c | 105 ++-
mm/swap_cgroup_priority.c | 1036 +++++++++++++++++++++++
mm/swap_cgroup_priority.h | 128 +++
mm/swapfile.c | 108 ++-
10 files changed, 1456 insertions(+), 27 deletions(-)
create mode 100644 mm/swap_cgroup_priority.c
create mode 100644 mm/swap_cgroup_priority.h
base-commit: 347e9f5043c89695b01e66b3ed111755afcf1911
--
2.34.1