Message-Id: <20250716202006.3640584-1-youngjun.park@lge.com>
Date: Thu, 17 Jul 2025 05:20:02 +0900
From: Youngjun Park <youngjun.park@....com>
To: akpm@...ux-foundation.org,
	hannes@...xchg.org
Cc: mhocko@...nel.org,
	roman.gushchin@...ux.dev,
	shakeel.butt@...ux.dev,
	muchun.song@...ux.dev,
	shikemeng@...weicloud.com,
	kasong@...cent.com,
	nphamcs@...il.com,
	bhe@...hat.com,
	baohua@...nel.org,
	chrisl@...nel.org,
	cgroups@...r.kernel.org,
	linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	gunho.lee@....com,
	iamjoonsoo.kim@....com,
	taejoon.song@....com,
	Youngjun Park <youngjun.park@....com>
Subject: [PATCH 0/4] mm/swap, memcg: Support per-cgroup swap device priorities

This patchset introduces a mechanism to assign swap device priorities
per cgroup.

It is an evolution of a previously submitted RFC [1], with revised
semantics, interfaces, and implementation based on community feedback.

======================================================================
I. MOTIVATION
======================================================================

The core requirement was to improve application responsiveness and loading
time, especially for latency-critical applications, without increasing
RAM or storage hardware resources.

Device constraints:
  - Linux-based embedded platform
  - Limited system RAM
  - Small local swap
  - No option to expand RAM or local swap

To mitigate this, we explored utilizing idle RAM and storage from nearby
devices as remote swap space. To maximize its effectiveness, we needed
per-cgroup control over swap device selection:

  - Assign faster local swap devices to latency-critical apps
  - Assign remote swap devices to background apps

However, the current kernel swap infrastructure does not support per-cgroup
swap device assignment.

======================================================================
II. EVALUATED ALTERNATIVES
======================================================================

**II-1. Per-cgroup Dedicated Swap Devices**

- Proposed upstream [2]
- Difficult to maintain consistent global vs per-cgroup swap state
- Hard to reconcile with memory.max and swap.max semantics

**II-2. Multi-backend Swap Device with Cgroup-aware Routing**

- Breaks the layering abstraction (requires block devices to be cgroup-aware)
- Swap devices treated as physical storage
- Related ideas discussed in [3]

**II-3. Per-cgroup Swap Enable/Disable with Usage Control**

- Could expand swap.max via zswap writeback [4]
- Cannot express flexible device orderings
- Less expressive than per-device priorities

**Conclusion:** Per-cgroup swap priority configuration is the most natural and
least invasive extension to existing kernel mechanisms.

======================================================================
III. DESIGN OVERVIEW
======================================================================

**III-1. Per-Cgroup Swap Priority**

Semantics:
- Configure swap priorities per device via the `memory.swap.priority` interface.
- If a value is specified, it overrides the global priority for that cgroup.
- Priority semantics follow the global swap behavior:
  - Higher numeric values are preferred
  - Devices with equal priority are used round-robin
  - Negative priorities follow NUMA-aware fallback [5]
- If no value is given, the global swap priority is used.
- Default settings influence swap device propagation on swapon/swapoff events.
- At `swapon`, these settings determine whether and how newly added devices
  are included for the cgroup.
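
The override and ordering rules above can be modeled in a few lines. This is
only a userspace sketch of the described semantics, not the kernel
implementation: the function names and dictionary layout are hypothetical, and
the NUMA-aware handling of negative priorities is deliberately left out.

```python
def effective_priorities(global_prio, overrides):
    """Apply per-cgroup overrides on top of the global priorities.

    global_prio: {device_id: priority} as configured system-wide.
    overrides:   {device_id: priority} set for one cgroup (hypothetical model).
    A device without an override keeps its global priority.
    """
    return {dev: overrides.get(dev, prio) for dev, prio in global_prio.items()}


def allocation_order(priorities):
    """Group device ids by priority, highest priority first.

    Devices in the same inner list have equal priority and would be used
    round-robin; NUMA fallback for negative values is not modeled.
    """
    groups = {}
    for dev, prio in priorities.items():
        groups.setdefault(prio, []).append(dev)
    return [sorted(groups[p]) for p in sorted(groups, reverse=True)]


if __name__ == "__main__":
    prios = effective_priorities({1: 20, 2: -2}, {2: 30})
    print(prios)
    print(allocation_order(prios))
```

In this model, a cgroup that overrides device 2 with priority 30 prefers
device 2 over device 1, while a cgroup with no overrides sees the global order.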

Each cgroup exposes a readable and writable file:

  memory.swap.priority

This file accepts one `<id> <priority>` pair per line, where `<id>` is the
numeric ID of a swap device as shown in `/proc/swaps`:

  Filename       Type        Size   Used  Priority  Id
  /dev/sda2      partition   ...    ...   20        1
  /dev/sdb2      partition   ...    ...   -2        2
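
A tool that needs the device IDs could read them from the proposed `Id`
column. The sketch below assumes the column layout shown in the sample above
and that no field contains whitespace; `parse_proc_swaps` is a hypothetical
helper, not part of this patchset.

```python
def parse_proc_swaps(text):
    """Map swap device id -> filename and global priority.

    Assumes a header line naming the columns (including "Id" and
    "Priority") followed by one whitespace-separated row per device.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    devices = {}
    for ln in lines[1:]:
        row = dict(zip(header, ln.split()))
        devices[int(row["Id"])] = {
            "filename": row["Filename"],
            "priority": int(row["Priority"]),
        }
    return devices


if __name__ == "__main__":
    # Sample text with made-up sizes; on a patched kernel one would read
    # /proc/swaps instead.
    sample = (
        "Filename       Type        Size     Used  Priority  Id\n"
        "/dev/sda2      partition   1048576  0     20        1\n"
        "/dev/sdb2      partition   1048576  0     -2        2\n"
    )
    print(parse_proc_swaps(sample))
```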

The following defaults can be set:

- `default none`:
  Use global priority (implicit default)

- `default disabled`:
  Exclude swap devices from use in this cgroup

These defaults determine how new devices are handled at `swapon` time.

Special keywords can also be specified per device:
- `<id> none`: use global priority (clears override)
- `<id> disabled`: exclude the device from this cgroup's swap allocation
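
The accepted line formats can be summarized with a small validator. This is a
sketch of the grammar as described in this cover letter, with a hypothetical
function name; the kernel-side parser may accept or reject inputs differently,
and no priority range check is attempted here.

```python
def parse_setting(line):
    """Parse one line written to memory.swap.priority.

    Returns ("default", "none" | "disabled") for a default setting, or
    (device_id, priority | "none" | "disabled") for a per-device setting.
    Raises ValueError for anything else.
    """
    parts = line.split()
    if len(parts) != 2:
        raise ValueError(f"expected two fields: {line!r}")
    key, value = parts
    if key == "default":
        if value not in ("none", "disabled"):
            raise ValueError(f"bad default: {value!r}")
        return ("default", value)
    dev_id = int(key)  # raises ValueError for a non-numeric id
    if value in ("none", "disabled"):
        return (dev_id, value)
    return (dev_id, int(value))  # raises ValueError for a non-numeric priority


if __name__ == "__main__":
    for text in ("1 -2", "2 none", "1 disabled", "default disabled"):
        print(text, "->", parse_setting(text))
```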

Reading this file shows the current configuration. Devices not explicitly set
may still appear if their effective priority differs from the global value due
to NUMA fallback or internal normalization.

**Example**

  echo "1 -2" > memory.swap.priority

Reading the file back may then show:

  1 -2
  2 -3

To revert both devices to global priority:

  echo "1 none" > memory.swap.priority
  echo "2 none" > memory.swap.priority

To disable device 1 while allowing device 2:

  echo "1 disabled" > memory.swap.priority

**III-2. Inheritance**

Inheritance semantics:

- Each cgroup inherits from the **highest** ancestor with a setting
- Intermediate ancestors are ignored
- If no ancestor is configured, the local setting is used
- If the inherited ancestor configuration is removed or absent, the cgroup
  falls back to its local setting. If none exists, the global priority is used.
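
One reading of these rules is a walk from the top of the hierarchy down to the
target cgroup that stops at the first configured level. The sketch below
assumes that "highest" means closest to the root; the function name and data
layout are hypothetical and do not reflect the kernel data structures.

```python
def effective_setting(path, settings):
    """Resolve which configuration applies to the last cgroup in `path`.

    path:     cgroup names ordered from the topmost ancestor down to the
              target cgroup itself (inclusive).
    settings: {cgroup_name: config} for cgroups that have a setting.
    Returns the config of the highest configured ancestor, else the
    cgroup's own config, else None (meaning: use the global priority).
    """
    for name in path:
        cfg = settings.get(name)
        if cfg is not None:
            return cfg
    return None


if __name__ == "__main__":
    path = ["system", "apps", "browser"]
    # "apps" is an intermediate ancestor and is ignored because "system",
    # higher up, already has a setting.
    print(effective_setting(path, {"system": {1: 10}, "apps": {1: 5}}))
    # No ancestor is configured, so the local setting is used.
    print(effective_setting(path, {"browser": {2: 7}}))
    # Nothing is configured anywhere: fall back to global priority.
    print(effective_setting(path, {}))
```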

The effective configuration after inheritance is visible via:

  memory.swap.priority.effective

If `default disabled` is active, it is shown explicitly.  
If `default none` is used, it is applied silently and not shown.

======================================================================
IV. TESTING
======================================================================

This patchset was tested on x86_64 under QEMU using `stress-ng` to generate
swap I/O while toggling swap devices and updating `memory.swap.priority`.

The kernel was instrumented with KASAN, lockdep, and other
`CONFIG_DEBUG_*` options to increase debugging coverage and help identify
potential issues under stress.

======================================================================
V. CHANGE HISTORY
======================================================================

== RFC → v1 ==

[1] Changed interface from flat `1:10,2:-1` to line-based flat key format,
    following cgroup v2 interface conventions where each swap device is
    configured independently.
    - Suggested by: Michal Koutný

[2] Added `memory.swap.priority.effective` to expose the final applied
    priority, reflecting cgroup inheritance rules.

[3] Clarified default semantics: `default none`, `default disabled`
    - Suggested by: Michal Koutný

[4] Implemented per-cgroup percpu swap device cache and used per-device
    shared clusters to avoid scalability issues
    - Suggested by: Kairui Song

[5] Exposed swap device id via /proc/swaps for introspection

[6] Introduced swap_cgroup_priority.h to define the main interface and declare
    symbols shared with swapfile.c.

[7] Aligned the number of swap_cgroup_priority_pnode instances with nr_swapfiles
    to ensure consistency during swap device changes.

[8] Removed the explicit delete interface, now handled implicitly by dynamic tracking.

======================================================================
VI. REFERENCES
======================================================================

[1] RFC: Per-cgroup swap device prioritization  
    https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d  
[2] Cgroup-specific swap devices (2014)  
    https://lkml.iu.edu/hypermail/linux/kernel/1404.0/02530.html  
[3] Swap redirection and zswap writeback discussions  
    https://lore.kernel.org/linux-mm/CAMgjq7DGMS5A4t6nOQmwyLy5Px96aoejBkiwFHgy9uMk-F8Y-w@mail.gmail.com  
[4] Per-cgroup zswap writeback  
    https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com  
[5] Swap NUMA fallback  
    https://docs.kernel.org/vm/swap_numa.html
---

This feature is marked **EXPERIMENTAL** in Kconfig, as it has not yet undergone
extensive real-world testing. The implementation is functional and reflects
feedback from prior RFC discussions, but further testing and review are welcome.
I’m happy to iterate based on community feedback.

Thanks,
Youngjun Park

Youngjun Park (4):
  mm/swap, memcg: Introduce infrastructure for cgroup-based swap
    priority
  mm: swap: Apply per-cgroup swap priority mechanism to swap layer
  mm: memcg: Add swap cgroup priority inheritance mechanism
  mm: swap: Per-cgroup per-CPU swap device cache with shared clusters

 Documentation/admin-guide/cgroup-v2.rst |   76 ++
 MAINTAINERS                             |    2 +
 include/linux/memcontrol.h              |    3 +
 include/linux/swap.h                    |   10 +
 mm/Kconfig                              |   14 +
 mm/Makefile                             |    1 +
 mm/memcontrol.c                         |  105 ++-
 mm/swap_cgroup_priority.c               | 1036 +++++++++++++++++++++++
 mm/swap_cgroup_priority.h               |  128 +++
 mm/swapfile.c                           |  108 ++-
 10 files changed, 1456 insertions(+), 27 deletions(-)
 create mode 100644 mm/swap_cgroup_priority.c
 create mode 100644 mm/swap_cgroup_priority.h

base-commit: 347e9f5043c89695b01e66b3ed111755afcf1911
-- 
2.34.1

