Message-Id: <20251109124947.1101520-1-youngjun.park@lge.com>
Date: Sun,  9 Nov 2025 21:49:44 +0900
From: Youngjun Park <youngjun.park@....com>
To: akpm@...ux-foundation.org,
	linux-mm@...ck.org
Cc: cgroups@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	chrisl@...nel.org,
	kasong@...cent.com,
	hannes@...xchg.org,
	mhocko@...nel.org,
	roman.gushchin@...ux.dev,
	shakeel.butt@...ux.dev,
	muchun.song@...ux.dev,
	shikemeng@...weicloud.com,
	nphamcs@...il.com,
	bhe@...hat.com,
	baohua@...nel.org,
	youngjun.park@....com,
	gunho.lee@....com,
	taejoon.song@....com
Subject: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control

Hi all,

In constrained environments, there is a need to improve workload
performance by controlling swap device usage on a per-process or
per-cgroup basis. For example, one might want to direct critical
processes to faster swap devices (like SSDs) while relegating
less critical ones to slower devices (like HDDs or Network Swap).

The initial approach was to introduce a per-cgroup swap priority
mechanism [1]. However, through review and discussion, several
drawbacks were identified:

a. There is a lack of concrete use cases for assigning a fine-grained,
   unique swap priority to each cgroup. 
b. The implementation complexity was high relative to the desired
   level of control.
c. Differing swap priorities between cgroups could lead to LRU
   inversion problems.

To address these concerns, I propose the "swap tiers" concept, 
originally suggested by Chris Li [2] and further developed through 
collaborative discussions. I would like to thank Chris Li and 
He Baoquan for their invaluable contributions in refining this 
approach, and Kairui Song, Nhat Pham, and Michal Koutný for their 
insightful reviews of earlier RFC versions.

Concept
-------
A swap tier is a grouping mechanism that assigns a "named id" to a
range of swap priorities. For example, all swap devices with a
priority of 100 or higher could be grouped into a tier named "SSD",
and all others into a tier named "HDD".

Cgroups can then select which named tiers they are permitted to use for
swapping via a new cgroup interface. This effectively restricts a
cgroup's swap activity to a specific subset of the available swap
devices.
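
As a rough illustration of the grouping described above (a user-space
sketch, not code from this series), the lookup from a device priority to
a tier and the per-cgroup filtering might be modeled as follows. The
names `SwapDevice`, `TIERS`, `tier_of`, `usable_devices` and the device
paths are hypothetical, and the lower bound of -1 for the lowest tier is
an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwapDevice:
    name: str  # hypothetical device path
    prio: int  # swap priority of the device


# Hypothetical tier table: (tier name, inclusive lower bound), highest first.
# The -1 lower bound for the lowest tier is an assumption.
TIERS = [("SSD", 100), ("HDD", -1)]


def tier_of(prio: int) -> str:
    """Return the first tier whose inclusive lower bound is <= prio."""
    for name, lower in TIERS:
        if prio >= lower:
            return name
    raise ValueError(f"priority {prio} is not covered by any tier")


def usable_devices(devices, allowed_tiers):
    """Keep only the devices whose tier the cgroup is permitted to use."""
    return [d for d in devices if tier_of(d.prio) in allowed_tiers]


devices = [SwapDevice("/dev/nvme0n1p3", 200), SwapDevice("/dev/sda2", 10)]
ssd_only = usable_devices(devices, {"SSD"})
```

A cgroup restricted to the "SSD" tier would only ever see the first
device in this model.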

Proposed Interface
------------------
1. Global Tier Definition: /sys/kernel/mm/swap/tiers

This file is used to define the global swap tiers and their associated
minimum priority levels.

- To add tiers:
  Format: + 'tier_name':'prio'[(,|' ')'tier_name2':'prio']...
  Example:
  # echo "+ SSD:100,HDD:2" > /sys/kernel/mm/swap/tiers

  There are several rules for defining tiers:
  - Priority ranges for tiers must not overlap.
  - The combination of all defined tiers must cover the entire valid
    priority range (DEF_SWAP_PRIO to SHRT_MAX) to ensure every swap device
    can be assigned to a tier.
  - A tier's prio value is the inclusive lower bound of its range; the
    range extends up to one below the prio of the next higher tier.
    The highest tier extends up to SHRT_MAX, and the lowest tier extends
    down to DEF_SWAP_PRIO.
  - If the specified tiers do not cover the entire priority range,
    the priority of the tier with the lowest specified priority value
    is set to SHRT_MIN.
  - The total number of tiers is limited. 
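
The rules above can be sketched in user space as follows. This is only a
model of the described behavior, not the parser from the series:
`parse_add` and `tier_ranges` are hypothetical names, and the lower bound
of -1 for the lowest tier is an assumption taken from the sample output
of the tiers file.

```python
SHRT_MAX = 32767
DEF_SWAP_PRIO = -1  # assumption: matches PrioStart of the lowest tier in the sample output


def parse_add(command: str):
    """Parse '+ NAME:PRIO[,NAME:PRIO]...' into a list of (name, prio) pairs."""
    op, _, rest = command.strip().partition(" ")
    if op != "+":
        raise ValueError("an add command must start with '+'")
    pairs = []
    # Entries may be separated by commas or spaces.
    for token in rest.replace(",", " ").split():
        name, sep, prio = token.partition(":")
        if not name or not sep:
            raise ValueError(f"malformed tier spec: {token!r}")
        pairs.append((name, int(prio)))
    return pairs


def tier_ranges(pairs):
    """Return (name, start, end) tuples, highest tier first."""
    ordered = sorted(pairs, key=lambda p: p[1], reverse=True)
    prios = [prio for _, prio in ordered]
    if len(set(prios)) != len(prios):
        raise ValueError("two tiers share the same lower bound")
    ranges = []
    upper = SHRT_MAX
    for i, (name, prio) in enumerate(ordered):
        # The lowest tier is stretched down to cover the rest of the range.
        start = DEF_SWAP_PRIO if i == len(ordered) - 1 else prio
        ranges.append((name, start, upper))
        upper = prio - 1  # the next lower tier ends just below this one
    return ranges


ranges = tier_ranges(parse_add("+ SSD:100,HDD:2"))
```

With the example command, `ranges` holds SSD as 100..32767 and HDD as
-1..99, so the ranges neither overlap nor leave a gap.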

- To remove tiers:
  Format: - 'tier_name'[(,|' ')'tier_name2']...
  Example:
  # echo "- SSD,HDD" > /sys/kernel/mm/swap/tiers

  Note: A tier cannot be removed if it is currently in use by any
  cgroup or if any active swap device is assigned to it. This acts as
  a reference count to prevent disruption.
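
A minimal model of this removal check, assuming each tier tracks how
many cgroups and active swap devices refer to it. The `Tier` fields,
`TierInUse`, and `remove_tier` are hypothetical names used only for
illustration.

```python
from dataclasses import dataclass


class TierInUse(Exception):
    """Raised when a tier is still referenced and cannot be removed."""


@dataclass
class Tier:
    name: str
    cgroup_users: int = 0    # cgroups that reference this tier
    active_devices: int = 0  # active swap devices assigned to this tier


def remove_tier(tiers: dict, name: str) -> None:
    """Remove a tier only when nothing refers to it any more."""
    tier = tiers[name]
    if tier.cgroup_users or tier.active_devices:
        raise TierInUse(name)
    del tiers[name]


tiers = {"SSD": Tier("SSD", active_devices=1), "HDD": Tier("HDD")}
remove_tier(tiers, "HDD")  # succeeds: no users, no devices
```

Removing "SSD" in this state would raise `TierInUse` until its device is
swapped off.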

- To show current tiers:
  Reading the file displays the currently configured tiers, their
  internal index, and the priority range they cover.
  Example:
  # echo "+ SSD:100,HDD:2" > /sys/kernel/mm/swap/tiers
  # cat /sys/kernel/mm/swap/tiers
  Name      Idx   PrioStart   PrioEnd
            0
  SSD       1     100         32767
  HDD       2     -1          99

  - `Name`: The name of the tier. The unnamed entry is a default tier.
  - `Idx`: The internal index assigned to the tier.
  - `PrioStart`: The starting priority of the range covered by this tier.
  - `PrioEnd`: The ending priority of the range covered by this tier.

Two special tiers are predefined:
- "": Represents the default inheritance behavior in cgroups.
- "zswap": Reserved for zswap integration.

2. Cgroup Tier Selection: memory.swap.tiers

This file controls which swap tiers are enabled for a given cgroup.

- Reading the file:
  The first line shows the operation that was written to the file.
  The second line shows the final, effective set of tiers after
  merging with the parent cgroup's configuration.

- Writing to the file:
  Format: [+|-] [+|-]TIER_NAME...
  - `+TIER_NAME`: Explicitly enables this tier for the cgroup.
  - `-TIER_NAME`: Explicitly disables this tier for the cgroup.
  - If a tier is not specified, its setting is inherited from the
    parent cgroup.
  - A standalone `+` at the beginning resets the configuration: it
    ignores the parent's settings, enables all globally defined tiers,
    and then applies the subsequent operations in the command.
  - A standalone `-` at the beginning also resets: it ignores the
    parent's settings, disables all tiers, and then applies subsequent
    operations.
  - The root cgroup defaults to an implicit `+`, enabling all swap
    devices.

  Example:
  # echo "+ -SSD -HDD" > /sys/fs/cgroup/my_cgroup/memory.swap.tiers
  This command first resets the cgroup's configuration to enable all
  tiers (due to the leading `+`), and then explicitly disables the
  "SSD" and "HDD" tiers.
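
The write semantics above can be modeled with plain sets. This is a
sketch under the stated rules, not the kernel implementation;
`effective_tiers` and its parameters are hypothetical names.

```python
def effective_tiers(command, parent_effective, all_tiers):
    """Compute the effective tier set for a cgroup.

    command          -- string written to memory.swap.tiers
    parent_effective -- effective tier set of the parent cgroup
    all_tiers        -- all globally defined tier names
    """
    tokens = command.split()
    if tokens and tokens[0] in ("+", "-"):
        # A standalone '+' or '-' resets: the parent's settings are ignored.
        base = set(all_tiers) if tokens[0] == "+" else set()
        tokens = tokens[1:]
    else:
        # Without a reset, unspecified tiers are inherited from the parent.
        base = set(parent_effective)
    for token in tokens:
        sign, name = token[0], token[1:]
        if sign not in "+-" or name not in all_tiers:
            raise ValueError(f"invalid tier operation: {token!r}")
        if sign == "+":
            base.add(name)
        else:
            base.discard(name)
    return base


ALL = {"SSD", "HDD"}
result = effective_tiers("+ -SSD -HDD", {"SSD"}, ALL)
```

For the example command, `result` is the empty set regardless of the
parent's configuration, because the leading `+` discards the inherited
state before the two tiers are disabled.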

Further Discussion and Open Questions
-------------------------------------
I seek feedback on this concept and have identified several key
points that require further discussion (though this is not an 
exhaustive list). This topic will also be presented at the upcoming 
Linux Plumbers Conference 2025 [3], and I would appreciate any 
feedback here on the list beforehand, or in person at the conference.

1.  The swap fast path utilizes a percpu cluster cache for efficiency.
    In swap tiers, this has been changed to a per-device per-cpu 
    cluster cache. (See the first patch in this series.)
    An alternative approach would be to cache only the swap_info_struct 
    (si) per-tier per-cpu, avoiding cluster caching entirely while still 
    maintaining fast device acquisition without `swap_avail_lock`.
    Should we pursue this alternative, or is the current per-device 
    per-cpu cluster caching approach preferable?

2.  Consistency with cgroup parent-child semantics: Unlike general
    resource distribution, tier selection may bypass parent
    constraints (e.g., a child can enable a tier disabled by its
    parent). Is this behavior acceptable?

3.  Per-cgroup swap tier limit: Is a `swap.tier.max` needed in
    addition to the existing `swap.max`?

4.  Parent-child tier mismatch: If a zombie memcg (child) uses a tier
    that is not available to its new parent, how should this be
    handled during recharging or reparenting? (This question is raised
    in the context of ongoing work to improve memcg reparenting and
    handle zombie memcgs [4, 5].)

5.  Tier mask calculation: What are the trade-offs between calculating
    the effective tier mask at runtime vs. pre-calculating it when the
    interface is written to?

6.  If a swap tier configuration is applied to a memcg, should we
    migrate existing swap-out pages that are on devices not belonging
    to any of the cgroup's allowed tiers?

7.  Swap tiers could serve as a good abstraction layer. I would like to
    discuss possible extended uses of swap tiers.
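
To make point 5 concrete, the two options can be contrasted in a small
user-space model. This is only a sketch: the `Cgroup` class, its fields,
and the bit assignments (taken from the Idx column of the sample output)
are assumptions, not the data structures used in the patches.

```python
SSD = 1 << 1  # assumed bit positions, following Idx 1 and 2 in the sample output
HDD = 1 << 2
ALL_TIERS = SSD | HDD


class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.reset = None  # None: inherit from parent; otherwise the mask to start from
        self.enable = 0    # tiers explicitly enabled here
        self.disable = 0   # tiers explicitly disabled here
        self.cached = parent.cached if parent else ALL_TIERS
        if parent:
            parent.children.append(self)

    def _apply(self, inherited):
        base = inherited if self.reset is None else self.reset
        return (base | self.enable) & ~self.disable

    def mask_runtime(self):
        """Option A: walk up to the root on every lookup, O(depth) per read."""
        inherited = self.parent.mask_runtime() if self.parent else ALL_TIERS
        return self._apply(inherited)

    def write(self, reset, enable, disable):
        """Option B: recompute the cached mask for the whole subtree on write."""
        self.reset, self.enable, self.disable = reset, enable, disable
        self._refresh()

    def _refresh(self):
        inherited = self.parent.cached if self.parent else ALL_TIERS
        self.cached = self._apply(inherited)
        for child in self.children:
            child._refresh()


root = Cgroup()
a = Cgroup(root)
b = Cgroup(a)
a.write(None, 0, SSD)  # equivalent to writing "-SSD" to a
```

In this model, the runtime walk keeps writes cheap but pays on every
swap-out, while the cached mask makes reads a single load at the cost of
a subtree traversal on each write.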

Any feedback on the overall concept, interface, and these specific
points would be greatly appreciated.

Best Regards,
Youngjun Park

References
----------
[1] https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d
[2] https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
[3] https://lpc.events/event/19/abstracts/2296/
[4] https://lore.kernel.org/linux-mm/20230720070825.992023-1-yosryahmed@google.com/
[5] https://blogs.oracle.com/linux/post/zombie-memcg-issues

Youngjun Park (3):
  mm, swap: change back to use each swap device's percpu cluster
  mm: swap: introduce swap tier infrastructure
  mm/swap: integrate swap tier infrastructure into swap subsystem

 Documentation/admin-guide/cgroup-v2.rst |  32 ++
 MAINTAINERS                             |   2 +
 include/linux/memcontrol.h              |   4 +
 include/linux/swap.h                    |  16 +-
 mm/Kconfig                              |  13 +
 mm/Makefile                             |   1 +
 mm/memcontrol.c                         |  69 +++
 mm/page_io.c                            |  21 +-
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  93 ++++
 mm/swap_tier.c                          | 602 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  75 +++
 mm/swapfile.c                           | 169 +++----
 13 files changed, 987 insertions(+), 114 deletions(-)
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

base-commit: 02dafa01ec9a00c3758c1c6478d82fe601f5f1ba
-- 
2.34.1

