Message-ID: <CAKEwX=OLqqvXCqBcTotAAWWx=dqLr9xk-Nw-=Hh5yUVZokzXgQ@mail.gmail.com>
Date: Thu, 12 Jun 2025 14:32:53 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Kairui Song <ryncsn@...il.com>
Cc: youngjun.park@....com, linux-mm@...ck.org, akpm@...ux-foundation.org, 
	hannes@...xchg.org, mhocko@...nel.org, roman.gushchin@...ux.dev, 
	shakeel.butt@...ux.dev, cgroups@...r.kernel.org, linux-kernel@...r.kernel.org, 
	shikemeng@...weicloud.com, bhe@...hat.com, baohua@...nel.org, 
	chrisl@...nel.org, muchun.song@...ux.dev, iamjoonsoo.kim@....com, 
	taejoon.song@....com, gunho.lee@....com
Subject: Re: [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization

On Thu, Jun 12, 2025 at 5:24 AM Kairui Song <ryncsn@...il.com> wrote:
>
> On Thu, Jun 12, 2025 at 6:38 PM <youngjun.park@....com> wrote:
> >
> > From: Youngjun Park <youngjun.park@....com>
> >
> > Introduction
> > ============
> > I am a kernel developer working on platforms deployed on commercial consumer devices.
> > Due to real-world product requirements, I needed to modify the Linux kernel to support
> > a new swap management mechanism. The proposed mechanism allows assigning different swap
> > priorities to swap devices per cgroup.
> > I believe this mechanism can be generally useful for similar constrained-device scenarios
> > and would like to propose it for upstream inclusion and solicit feedback from the community.

We're mostly just using zswap and disk swap, for now, so I don't have
too much input for this.

Kairui, would this design satisfy your zram use case as well?

> >
> > Motivation
> > ==========
> > The core requirement was to improve application responsiveness and loading time, especially
> > for latency-critical applications, without increasing RAM or storage hardware resources.
> > Device constraints:
> >   - Linux-based embedded platform
> >   - Limited system RAM
> >   - Small local swap
> >   - No option to expand RAM or local swap
> > To mitigate this, we explored utilizing idle RAM and storage from nearby devices as remote
> > swap space. To maximize its effectiveness, we needed the ability to control which swap devices
> > were used by different cgroups:
> >   - Assign faster local swap devices to latency critical apps
> >   - Assign remote swap devices to background apps
> > However, current Linux kernel swap infrastructure does not support per-cgroup swap device
> > assignment.
> > To solve this, I propose a mechanism to allow each cgroup to specify its own swap device
> > priorities.
> >
> > Evaluated Alternatives
> > ======================
> > 1. **Per-cgroup dedicated swap devices**
> >    - Previously proposed upstream [1]
> >    - Challenges in managing global vs per-cgroup swap state
> >    - Difficult to integrate with existing memory.limit / swap.max semantics
> > 2. **Multi-backend swap device with cgroup-aware routing**
> >    - Considered something of a layering violation (block device cgroup awareness)
> >    - Swap devices are commonly meant to be physical block devices.
> >    - Similar idea mentioned in [2]
> > 3. **Per-cgroup swap device enable/disable with swap usage control**
> >    - Expand swap.max with zswap.writeback usage
> >    - Discussed in context of zswap writeback [3]
> >    - Cannot express arbitrary priority orderings
> >     (e.g. a cgroup cannot use order C-A-B when the global order is A-B-C)
> >    - Less flexible than per-device priority approach
> > 4. **Per-namespace swap priority configuration**
> >    - In short, introduce a swap namespace that holds its own swap device priorities
> >    - Overly complex for our use case
> >    - Cgroups are the natural scope for this mechanism
> >
> > Based on these findings, we chose to prototype per-cgroup swap priority configuration
> > as the most natural, least invasive extension of the existing kernel mechanisms.
> >
> > Design and Semantics
> > ====================
> > - Each swap device gets a unique ID at `swapon` time
> > - Each cgroup has a `memory.swap.priority` interface:
> >   - Reading the file shows each device's unique ID
> >   - Format: `unique_id:priority,unique_id:priority,...`
> >   - All currently-active swap devices must be listed
> >   - Priorities follow existing swap infrastructure semantics
> > - The interface is writeable and updatable at runtime
> > - A priority configuration can be reset via `echo "" > memory.swap.priority`
> > - Swap on/off events propagate to all cgroups with priority configurations
> >
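> > A rough sketch of the intended write-side parsing (simplified and illustrative;
> > the function and callback names below are not final and may differ in the
> > actual patch):
> >
> >   /* Illustrative sketch (would live in mm/swap_cgroup_priority.c):
> >    * parse a "unique_id:priority,..." string written to
> >    * memory.swap.priority and hand each pair to a callback.  The
> >    * empty-string reset case is handled by the caller before this runs.
> >    */
> >   static int swap_cgroup_parse_priorities(char *buf,
> >           int (*apply)(unsigned int id, int prio, void *data),
> >           void *data)
> >   {
> >           char *tok;
> >
> >           while ((tok = strsep(&buf, ",")) != NULL) {
> >                   char *p = strchr(tok, ':');
> >                   unsigned int id;
> >                   int prio, err;
> >
> >                   if (!p)
> >                           return -EINVAL;
> >                   *p++ = '\0';
> >                   err = kstrtouint(tok, 10, &id);
> >                   if (err)
> >                           return err;
> >                   err = kstrtoint(p, 10, &prio);
> >                   if (err)
> >                           return err;
> >                   err = apply(id, prio, data);
> >                   if (err)
> >                           return err;
> >           }
> >           return 0;
> >   }
> >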
> > Example Usage
> > -------------
> > # swap device on
> > $ swapon
> > NAME      TYPE      SIZE USED PRIO
> > /dev/sdb  partition 300M  0B   10
> > /dev/sdc  partition 300M  0B    5
> >
> > # assign custom priorities in a cgroup
> > $ echo "1:5,2:10" > memory.swap.priority
> > $ cat memory.swap.priority
> > Active
> > /dev/sdb  unique:1  prio:5
> > /dev/sdc  unique:2  prio:10
> >
> > # adding new swap device later
> > $ swapon /dev/sdd --priority -1
> > $ cat memory.swap.priority
> > Active
> > /dev/sdb  unique:1  prio:5
> > /dev/sdc  unique:2  prio:10
> > /dev/sdd  unique:3  prio:-2
> >
> > # reset cgroup priority
> > $ echo "" > memory.swap.priority
> > $ cat memory.swap.priority
> > Inactive
> > /dev/sdb  unique:1  prio:10
> > /dev/sdc  unique:2  prio:5
> > /dev/sdd  unique:3  prio:-2
> >
> > Implementation Notes
> > ====================
> > The items below will be considered in follow-up patch work.
> >
> > - Workaround using the per-CPU swap cluster, as before
> > - Priority propagation to child cgroups
> > - Other TODO/XXX items noted in the code
> > - Refactoring for reviewability and maintainability, comprehensive testing
> >   and performance evaluation
>
> Hi Youngjun,
>
> Interesting idea. For your current approach, I think all we need is
> per-cgroup swap meta info structures (and infrastructure for maintaining
> and manipulating them).

Agreed.
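
Just to make sure we're picturing the same thing, something roughly like
the sketch below is what I have in mind. All of the names here are made
up for illustration and are not taken from the patch:

#include <linux/plist.h>
#include <linux/swap.h>         /* MAX_SWAPFILES */

/* Illustrative only: a per-cgroup analogue of the global swap plist. */
struct swap_cgroup_prio_node {
        struct plist_node avail_node;   /* on the per-cgroup plist */
        unsigned int unique_id;         /* the device's swapon-time unique ID */
        int prio;                       /* cgroup-local priority */
};

struct swap_cgroup_priority {
        /* per-cgroup counterpart of the global swap_avail_heads plist */
        struct plist_head avail_head;
        /* one node per active swap device, indexed by swap type */
        struct swap_cgroup_prio_node nodes[MAX_SWAPFILES];
        bool active;            /* cleared when the config is reset with "" */
};

swapon/swapoff would then only need to add or remove nodes in each
cgroup's copy, and the allocator walks whichever plist applies.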

>
> So we have a global version and a cgroup version of "plist, next
> cluster list, and maybe something else", right? And then
> once the allocator is folio-aware it can just prefer the cgroup ones
> (as I mentioned in another reply), reusing all the same
> routines. Changes are minimal; the cgroup swap meta info
> and control plane are maintained separately.
>
> It seems to align quite well with what I wanted to do, and can be done
> in a clean and easy-to-maintain way.
>
> Meanwhile, with virtual swap things could be even more flexible: not
> only changing the priority at swapout time, it would also provide
> capabilities to migrate and balance devices adaptively, and solve
> long-term issues like mTHP fragmentation, min-order swapout, etc.

Agreed.
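
On the folio-aware allocation part you mentioned above, the fallback could
be as simple as something like the sketch below, building on the struct
from my earlier sketch (memcg_swap_priority() and global_swap_avail_head()
are invented helpers here, not existing APIs):

/* Illustrative only -- the two helpers used here do not exist today. */
static struct plist_head *swap_avail_head_for(struct folio *folio)
{
        struct mem_cgroup *memcg = folio_memcg(folio);
        struct swap_cgroup_priority *scp;

        /* Prefer the cgroup's own priority list when one is configured. */
        scp = memcg ? memcg_swap_priority(memcg) : NULL;
        if (scp && scp->active)
                return &scp->avail_head;

        /* Otherwise fall back to the global list, same as today. */
        return global_swap_avail_head();
}

Everything downstream of the plist walk stays shared, which should keep
the diff small.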

>
> Maybe they can be combined; for example, a cgroup could be limited to using
> the virtual device or the physical ones depending on priority. Seems all
> solvable. Just some ideas here.

100%

>
> Vswap can cover the priority part too. I think we might want to avoid
> duplicated interfaces.

Yeah, as long as we have a reasonable cgroup interface, we can always
change the implementation later. We can move things to virtual swap,
etc. at a later time.

>
> So I'm just imagining things now: would it be good if we had something
> like this (following your design)?
>
> $ cat memcg1/memory.swap.priority
> Active
> /dev/vswap:(zram/zswap? with compression params?) unique:0 prio:5
>
> $ cat memcg2/memory.swap.priority
> Active
> /dev/vswap:/dev/nvme1  unique:1  prio:5
> /dev/vswap:/dev/nvme2  unique:2  prio:10
> /dev/vswap:/dev/vda  unique:3  prio:15
> /dev/sda  unique:4  prio:20
>
> $ cat memcg3/memory.swap.priority
> Active
> /dev/vda  unique:3  prio:5
> /dev/sda  unique:4  prio:15
>
> Meaning memcg1 (high priority) is allowed to use compressed memory
> only through vswap, memcg2 (mid priority) uses disks through vswap
> with a fallback to the HDD, and memcg3 (low prio) is only allowed to
> use slow devices.
>
> The global fallback just uses everything the system has. It might be
> overly complex, though?

Sounds good to me.

>
>
> >
> > Future Work
> > ===========
> > These are items that would benefit from further consideration
> > and potential implementation.
> >
> > - Support for per-process (or other scopes of) swap prioritization

This might be too granular.


> > - Optional usage limits per swap device (e.g., ratio, max bytes)
> > - Generalizing the interface beyond cgroups
> >
> > References
> > ==========
> > [1] https://lkml.iu.edu/hypermail/linux/kernel/1404.0/02530.html
> > [2] https://lore.kernel.org/linux-mm/CAMgjq7DGMS5A4t6nOQmwyLy5Px96aoejBkiwFHgy9uMk-F8Y-w@mail.gmail.com
> > [3] https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com
> >
> > All comments and feedback are greatly appreciated.
> > Patch will follow.
> >
> > Sincerely,
> > Youngjun Park
> >
> > youngjun.park (2):
> >   mm/swap, memcg: basic structure and logic for per cgroup swap priority
> >     control
> >   mm: swap: apply per cgroup swap priority mechanism on swap layer
> >
> >  include/linux/memcontrol.h |   3 +
> >  include/linux/swap.h       |  11 ++
> >  mm/Kconfig                 |   7 +
> >  mm/memcontrol.c            |  55 ++++++
> >  mm/swap.h                  |  18 ++
> >  mm/swap_cgroup_priority.c  | 335 +++++++++++++++++++++++++++++++++++++
> >  mm/swapfile.c              | 129 ++++++++++----
> >  7 files changed, 523 insertions(+), 35 deletions(-)
> >  create mode 100644 mm/swap_cgroup_priority.c
> >
> > base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
> > --
> > 2.34.1
> >
> >
