Message-ID: <uyxkdmnmvjipxuf7gagu2okw7afvzlclomfmc6wb6tygc3mhv6@736m7xs6gn5q>
Date: Thu, 14 Aug 2025 16:03:36 +0200
From: Michal Koutný <mkoutny@...e.com>
To: YoungJun Park <youngjun.park@....com>
Cc: akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org, 
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev, 
	shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com, bhe@...hat.com, 
	baohua@...nel.org, chrisl@...nel.org, cgroups@...r.kernel.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, gunho.lee@....com, iamjoonsoo.kim@....com, taejoon.song@....com
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority

On Wed, Jul 23, 2025 at 03:41:47AM +0900, YoungJun Park <youngjun.park@....com> wrote:
> This leaves us with a few design options:
> 
> 1. Treat negative values as valid priorities. Once any device is
>    assigned via `memory.swap.priority`, the NUMA autobind logic is
>    entirely disabled.
>    - Pros: Simplifies implementation; avoids exposing NUMA autobind via
>      cgroup interface.
>    - Cons: Overrides autobind for all devices even if only one is set.
> 
> 2. Continue to treat negative values as NUMA autobind weights, without
>    implicit shifting. If a user assigns `-3`, it is stored and used
>    exactly as `-3`, and does not affect other devices.
>    - Pros: Simple and intuitive; matches current implementation
>      semantics.
>    - Cons: Autobind semantics still need to be reasoned about when
>      using the interface.
> 
> 3. Disallow setting negative values via `memory.swap.priority`.
>    Existing NUMA autobind config is preserved, but no new autobind
>    configuration is possible from the cgroup interface.
>    - Pros: Keeps cgroup interface simple; no autobind manipulation.
>    - Cons: Autobind infra remains partially active, increasing code
>      complexity.
> 
> 4. Keep the current design: allow setting negative values to express
>    NUMA autobind weights explicitly. Devices without overridden values
>    continue to follow NUMA-based dynamic selection.
>    - Pros: Preserves current flexibility; gives users control per device.
>    - Cons: Slightly more complex semantics; NUMA autobind remains a
>      visible part of the interface.
> 
> After thinking through these tradeoffs, I'm inclined to think that
> preserving the NUMA autobind option might be the better path forward.
> What are your thoughts on this?
> 
> Thank you again for your helpful feedback.

Let me share my mental model to help with shaping the design.

I find these per-cgroup swap priorities similar to cpuset -- instead of
having a configured cpumask (bitmask) for each cgroup, you have a
weight-mask over the individual swap devices (or a distribution over
the devices; I hope that is not too big a deviation from priority
ranking). Then there is the hierarchy, so you need a method for
combining child and parent masks (or the global/root one) to obtain an
effective weight-mask (and effective ranking) for each cgroup.

Furthermore, there is the NUMA autobinding, which adds yet another
weight-mask to the game, but this one is not configured -- it depends
on "who is asking". (Tasks running on node N would have the autobind
weights shifted towards devices associated with node N. Is that how
autobinding works?)

From the hierarchy point of view, you have to compound the
weight-masks with top-down preference (so that higher cgroups can
override lower ones), plus an autobind weight-mask that is only
conceivable at the very bottom (not a cgroup but the task's NUMA
placement).
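
To make the compounding concrete, here is a toy sketch (all types,
names and constants below are made up for illustration, not the patch
code): walk from the cgroup up to the root, let the topmost configured
slot win, and fall back to the task's autobind weight only when no
level configured the slot at all.

        #include <stdbool.h>

        #define NR_SWAP_DEVS    8       /* arbitrary, for the sketch only */

        struct percg_swap_prio {
                struct percg_swap_prio *parent;
                bool set[NR_SWAP_DEVS];    /* slot configured at this level? */
                int  weight[NR_SWAP_DEVS];
        };

        /*
         * Effective weight of device @dev for a cgroup: top-down
         * preference, i.e. the topmost ancestor that configures the slot
         * overrides the levels below it.
         */
        static int effective_weight(const struct percg_swap_prio *cg, int dev,
                                    int autobind_weight)
        {
                int eff = 0;
                bool found = false;

                for (; cg; cg = cg->parent) {
                        if (cg->set[dev]) {
                                eff = cg->weight[dev];  /* higher level overwrites */
                                found = true;
                        }
                }

                /*
                 * Empty slot at every level: defer to NUMA autobinding,
                 * which only the very bottom (the task's placement) can
                 * supply.
                 */
                return found ? eff : autobind_weight;
        }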

There I see a bit of a conflict between the two ends. I think the
attempted reconciliation was to allow a single slot of the weight-mask
to be empty, but that may not be practical for the compounding (which
is why you came up with the four variants). So another option would be
to allow the whole weight-mask to be empty (or uniform), so that it
would be the identity in the compounding operation.
The conflict also exists in the current non-per-cgroup priorities --
there are the global priorities and the autobind priorities. IIUC, the
global level either defines a weight (user prio) or it is empty (defer
to NUMA autobinding).
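
By "identity" I mean it in the algebraic sense: if the levels were,
say, combined multiplicatively (just one possible compounding, not
necessarily what the series should do), an all-ones mask would leave
the effective distribution unchanged:

        /*
         * One possible compounding where a uniform mask is the neutral
         * element (illustrative only).
         */
        static void compound(double eff[], const double level[], int n)
        {
                for (int i = 0; i < n; i++)
                        eff[i] *= level[i];  /* level[i] == 1.0 everywhere changes nothing */
        }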

[I conflated device rankings and weight-masks above, but I left a
loophole as to how the empty slots in the latter would be converted to
(and from) rankings. This e-mail is already too long.]


A very different alternative that comes to my mind, building on
autobinding and leveraging it for your use case (a rough userspace
sketch follows the list):
- define virtual NUMA nodes [1],
- associate separate swap devices with those nodes,
- use task (or actual (mem)cpuset) affinity to those virtual NUMA
  nodes based on each process's swap requirements,
- NUMA autobinding would then yield the device constraints you sought.
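
The userspace side could then be ordinary memory policy / cpuset
affinity. Purely as an illustration (such virtual nodes do not exist
today, and the node id below is made up), binding a process to the
virtual node via libnuma might look like:

        #include <numa.h>
        #include <stdlib.h>

        int main(void)
        {
                unsigned int vnode = 2;     /* hypothetical virtual NUMA node */
                struct bitmask *bm;

                if (numa_available() < 0)
                        return EXIT_FAILURE;

                bm = numa_allocate_nodemask();
                numa_bitmask_setbit(bm, vnode);
                numa_set_membind(bm);       /* allocations follow vnode, so NUMA
                                             * autobinding would pick the swap
                                             * devices associated with it */
                numa_bitmask_free(bm);

                /* ... exec or run the actual workload here ... */
                return EXIT_SUCCESS;
        }

(Links with -lnuma.)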


HTH,
Michal


[1] Not sure how close this is to the linked series [2], which is,
    AFAIU, a different kind of virtualization that is not supposed to
    be exposed to userspace(?).
[2] https://lore.kernel.org/linux-mm/20250429233848.3093350-1-nphamcs@gmail.com/

