Message-ID: <aKC0vrr0vIdRV/Ob@yjaykim-PowerEdge-T330>
Date: Sun, 17 Aug 2025 01:41:34 +0900
From: YoungJun Park <youngjun.park@....com>
To: Michal Koutný <mkoutny@...e.com>
Cc: akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org,
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
	muchun.song@...ux.dev, shikemeng@...weicloud.com,
	kasong@...cent.com, nphamcs@...il.com, bhe@...hat.com,
	baohua@...nel.org, chrisl@...nel.org, cgroups@...r.kernel.org,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com,
	iamjoonsoo.kim@....com, taejoon.song@....com
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority

On Thu, Aug 14, 2025 at 04:03:36PM +0200, Michal Koutný wrote:
> On Wed, Jul 23, 2025 at 03:41:47AM +0900, YoungJun Park <youngjun.park@....com> wrote:

> Let me share my mental model in order to help forming the design.

First of all, thank you very much for your detailed reply. Friday was a
public holiday in Korea and I had some personal commitments over the weekend,
so I only got to read your email late; I hope you can excuse the delayed
response.

For the points that require deeper consideration, I will provide detailed
answers later. For now, let me share some quick feedback on the parts I can
respond to right away.

> I find these per-cgroup swap priorities similar to cpuset -- instead of
> having a configured cpumask (bitmask) for each cgroup, you have
> weight-mask for individual swap devices (or distribution over the
> devices, I hope it's not too big deviation from priority ranking).
> Then you have the hierarchy, so you need a method how to combine
> child+parent masks (or global/root) to obtain effective weight-mask (and
> effective ranking) for each cgroup.
> 
> Furthermore, there's the NUMA autobinding which adds another weight-mask
> to the game but this time it's not configured but it depends on "who is
> asking". (Tasks running on node N would have autobind shifted towards
> devices associated to node N. Is that how autobinding works?)

Yes, your description indeed captures the core concept of how autobinding
works.
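To make the autobinding behavior above concrete, here is a small illustrative model (not kernel code): tasks on node N see a device ordering shifted toward devices attached to node N. The device names, node assignments, and the exact tie-breaking rule are assumptions for the sketch only.

```python
# Illustrative model of NUMA autobinding: a task on node N prefers swap
# devices attached to node N when base priorities tie. Names and the
# tie-break scheme are assumptions, not the kernel's actual data layout.

def autobind_order(devices, task_node):
    """Order swap devices for a task running on `task_node`.

    `devices` is a list of (name, node, base_prio) tuples. Higher base
    priority wins; among equal priorities, node-local devices come first.
    """
    def key(dev):
        name, node, base_prio = dev
        local = (node == task_node)
        return (-base_prio, not local, name)
    return [name for name, _, _ in sorted(devices, key=key)]

devices = [
    ("swapA", 0, 10),   # attached to node 0
    ("swapB", 1, 10),   # attached to node 1, same base priority
    ("swapC", 0, 5),
]

# A task on node 1 sees swapB ahead of swapA at equal base priority.
print(autobind_order(devices, task_node=1))   # ['swapB', 'swapA', 'swapC']
```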
 
> From the hierarchy point of view, you have to compound weight-masks in
> top-down preference (so that higher cgroups can override lower) and
> autobind weight-mask that is only conceivable at the very bottom
> (not a cgroup but depending on the task's NUMA placement).
> 
> There I see conflict between the ends a tad. I think the attempted
> reconciliation was to allow emptiness of a single slot in the
> weight-mask but it may not be practical for the compounding (that's why
> you came up with the four variants). So another option would be to allow
> whole weight-mask being empty (or uniform) so that it'd be identity in
> the compounding operation.
> The conflict exists also in the current non-percg priorities -- there
> are the global priorities and autobind priorities. IIUC, the global
> level either defines a weight (user prio) or it is empty (defer to NUMA
> autobinding).
> 
> [I leveled rankings and weight-masks of devices but I left a loophole of
> how the empty slots in the latter would be converted to (and from)
> rankings. This e-mail is already too long.]

Yes, single-slot emptiness is the enemy.
The problem arises from two aspects: (1) allowing per-device priorities
inherently leads to the possibility of single-slot emptiness, and (2)
depending on the swapon configuration, empty slots may be unavoidable. That’s
why the compounding rules ended up with this complexity. I’ll review your
suggestions carefully and share soon how we might simplify in this
direction.
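The ambiguity can be sketched in a few lines. In this hedged model, a weight-mask is a dict of per-device weights; a wholly-empty mask is the identity you proposed, while a single empty slot (None) needs an arbitrary rule. The device names and the multiplicative compounding are assumptions for illustration only.

```python
# Hedged sketch of compounding parent and child weight-masks top-down.
# A wholly-empty child mask acts as the identity (defer to parent); a
# single empty slot (None) needs an extra rule, which is the ambiguity
# that forced the multiple compounding variants discussed above.

def compound(parent, child):
    """Return the effective weight-mask for child under parent."""
    if not child:                       # whole mask empty -> identity
        return dict(parent)
    eff = {}
    for dev, pw in parent.items():
        cw = child.get(dev)
        if cw is None:
            # One possible (assumed) rule: an empty slot defers to the
            # parent's weight for that device. Excluding the device
            # entirely would be another defensible interpretation.
            eff[dev] = pw
        else:
            eff[dev] = pw * cw
    return eff

parent = {"swapA": 2, "swapB": 1}
print(compound(parent, {}))                          # identity: parent unchanged
print(compound(parent, {"swapA": 3, "swapB": None})) # one empty slot
```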

> 
> An very different alternative that comes to my mind together with
> autobinding and leveraging that to your use case:
> - define virtual NUMA nodes [1],
> - associate separate swap devices to those nodes,
> - utilize task (or actual (mem)cpuset) affinity to those virtual NUMA
>   nodes based on each process's swap requirements,
> - NUMA autobinding would then yield the device constraints you sought.

A creative idea. I believe I understand the overall concept now.
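As I understand it, the alternative quoted above could be modeled roughly like this: bind swap devices to virtual NUMA nodes, and derive each task's device constraints from its node affinity. The node and device names here are purely hypothetical.

```python
# Hedged sketch of the virtual-NUMA-node alternative: devices are
# associated to virtual nodes, and a task's node affinity determines
# which devices it may swap to; autobinding would then prefer those.
# All names are assumptions for illustration.

node_devices = {
    "vnode0": ["fast_swap"],
    "vnode1": ["slow_swap"],
}

def allowed_devices(task_affinity):
    """Devices a task may swap to, given its virtual-node affinity."""
    devs = []
    for node in task_affinity:
        devs.extend(node_devices.get(node, []))
    return devs

# A latency-sensitive task pinned to vnode0 is constrained to fast_swap.
print(allowed_devices(["vnode0"]))   # ['fast_swap']
```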

Thank you as always for your valuable insights.

Best regards,  
YoungJun
