Message-ID: <CAF8kJuMo3yNKOZL9n5UkHx_O5cTZts287HOnQOu=KqQcnbrMdg@mail.gmail.com>
Date: Fri, 15 Aug 2025 08:10:09 -0700
From: Chris Li <chrisl@...nel.org>
To: Michal Koutný <mkoutny@...e.com>
Cc: YoungJun Park <youngjun.park@....com>, akpm@...ux-foundation.org, hannes@...xchg.org,
mhocko@...nel.org, roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
muchun.song@...ux.dev, shikemeng@...weicloud.com, kasong@...cent.com,
nphamcs@...il.com, bhe@...hat.com, baohua@...nel.org, cgroups@...r.kernel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com,
iamjoonsoo.kim@....com, taejoon.song@....com,
Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>, Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
cgroup-based swap priority
Hi Michal and YoungJun,
I am sorry for the late reply. I have briefly read through the patch
series; my overall impressions:
1) Priority is not the best way to select which swap file to use per cgroup.
A priority is assigned to one device; it is a per-swap-file local
setting. The effect you want is actually a global one: how this swap
device ranks relative to the other devices. What you really want as the
end result is an ordered list. Adjusting per-swap-file priority is
backwards, and a lot of unnecessary usage complexity and code
complexity comes from that.
2) This series is too complicated for what it does.
I have a similar idea, "swap.tiers," first mentioned earlier here:
https://lore.kernel.org/linux-mm/CAF8kJuNFtejEtjQHg5UBGduvFNn3AaGn4ffyoOrEnXfHpx6Ubg@mail.gmail.com/
I will outline the idea in more detail in the last part of my reply.
BTW, YoungJun and Michal, are you proposing the per-cgroup swap file
control topic for this year's LPC? If so, I am happy to work with you
on the swap tiers topic as a secondary; I probably don't have the time
to do it as the primary.
On Thu, Aug 14, 2025 at 7:03 AM Michal Koutný <mkoutny@...e.com> wrote:
>
> On Wed, Jul 23, 2025 at 03:41:47AM +0900, YoungJun Park <youngjun.park@....com> wrote:
> >
> > After thinking through these tradeoffs, I'm inclined to think that
> > preserving the NUMA autobind option might be the better path forward.
> > What are your thoughts on this?
The swap allocator has gone through a complete rewrite. We need to
revisit whether NUMA autobinding is still beneficial with the new swap
allocator, and we need more data points. Personally, I would like to
decouple NUMA from the swap device. If the swap device needs more
sharding, we can do more sharding without NUMA nodes. Using NUMA nodes
is just one way of sharding; it should not be the only way. Coupling
the swap device with NUMA nodes makes things really complicated, and it
would take a large performance difference to justify that kind of
complexity.
> > Thank you again for your helpful feedback.
>
> Let me share my mental model in order to help forming the design.
>
> I find these per-cgroup swap priorities similar to cpuset -- instead of
> having a configured cpumask (bitmask) for each cgroup, you have
> weight-mask for individual swap devices (or distribution over the
> devices, I hope it's not too big deviation from priority ranking).
+1. The swap tiers idea I have in mind is very close to what you describe.
> Then you have the hierarchy, so you need a method how to combine
> child+parent masks (or global/root) to obtain effective weight-mask (and
> effective ranking) for each cgroup.
Yes, swap tiers has a hierarchy story as well. I will cover that in a
later part of this email.
>
> Furthermore, there's the NUMA autobinding which adds another weight-mask
> to the game but this time it's not configured but it depends on "who is
> asking". (Tasks running on node N would have autobind shifted towards
> devices associated to node N. Is that how autobinding works?)
Again, I really wish the swap file selection decouples from the NUMA nodes.
> From the hierarchy point of view, you have to compound weight-masks in
> top-down preference (so that higher cgroups can override lower) and
> autobind weight-mask that is only conceivable at the very bottom
> (not a cgroup but depending on the task's NUMA placement).
I want to abandon weight adjusting and focus on opting in or out.
> There I see conflict between the ends a tad. I think the attempted
> reconciliation was to allow emptiness of a single slot in the
I think adjusting a single swap file's priority to change the relative
order is backwards.
> weight-mask but it may not be practical for the compounding (that's why
> you came up with the four variants). So another option would be to allow
> whole weight-mask being empty (or uniform) so that it'd be identity in
> the compounding operation.
> The conflict exists also in the current non-percg priorities -- there
> are the global priorities and autobind priorities. IIUC, the global
> level either defines a weight (user prio) or it is empty (defer to NUMA
> autobinding).
>
> [I leveled rankings and weight-masks of devices but I left a loophole of
> how the empty slots in the latter would be converted to (and from)
> rankings. This e-mail is already too long.]
OK. I want to abandon the weight-adjustment approach. Here I outline
the swap tiers idea; I can probably start a new thread for it later.
1) No per-cgroup swap priority adjustment. Swap file priority is
global to the system.
Per-cgroup swap file reordering is bad from the LRU point of view. The
swap file ordering should match the service performance of the swap
devices: fast tiers such as zram and zswap store the hotter data, a
slower tier such as a hard drive stores the colder data, and SSD sits
in between. It is important to keep the fast/slow tier ordering matched
to the hot/cold LRU ordering.
2) There is a simple mapping of global swap tier names to priority ranges.
The names themselves are customizable.
e.g. 100+ is the "compress_ram" tier, 50-99 is the "ssd" tier, 0-49 is
the "hdd" tier.
The detailed mechanism and API are TBD.
The end result is that a simple tier name lookup gives the priority range.
By default all swap tiers are available for global usage outside any
cgroup. That matches the current global swapon behavior.
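To make the lookup concrete, here is a small hypothetical sketch of the
tier-name to priority-range mapping, using the example names and ranges
above. The real mechanism and API are TBD; the upper bound for
"compress_ram" is arbitrary here.

```python
# Hypothetical sketch of the global tier-name -> priority-range lookup.
# Tier names and ranges follow the examples in this email; the real
# kernel mechanism and API are TBD.

# Python's range() end is exclusive, so range(50, 100) covers 50-99.
TIER_RANGES = {
    "compress_ram": range(100, 32768),  # 100 and above (bound arbitrary)
    "ssd": range(50, 100),              # 50-99
    "hdd": range(0, 50),                # 0-49
}

def tier_of(prio):
    """Map a global swap device priority to its tier name."""
    for name, prio_range in TIER_RANGES.items():
        if prio in prio_range:
            return name
    return None
```

A device keeps its one global priority; the tier name is just a
range-based alias over that existing ordering.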
3) Each cgroup will have "swap.tiers" (name TBD) to opt in or out of tiers.
It is a list of tiers, including the default tier, "who shall not be
named".
Here are a few examples. Consider the cgroup hierarchy a/b/c/d, with a
as the first-level cgroup.
a/swap.tiers: "- +compress_ram"
This means who shall not be named is set to opt out, then opt in
compress_ram only: no ssd, no hdd.
Who shall not be named, if specified, has to be the first entry listed
in "swap.tiers".
a/b/swap.tiers: "+ssd"
For cgroup b, who shall not be named is not specified, so the tiers are
appended to the parent's "a/swap.tiers". The effective "a/b/swap.tiers"
becomes "- +compress_ram +ssd".
a/b can use both compress_ram (e.g. zswap) and ssd.
Every time who shall not be named is changed, it drops the parent
swap.tiers chain and starts from scratch.
a/b/c/swap.tiers: "-"
For c, this turns off all swap. The effective "a/b/c/swap.tiers" becomes
"- +compress_ram +ssd -", which simplifies to "-", because the second
"-" overrides all previous opt-in/opt-out results.
In other words, if the current cgroup does not specify who shall not be
named, the lookup walks the parent chain until one does. The global "/"
outside any cgroup defaults to on.
a/b/c/d/swap.tiers: "- +hdd"
For d, only hdd swap, nothing else.
More examples:
"- +ssd +hdd -ssd" simplifies to "- +hdd", which means hdd only.
"+ -hdd": no hdd for you! Use everything else.
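The compounding rules above can be sketched as a small Python model of
the intended semantics (hypothetical, not kernel code; names and
interface are TBD): process the swap.tiers strings from the root cgroup
down, where a bare "+"/"-" (who shall not be named) resets the state and
discards everything before it, and "+name"/"-name" opts a single tier in
or out.

```python
def effective_tiers(chain, all_tiers):
    """Compute the effective tier set for a cgroup.

    chain:     swap.tiers strings from the root cgroup down to the
               cgroup in question (unset files simply omitted).
    all_tiers: the set of globally defined tier names.
    """
    enabled = set(all_tiers)  # global "/" outside any cgroup: all on
    for spec in chain:
        for tok in spec.split():
            if tok == "+":              # default reset: everything on
                enabled = set(all_tiers)
            elif tok == "-":            # default reset: everything off
                enabled = set()
            elif tok.startswith("+"):   # opt one tier in
                enabled.add(tok[1:])
            elif tok.startswith("-"):   # opt one tier out
                enabled.discard(tok[1:])
    return enabled
```

Running the a/b/c/d example through this: a gets {compress_ram}, a/b
gets {compress_ram, ssd}, a/b/c gets the empty set, and a/b/c/d gets
{hdd}, matching the walkthrough above.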
Let me know what you think about the above "swap.tiers" (name TBD) proposal.
Chris