linux-kernel - Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority

Message-ID: <CAF8kJuMb5i6GuD_-XWtHPYnu-8dQ0W51_KqUk60DccqbKjNq6w@mail.gmail.com>
Date: Fri, 22 Aug 2025 09:48:33 -0700
From: Chris Li <chrisl@...nel.org>
To: YoungJun Park <youngjun.park@....com>
Cc: Michal Koutný <mkoutny@...e.com>, 
	akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org, 
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev, 
	shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com, 
	bhe@...hat.com, baohua@...nel.org, cgroups@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com, 
	iamjoonsoo.kim@....com, taejoon.song@....com, 
	Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>, Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority

On Thu, Aug 21, 2025 at 10:45 PM YoungJun Park <youngjun.park@....com> wrote:
>
> I still believe that the priority based approach has more flexibility,
> and can cover more usage scenarios. That opinion has not changed.

I agree with you on that. It is more flexible that way, no question about it.

I am open to considering your usage scenarios and revisiting the
swap.tiers limitation. I just haven't seen a real usage scenario
yet.

> However, from this discussion I came to clearly understand and agree on
> three points:
>
> 1. The swap.tier idea can be implemented in a much simpler way, and
> 2. It can cover the most important use cases I initially needed, as well
>    as common performance scenarios, without causing LRU inversion.
Glad we are aligned on this.

> 3. A truly necessary usage scenario for arbitrary ordering does not exist.
>    The usage scenario I suggested is imaginary (it is only a possibility).
Wow, that is a surprise for me to see from you. I was expecting
some very complex or special use case that demands arbitrary
ordering. If it is just an imaginary usage scenario, I am very glad we
did not pay the price of extra complexity for imaginary usage.

> I have also considered the situation where I might need to revisit my
> original idea in the future. I believe this would still be manageable
> within the swap.tier framework. For example:

Sure, having an incremental improvement is a good thing. We can always
come back and revisit if the reasoning for the previous decision is
still valid or not.

> * If after swap.tier is merged, an arbitrary ordering use case arises
>   (which you do not consider concrete), it could be solved by allowing
>   cgroups to remap the tier order individually.

Ack.

> * If reviewers later decide to go back to the priority based direction,
>   I think it will still be possible. By then, much of the work would
>   already be done in patch v2, so switching back would not be
>   impossible.

I really doubt that we need to get back to the pure priority approach.

> And also, since I highly respect you for long-time contributions and
> deep thinking in the swap layer, I decided to move the idea forward
> based on swap.tier.

Thank you. I really appreciate you taking the feedback with flexibility.

> For now, I would like to share the first major direction change I am
> considering, and get feedback on how to proceed. If you think this path
> is promising, please advise whether I should continue as patch v2, or
> send a new RFC series or new patch series.
>
> -----------------------------------------------------------------------
> 1. Interface
> -----------------------------------------------------------------------
>
> In the initial thread you replied with the following examples:
>
> > Here are a few examples:
> > e.g. consider the following cgroup hierarchy a/b/c/d, a as the first
> > level cgroup.
> > a/swap.tiers: "- +compress_ram"
> > it means who shall not be named is set to opt out, opting in to
> > compress_ram only, no ssd, no hdd.
> > Who shall not be named, if specified, has to be the first one listed
> > in the "swap.tiers".
> >
> > a/b/swap.tiers: "+ssd"
> > For b cgroup, who shall not be named is not specified, the tier is
> > appended to the parent "a/swap.tiers". The effective "a/b/swap.tiers"
> > becomes "- +compress_ram +ssd"
> > a/b can use both zswap and ssd.
> >
> > Every time the who shall not be named is changed, it can drop the
> > parent swap.tiers chain, starting from scratch.
> >
> > a/b/c/swap.tiers: "-"
> >
> > For c, it turns off all swap. The effective "a/b/c/swap.tiers" becomes
> > "- +compress_ram +ssd -" which simplifies to "-", because the second "-"
> > overwrites all previous optin/optout results.
> > In other words, if the current cgroup does not specify the who shall
> > not be named, it will walk the parent chain until it does. The global
> > "/" for non cgroup is on.
> >
> > a/b/c/d/swap.tiers: "- +hdd"
> > For d, only hdd swap, nothing else.
> >
> > More examples:
> > "- +ssd +hdd -ssd" will simplify to: "- +hdd", which means hdd only.
> > "+ -hdd": No hdd for you! Use everything else.
> >
> > Let me know what you think about the above "swap.tiers"(name TBD)
> > proposal.
>
> My opinion is that instead of mapping priority into named concepts, it
> may be simpler to represent it as plain integers.

In my mind, the tier name is just a lookup to a bit in the bit mask.
Giving it a name makes it easier to distinguish from other numbers,
e.g. the priority number.
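
To make the "name is a lookup to a bit" idea concrete, here is a rough
userspace sketch in Python (not kernel code). The tier table, the bit
assignments and the function names are all assumptions made up for
illustration. It applies a swap.tiers string token by token to a mask,
following the rules from the quoted examples.

```python
# Hypothetical tier table: each tier name maps to one bit in a mask.
TIER_BITS = {"compress_ram": 1 << 0, "ssd": 1 << 1, "hdd": 1 << 2}
ALL_TIERS = sum(TIER_BITS.values())


def apply_tiers(spec: str, mask: int = ALL_TIERS) -> int:
    """Apply a swap.tiers string to a starting mask and return the new mask."""
    for token in spec.split():
        if token == "+":          # default on: start from every tier
            mask = ALL_TIERS
        elif token == "-":        # default off: start from no tier
            mask = 0
        elif token[0] == "+":     # opt in to one named tier
            mask |= TIER_BITS[token[1:]]
        elif token[0] == "-":     # opt out of one named tier
            mask &= ~TIER_BITS[token[1:]]
        else:
            raise ValueError(f"bad token: {token!r}")
    return mask


def names(mask: int) -> list[str]:
    """Translate a mask back to tier names, for display only."""
    return [name for name, bit in TIER_BITS.items() if mask & bit]
```

With this table, names(apply_tiers("- +ssd +hdd -ssd")) gives just hdd,
and "+ -hdd" gives every tier except hdd, which matches the two extra
examples quoted above.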

> (The integers are assigned in sequential order, as explained in the following reply.)
> This would make the interface almost identical to the cpuset style suggested by Koutný.
>
> For example:
>
>   echo 1-8,9-10 > a/swap.tier   # parent allows tier range 1–8 and 9-10

The name should be swap.tiers, since it can hold more than one tier.

How do you express the default tier, the "who shall not be named"
entry? There are actually 3 states associated with the default. It is
not binary.
1) default not specified: look up parent chain for default.
2) default specified as on. Override parent default.
3) default specified as off. Override parent default.

e.g. "- +zswap +ssd" means default off, allow zswap and ssd tiers.
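
The three states and the parent walk can be sketched as below. This is
a userspace Python model only. The SPECS table is copied from the
quoted a/b/c/d example, while own_default() and effective_spec() are
hypothetical helper names, not a proposed kernel API.

```python
from typing import Optional

# Hypothetical per-cgroup settings, taken from the quoted a/b/c/d example.
SPECS = {
    "a": "- +compress_ram",
    "a/b": "+ssd",
    "a/b/c": "-",
    "a/b/c/d": "- +hdd",
}


def own_default(spec: str) -> Optional[bool]:
    """Return True for "+", False for "-", None when no default is given."""
    tokens = spec.split()
    first = tokens[0] if tokens else ""
    return {"+": True, "-": False}.get(first)


def effective_spec(path: str) -> str:
    """Join specs from the nearest cgroup that sets a default down to path."""
    parts = path.split("/")
    chain = []
    for depth in range(len(parts), 0, -1):
        spec = SPECS.get("/".join(parts[:depth]), "")
        chain.insert(0, spec)
        if own_default(spec) is not None:   # states 2 and 3: stop walking
            break
    else:
        chain.insert(0, "+")                # nobody set one: global "/" is on
    return " ".join(s for s in chain if s)
```

Here effective_spec("a/b") returns "- +compress_ram +ssd". For a/b/c
the walk stops at c itself and returns "-", which is the simplified
form of the longer chain in the quoted example.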

>   echo 1-4,9    > a/b/swap.tier # child uses tier 1-4 and 9 within parent's range
>   echo 20   > a/b/swap.tier # invalid: parent only allowed 1-8 and 9-10

How are you going to store the list of ranges? Just a bitmask integer
or a list?
I feel the tier name is more readable. The mapping from a number to
the actual device is non-trivial for humans to track.
Adding a name to a tier object is trivial. Using the name is more convenient.
We might be able to support both if we make up a rule that tier names
can't be pure numbers.
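
For comparison, a rough sketch of the bitmask option for the
cpuset-style list, again as a userspace Python model. The function
names are made up for illustration, and the accepted syntax (comma
separated numbers and N-M ranges) is an assumption based on the quoted
echo examples.

```python
def parse_tier_ranges(text: str) -> int:
    """Turn a cpuset-style list such as "1-8,9-10" into a bitmask."""
    mask = 0
    for part in text.split(","):
        lo, _, hi = part.strip().partition("-")
        first, last = int(lo), int(hi or lo)
        if first > last:
            raise ValueError(f"bad range: {part!r}")
        for tier in range(first, last + 1):
            mask |= 1 << tier
    return mask


def child_allowed(child: str, parent: str) -> bool:
    """A child may only select tiers that the parent also selects."""
    return (parse_tier_ranges(child) & ~parse_tier_ranges(parent)) == 0


def is_tier_name(token: str) -> bool:
    """Names must not be pure numbers, so "20" is a number, not a name."""
    return not token.isdigit()
```

With the parent set to "1-8,9-10", the child value "1-4,9" passes the
subset check and "20" is rejected. The is_tier_name() rule is what
would let one parser accept both names and numbers.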

I want to add another use case for consideration. The swap.tiers
setting does not have to be per cgroup. It can be per VMA. We can
extend the "madvise" syscall so that user space can advise the kernel:
I want this memory range, which contains my private key, to swap only
to zswap, not hdd. That way, if there is an unexpected power-off
event, my private key will not end up on the hdd. Keeping it in RAM or
zswap is fine because their contents are gone after power off.
>
> named concepts can be dealt with by some userland based software solution.
> kernel just gives simple integer mapping concept.
> userland software can abstract it as a "named" tier to user.

The kernel will need to manage the tier object anyway, including which
range it covers, so having a name there is trivial. I consider it just
a convenience for system admins. A bare tier number that maps to yet
another priority number is a bit cryptic.

> Regarding the mapping of names to ranges, as you also mentioned:
>
> > There is a simple mapping of global swap tier names into priority
> > range
> > The name itself is customizable.
> > e.g. 100+ is the "compress_ram" tier. 50-99 is the "SSD" tier,
> > 0-55 is the "hdd" tier.
> > The detailed mechanization and API is TBD.
> > The end result is a simple tier name lookup will get the priority
> > range.
> > By default all swap tiers are available for global usage without
> > cgroup. That matches the current global swap on behavior.
>
> One idea would be to provide a /proc/swaptier interface:

Maybe stay away from '/proc'. Maybe something like "/sys/kernel/mm/swap".
>
>   echo "100 40" > /proc/swaptier
>
> This would mean:
> * >=100 : tier 1
> * 40–99 : tier 2
> * <40   : tier 3
>
> How do you feel about this approach?
Sounds fine. Maybe we can have
"ssd:100 zswap:40 hdd" for the same thing but give a name to the tier
as well. You can still reference the tier by numbers.
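
A rough sketch of how such a "name:floor" string could be turned into a
priority lookup, as a userspace Python model. The format rules assumed
here (every entry except the last carries a lower bound, bounds are
strictly descending) and the function names are my own guesses, since
the actual interface is still TBD.

```python
from bisect import bisect_right


def parse_tier_map(text: str):
    """Parse "ssd:100 zswap:40 hdd" into (names, floors), lowest tier first."""
    entries = text.split()
    if not entries:
        raise ValueError("empty tier map")
    names, floors = [], []
    for i, entry in enumerate(entries):
        name, sep, floor = entry.partition(":")
        is_last = i == len(entries) - 1
        if bool(sep) == is_last:     # only the last entry has no floor
            raise ValueError(f"bad entry: {entry!r}")
        names.append(name)
        if sep:
            floors.append(int(floor))
    if floors != sorted(set(floors), reverse=True):
        raise ValueError("floors must be strictly descending")
    return names[::-1], floors[::-1]


def tier_of(priority: int, names, floors) -> str:
    """Return the tier whose priority range contains the given priority."""
    return names[bisect_right(floors, priority)]
```

For the string above, priorities 100 and up land in the first named
tier, 40 to 99 in the second, and anything below 40 in the last one,
which is the same split as the "100 40" example.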

>
> -----------------------------------------------------------------------
> 2. NUMA autobind
> -----------------------------------------------------------------------
>
> If NUMA autobind is in use, perhaps it is best to simply disallow
> swaptier settings. I expect workloads depending on autobind would rely
> on it globally, rather than per-cgroup. Therefore, when a negative
> priority is present, tier grouping could reject the configuration.

Can you elaborate on that? Just brainstorming: can we keep the
swap.tiers and assign the NUMA autobind range to a tier as well? It is
just negative ranges, so we can assign the negative ranges to, say, a
"NUMA" tier. Then if the swap.tiers contains "ssd NUMA", it is as if
the system only configures ssd and NUMA globally. Frankly I don't think
the NUMA autobind swap matters any more in the new swap allocator. We
can also make up a rule that if swap.tiers is used, there is no NUMA
autobind for that cgroup.

>
> -----------------------------------------------------------------------
> 3. Implementation
> -----------------------------------------------------------------------
>
> My initial thought is to implement a simple bitmask check. That is, in
> the slow swap path, check whether the cgroup has selected the given
> tier. This is simple, but I worry it might lose the optimization of the
> current priority list, where devices are dynamically tracked as they
> become available or unavailable.
>
> So perhaps a better design is to make swap tier an object, and have
> each cgroup traverse only the priority list of the tiers it selected. I
> would like feedback on whether this design makes sense.

I feel that has the risk of premature optimization. I suggest just
going with the simplest bitmask check first, then optimizing as a
follow-up when needed. The bitmask check should still work with the
dynamic lists of swap devices, but I doubt that NUMA autobind makes
much of a difference now.
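
The simplest form of that check could look roughly like this. It is a
userspace Python model of the idea only: the SwapDevice fields and
pick_device() are hypothetical, and the real slow path works on the
kernel's own priority lists rather than sorting on every call.

```python
from dataclasses import dataclass


@dataclass
class SwapDevice:
    name: str
    priority: int
    tier_bit: int        # bit of the tier this device belongs to (assumed)
    free_slots: int


def pick_device(devices, cgroup_mask: int):
    """Walk devices from highest priority down and skip disallowed tiers."""
    for dev in sorted(devices, key=lambda d: d.priority, reverse=True):
        if not (cgroup_mask & dev.tier_bit):
            continue                 # the cgroup did not select this tier
        if dev.free_slots > 0:
            return dev
    return None                      # nothing usable for this cgroup
```

The global device list stays as it is, and the mask only filters it
per cgroup. A per-tier list would be an optimization on top of this.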

Chris