linux-kernel - Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority

Message-ID: <CACePvbV=OuxGTqoZvgwkx9D-1CycbDv7iQdKhqH1i2e8rTq9OQ@mail.gmail.com>
Date: Tue, 26 Aug 2025 01:19:57 -0700
From: Chris Li <chrisl@...nel.org>
To: YoungJun Park <youngjun.park@....com>
Cc: Michal Koutný <mkoutny@...e.com>, 
	akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org, 
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev, 
	shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com, 
	bhe@...hat.com, baohua@...nel.org, cgroups@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com, 
	iamjoonsoo.kim@....com, taejoon.song@....com, 
	Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>, Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority

On Sun, Aug 24, 2025 at 5:05 AM YoungJun Park <youngjun.park@....com> wrote:
>
> > How do you express the default tier who shall not name? There are
> > actually 3 states associated with default. It is not binary.
> > 1) default not specified: look up parent chain for default.
> > 2) default specified as on. Override parent default.
> > 3) default specified as off. Override parent default.
>
> As I understand, your intention is to define inheritance semantics depending
> on the default value, and allow children to override this freely with `-` and
> `+` semantics. Is that correct?

Right, the "+" and "-" need to be placed at the beginning without a tier
name; then they refer to the default.
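
A minimal sketch of the three-state default lookup described above, not
kernel code. The token syntax ("-" or "+" alone as the first token for the
default, "+name" / "-name" for a named tier), the Cgroup class, and the
rule that a named tier is checked before the default at each level are all
assumptions made for illustration; the thread has not settled them.

```python
class Cgroup:
    """Hypothetical stand-in for a cgroup's swap.tiers state (not kernel code)."""

    def __init__(self, parent=None, default=None, overrides=None):
        self.parent = parent
        # Three states: None = not specified, True = "+", False = "-".
        self.default = default
        # Explicit per-tier choices, e.g. {"ssd": True} for "+ssd".
        self.overrides = dict(overrides or {})


def parse_tiers(text):
    """Parse an assumed syntax such as "- +ssd" into (default, overrides)."""
    default, overrides = None, {}
    for pos, token in enumerate(text.split()):
        sign, name = token[0], token[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"token must start with + or -: {token!r}")
        if name:
            overrides[name] = sign == "+"
        elif pos == 0:
            default = sign == "+"
        else:
            raise ValueError("a bare + or - must come first")
    return default, overrides


def tier_enabled(cgroup, tier, global_default=True):
    """Walk up the parent chain until some level decides about `tier`."""
    node = cgroup
    while node is not None:
        if tier in node.overrides:       # named tier at this level wins
            return node.overrides[tier]
        if node.default is not None:     # state 2 or 3: default overrides parent
            return node.default
        node = node.parent               # state 1: keep looking up the chain
    return global_default                # assumed: nothing set means all tiers on


# Parent disables everything except "ssd"; the child specifies nothing.
parent = Cgroup(*[None], *parse_tiers("- +ssd"))
child = Cgroup(parent=parent)
```

With this model, tier_enabled(child, "ssd") is True and
tier_enabled(child, "hdd") is False, because the child inherits both the
parent's named override and the parent's "-" default.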

>
> When I originally proposed the swap cgroup priority mechanism, Michal Koutný
> commented that it is unnatural for cgroups if a parent attribute is not
> inherited by its child:
> (https://lore.kernel.org/linux-mm/rivwhhhkuqy7p4r6mmuhpheaj3c7vcw4w4kavp42avpz7es5vp@hbnvrmgzb5tr/)

Michal only said you need to provide a way for a child cgroup to inherit
from the parent.
The swap.tiers does provide such a mechanism: just don't override the
default. I would not go so far as to ban overriding the default. It is
useful because there is no need to list every swap tier.

BTW, Michal, I haven't heard any feedback from you since I started the
swap.tiers discussion. If you have any concerns, please do voice them.

> Therefore, my current thinking is:
> * The global swap setting itself is tier 1 (if nothing is configured).
> * If a cgroup has no setting:
>   - Top-level cgroups follow the global swap.
>   - Child cgroups follow their parent’s setting.
> * If a cgroup has its own setting, that setting is applied.
> (child cgroups can only select tiers that the parent has allowed.)

That is too restrictive. In the most common case only the parent
cgroup matters, and the child uses exactly the same setting as the parent.
However, if you want the child to be different from the parent, there
are two cases depending on your intention. Both can make sense.
1) The parent is more latency sensitive than the child. In that case the
child will use more (slower) tiers than the parent. Using more tiers is
slower, so the relationship is inverted. Your proposal does not
allow this?
2) The parent is latency tolerant and the child is latency sensitive.
In this case, the child will remove some swap files from the parent's set.
This is also a valid case, e.g. the parent is just a wrapper daemon
invoking the real worker as a child. The wrapper just does log
rotation and restarts the child group with a watchdog. It does not
need to be very latency sensitive; let's say the watchdog is 1 hour.
The child is the heavy lifter and requires fast response.

I think both cases are possible. I don't see a strong reason to limit
the flexibility when there is no additional cost. I expect the
restrictive approach to have similar complexity.

> This seems natural because most cgroup resource distribution mechanisms follow
> a subset inheritance model.

I don't see a strong reason to make this kind of restriction yet. It
can go both ways. Depending on your viewpoint, having more swap tiers
does not mean being more powerful. It can be less powerful in the
sense that it can slow you down more.

> Thus, in my concept, there is no notion of a “default” value that controls
> inheritance.

Then you need to list all tiers to disable them all. That would be
error-prone if your tier list is long.
>
> > How are you going to store the list of ranges? Just a bitmask integer
> > or a list?
>
> They can be represented as increasing integers, up to 32, and stored as a
> bitmask.

Great, that is what I have in mind as well.
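
A rough sketch of that bitmask idea, for illustration only. It assumes
tiers are numbered 0 to 31 and that a cgroup's selection is written as a
cpuset-style range list such as "0-2,5"; the function names are
hypothetical and not an existing kernel interface.

```python
MAX_TIERS = 32  # upper bound on the number of tiers mentioned in the thread


def tier_mask(indices):
    """Build a bitmask with one bit set per tier index."""
    mask = 0
    for i in indices:
        if not 0 <= i < MAX_TIERS:
            raise ValueError(f"tier index out of range: {i}")
        mask |= 1 << i
    return mask


def parse_range_list(text):
    """Parse an assumed cpuset-style list like "0-2,5" into a bitmask."""
    text = text.strip()
    if not text:
        return 0
    indices = []
    for part in text.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            indices.extend(range(int(lo), int(hi) + 1))
        else:
            indices.append(int(part))
    return tier_mask(indices)


def tier_allowed(mask, tier_index):
    """The simple per-allocation check: is this tier's bit set?"""
    return bool(mask & (1 << tier_index))
```

For example, parse_range_list("0-2,5") gives 0b100111, so tier 5 passes
the check and tier 3 does not.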

> > I feel the tier name is more readable. The number to which actual
> > device mapping is non trivial to track for humans.
>
> Using increasing integers makes it simpler for the kernel to accept a uniform
> interface format, it is identical to the existing cpuset interface, and it
> expresses the meaning of “tiers of swap by speed hierarchy” more clearly in my
> view.

Same.

>
> However, my feeling is still that this approach is clearer both in terms of
> implementation and conceptual expression. I would appreciate it if you could
> reconsider it once more. If after reconsideration you still prefer your

Can you clarify what I need to reconsider? I have a very similar
bitmask idea to the one you describe now.
I am not a dictator. I am just providing feedback on your use case with
my reasoning.

> direction, I will follow your decision.
>
> > I want to add another usage case into consideration. The swap.tiers
> > does not have to be per cgroup. It can be per VMA. [...]
>
> I understand this as a potential extension use case for swap.tier.
> I will keep this in mind when implementing. If I have further ideas here, I
> will share them for discussion.

That means the tiers definition needs to be global, outside of the cgroup.

> > Sounds fine. Maybe we can have "ssd:100 zswap:40 hdd" [...]
>
> Yes, this alignment looks good to me!
>
> > Can you elaborate on that. Just brainstorming, can we keep the
> > swap.tiers and assign NUMA autobind range to tier as well? [...]
>
> That is actually the same idea I had in mind for the NUMA use case.
> However, I doubt if there is any real workload using this in practice, so I
> thought it may be better to leave it out for now. If NUMA autobind is truly
> needed later, it could be implemented then.

I do see a possibility of just removing the NUMA autobind thing if the
default swap behavior is close enough. The recent swap allocator
change has made huge improvements in terms of lock contention and
using smaller locks. The NUMA autobind might not justify the
complexity now. I wouldn't spend too much effort on NUMA for the MVP
of swap.tiers.

> This point can also be revisited during review or patch writing, so I will
> keep thinking about it.

Agree.

> > I feel that that has the risk of  premature optimization. I suggest
> > just going with the simplest bitmask check first then optimize as
> > follow up when needed. [...]
>
> Yes, I agree with you. Starting with the bitmask implementation seems to be
> the right approach.
>
> By the way, while thinking about possible implementation, I would like to ask
> your opinion on the following situation:
>
> Suppose a tier has already been defined and cgroups are configured to use it.
> Should we allow the tier definition itself to be modified afterwards?

If we can set it the first time, we should be able to set it the
second time. I don't recall any example of a kernel parameter that
can only be set once.


> There seem to be two possible choices:
>
> 1. Once a cgroup references a tier, modifying that tier should be disallowed.

Even modifying a tier to cover a larger priority range when no swap device
falls in that additional range yet?
I think we should make the change follow the swapon/swapoff
behavior. Once a swap device is swapped on, it can't change its tier
until it is swapped off again. When it is swapped off, there is no
cgroup on it. Note that which tier a swap file belongs to is not the
same as the priority range of the tier. You can modify the ranges and
reorder swap tiers as long as that does not cause a swapped-on device
to jump to a different tier.
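
One way to picture that rule, as a sketch under assumptions: each tier is
modeled as a name mapped to an inclusive (low, high) priority range, and a
redefinition is accepted only if every swapped-on device stays in the same
tier. The data layout and function names are hypothetical.

```python
def tier_of(priority, tiers):
    """Return the name of the tier whose range contains `priority`, or None."""
    for name, (low, high) in tiers.items():
        if low <= priority <= high:
            return name
    return None


def redefinition_allowed(old_tiers, new_tiers, active_priorities):
    """Allow the change only if no swapped-on device changes tier.

    `active_priorities` holds the priorities of devices that are currently
    swapped on; devices that are swapped off place no constraint.
    """
    return all(
        tier_of(prio, old_tiers) == tier_of(prio, new_tiers)
        for prio in active_priorities
    )


old = {"ssd": (100, 199), "hdd": (0, 39)}
widened = {"ssd": (100, 299), "hdd": (0, 39)}   # extra range holds no device
shifted = {"ssd": (150, 299), "hdd": (0, 149)}  # priority 120 would move to hdd
active = [120, 10]
```

Here redefinition_allowed(old, widened, active) is True, while
redefinition_allowed(old, shifted, active) is False until the device at
priority 120 is swapped off.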

> 2. Allow tier re-definition even if cgroups are already referencing it.

You can still swap off even if a cgroup is still using it.

> Personally, I prefer option (1), since it avoids unexpected changes for
> cgroups that already rely on a particular tier definition.

Swap off and on already have similar problems. We can't change the
priority when the swap device is already swapped on. We can go through a
swap off to change it.

Chris