linux-kernel - Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority

Message-ID: <CACePvbUJSk23sH01msPcNiiiYw7JqWq_7xP1C7iBUN81nxJ36Q@mail.gmail.com>
Date: Tue, 26 Aug 2025 07:30:59 -0700
From: Chris Li <chrisl@...nel.org>
To: YoungJun Park <youngjun.park@....com>
Cc: Michal Koutný <mkoutny@...e.com>, 
	akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org, 
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev, 
	shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com, 
	bhe@...hat.com, baohua@...nel.org, cgroups@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com, 
	iamjoonsoo.kim@....com, taejoon.song@....com, 
	Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>, Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority

On Tue, Aug 26, 2025 at 5:57 AM YoungJun Park <youngjun.park@....com> wrote:
> > I think both cases are possible, I don't see a strong reason to limit
> > the flexibility when there is no additional cost. I expect the
> > restriction approach having similar complexity.
>
> In my use case, I think a restrictive inheritance model could
> be sufficient. My argument was mainly based on the fact that most cgroup
> resource distribution mechanisms usually follow a parent→child restrictive
> pattern. Through the review, I came to the view that I should adhere to the
> common behavior whenever possible.

I slept on it a bit, both literally and philosophically. I would like
to point out that most cgroup controls are about resource
constraints. For example, if you set a memory limit on the top-level
cgroup, none of the children can go beyond that limit. It does not
make sense for a child's usage to exceed its parent's usage. This is
a strict mathematical subset (containment) relationship. That is the
deeper reason behind the pattern of children being more restrictive
than their parents: mathematically, it does not make sense otherwise.
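
To make the containment point concrete, here is a minimal Python
sketch (not kernel code). The Cgroup class and the effective_limit
helper are hypothetical names used only for illustration. The one
assumption is that a cgroup's effective limit is the minimum of its
own limit and every ancestor's limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    name: str
    limit: Optional[int] = None       # None means "max" (no limit set at this level)
    parent: Optional["Cgroup"] = None


def effective_limit(cg: Cgroup) -> Optional[int]:
    """Smallest limit found on the path from cg up to the root."""
    result = None
    node = cg
    while node is not None:
        if node.limit is not None:
            result = node.limit if result is None else min(result, node.limit)
        node = node.parent
    return result


top = Cgroup("top", limit=100)
child = Cgroup("child", limit=500, parent=top)  # asks for more than its parent
print(effective_limit(child))  # the parent's 100 still wins
```

Whatever the child writes, it can only narrow what the parent
allows. That is the subset relationship in code form.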

The swap file control is different. What we really want here is not a
resource limit; we have swap.max for that. swap.tiers is about QoS
control. From the QoS point of view, there is no such strict subset
relationship. The QoS of the parent and the child can be independent.
Therefore, it is justifiable to have an anti-pattern here, because
QoS is not a resource-limit type of constraint. It is more like a
policy.

We shouldn't adhere to the common behavior just because other cgroup
interfaces do it. Here I believe we have a justifiable reason to break
away from it: it is a different type of control, QoS vs. limit.

I think you touched on a very important question that might trigger a
big design change. Do we want to have a per-tier swap.max? It would
specify not only whether this cgroup enrolls in a tier, but also how
much swap the cgroup is allowed to use in that tier. The per-tier
swap.max would follow the strict containment relationship. I need to
think more about the relationship between swap.max and swap.tiers. My
initial intuition is that we might end up with both: a per-tier
swap.max, which controls the resource limit and follows the subset
containment relationship, and swap.tiers, which controls QoS and does
not.

Need more sleep on that.

> Firstly(on RFC), I initially supported allowing parent/child inconsistency
> for flexibility, so I actually agree with your view regarding flexibility.
> For the examples you mentioned, I have no disagreement. I think my final
> understanding is aligned with yours.
>
> > Can you clarify what I need to reconsider? I have the very similar
> > bitmask idea as you describe now.
> > I am not a dictator. I just provide feedback to your usage case with
> > my reasoning.
> >
>
> Oh! I think you are a good reviewer :D
> Okay then, Let me explain my preference for numeric tiers in more detail.
> It seems we are aligned on the implementation strategy with bitmask,
> but I think our difference lies in the interface style — 'name' vs.
> 'numeric increase'."
>
> 1. A simple numeric interface makes the usage more straightforward.
>    Instead of '+/-' semantics, directly listing the numeric range feels
>    clearer and easier to use. For example:

I am not against it. There might be some small aspect of it here and
there to fine tune.

>      tier 1 (ram)
>      tier 2 (ssd)
>      tier 3 (hdd)
>      tier 4 (network device)
>      tier 5 (some device)
>      tier 6 (some device2)
>
>    cg1: echo 1-3  > memory.swap.tier (ram,ssd,hdd)

First of all, sorry for being pedantic, but it should be "swap.tiers",
just to be consistent with the rest of the discussion.
Secondly, I just view names as aliases for the numbers. With 1-3 it is
hard to read what you want.
If we allow names as aliases, we can also do:
echo zram-hdd > memory.swap.tiers

It is exactly the same thing but much more readable.

>    cg1/cg2: 2-4,6  > memory.swap.tie (ssd,hdd,network device, somedevice 2, assuming non-subset is allowed)

echo ssd-network_device,some_device2 > memory.swap.tiers

See, same thing, but your intention is much more readable.

BTW, we should disallow spaces in tier names.
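
As a rough illustration that names and numbers are the same thing
underneath, here is a minimal Python sketch of a parser that accepts
either form and produces a bitmask. The tier table, the regular
expression for valid names, and the function names are all
hypothetical; nothing here is an existing kernel interface. The sketch
assumes names may not contain spaces or '-', since '-' is the range
separator.

```python
import re

# Hypothetical tier table, ordered from fastest to slowest.
# Position + 1 is the tier number.
TIER_NAMES = ["zram", "ssd", "hdd", "network_device", "some_device", "some_device2"]

# Names must not contain spaces or '-' and must not start with a digit.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tier_index(token: str) -> int:
    """Map a tier number ("2") or a name alias ("ssd") to a 0-based bit index."""
    if token.isdigit():
        index = int(token) - 1
        if not 0 <= index < len(TIER_NAMES):
            raise ValueError(f"unknown tier number: {token!r}")
        return index
    if not NAME_RE.fullmatch(token):
        raise ValueError(f"invalid tier name: {token!r}")
    if token not in TIER_NAMES:
        raise ValueError(f"unknown tier name: {token!r}")
    return TIER_NAMES.index(token)


def parse_tiers(spec: str) -> int:
    """Turn "2-4,6" or "ssd-network_device,some_device2" into a bitmask."""
    mask = 0
    for part in spec.split(","):
        first, sep, last = part.partition("-")
        lo = tier_index(first)
        hi = tier_index(last) if sep else lo
        if lo > hi:
            raise ValueError(f"reversed range: {part!r}")
        for bit in range(lo, hi + 1):
            mask |= 1 << bit
    return mask


# Both spellings select the same tiers.
print(bin(parse_tiers("2-4,6")))
print(bin(parse_tiers("ssd-network_device,some_device2")))
```

With this kind of parser the numeric form stays available for anyone
who prefers it, and a name containing a space is rejected rather than
silently misread.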

>
>    Tier specification can also be expressed simply as arrays of priority
>    ranges, which feels easy to understand.

The number-to-device mapping is just harder for humans to process. I
think the named alias makes sense. There is an advantage in being able
to control it with bash through sysfs rather than through a dedicated
user space swap tiers control tool. You can still write a user space
tool if you want. I want the user space tool to be optional.
It is the same thing under the hood anyway.

> 2. Since tiers are inherently ordered, numbering fits naturally and is
>    easier for users to accept.
>    In my view, assigning a name is mainly useful to distinguish between
>    otherwise 'indistinguishable' groups, but in this case, there is already
>    a clear distinction given by the different priorities, which can
>    simply be characterized by an increasing number.
>
> I understand your point that tier names may be more convenient for
> administrators, and I see the value in that. That was why I used the word
> "reconsider" — your feedback makes sense as well.

I still prefer to use the name myself. I am not against having numbers
if you prefer numbers more. You can configure it with numbers. I have
a small brain and I want to use names as aliases when configuring.

> I do not have a strong preference. It would be good to align after
> considering the pros and cons. I look forward to your thoughts."

The name is a huge usability improvement for mere mortals. IMHO, I
don't want to maintain user space tools just to adjust swap.tiers. I
am not opposed to someone else having such tools, but they need to be
optional.

> > > There seem to be two possible choices:
> > >
> > > 1. Once a cgroup references a tier, modifying that tier should be
> > >    disallowed.
> >
> > Even modify a tier to cover more priority range but no swap device
> > falls in that additional range yet?
> > I think we should make the change follow the swap on/swap off
> > behavior. Once the swap device is swapped on, it can't change its tier
> > until it is swapped off again. when it is swapped off, there is no
> > cgroup on it. Notice the swap file belongs to which tier is not the
> > same as the priority range of the tier. You can modify the range and
> > reorder swap tiers as long as it is not causing swap on device jump to
> > a different tier.
> >
> > > 2. Allow tier re-definition even if cgroups are already referencing
> > >    it.
> >
> > You can still swap off even if cgroup is still using it.
> >
> > > Personally, I prefer option (1), since it avoids unexpected changes
> > > for cgroups that already rely on a particular tier definition.
> >
> > Swap off and on already have similar problems. We can't change the
> > priority when the swap device is swapon already. We can go through a
> > swap off to change it.
>
> I see your point. In practice, when tiers are already being referenced
> by cgroups, swap devices may come and go within those tiers. I think
> this can be considered a "natural" behavior, as swap management is
> usually performed explicitly by the administrator.
>
> From that perspective, I expect that unintended behavior is very
> unlikely to occur in real scenarios. So I am comfortable assuming this
> implicit behavior when reasoning about tier modifications.
>
> Thanks again for the clarification. With this, the overall picture
> feels much clearer. Once we reach alignment on the "named" vs. "numeric"
> tier interface, I plan to move forward with the patch work.

I consider that really trivial. Why can't we have both? The madvise
interface might only use numbers, in the form of a bitmask, because
that is a C interface. For sysfs and administrative control, having a
name as an alias is so much better.

We do want to think about swap.tiers vs. a per-tier swap.max. One
idea, just brainstorming, is that we can have an array of
"swap.<tiername>.max" files.
It is likely we need both kinds of interface, because
"swap.<tiername>.max" specifies the limit inclusive of children, while
"swap.tiers" specifies this cgroup's own swap usage QoS. I might not
use hdd in cgroup A, but the child cgroup B does, so A's hdd max
can't be zero.
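
A minimal Python sketch of how the two interfaces could interact,
under stated assumptions: the Cgroup class, tier_budget, and may_use
are hypothetical names, the per-tier max is assumed to be hierarchical
in the same way as swap.max (minimum over the ancestors), and
swap.tiers is assumed not to be inherited at all.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Cgroup:
    name: str
    tiers: Set[str]                      # swap.tiers: QoS choice, not inherited
    tier_max: Dict[str, int] = field(default_factory=dict)  # missing key = no limit
    parent: Optional["Cgroup"] = None


def tier_budget(cg: Cgroup, tier: str) -> Optional[int]:
    """Hierarchical limit for one tier: the minimum over cg and its ancestors."""
    budget = None
    node = cg
    while node is not None:
        limit = node.tier_max.get(tier)
        if limit is not None:
            budget = limit if budget is None else min(budget, limit)
        node = node.parent
    return budget


def may_use(cg: Cgroup, tier: str) -> bool:
    """cg may swap to a tier only if it enrolled in it and the budget is nonzero."""
    return tier in cg.tiers and tier_budget(cg, tier) != 0


# A does not use hdd itself, but leaves hdd budget for its child B.
a = Cgroup("A", tiers={"zram", "ssd"}, tier_max={"hdd": 1 << 30})
b = Cgroup("B", tiers={"hdd"}, parent=a)
print(may_use(a, "hdd"), may_use(b, "hdd"))
```

If A's hdd max were set to zero, B would be locked out of hdd as well,
which is why the limit and the QoS selection need to be separate
knobs in this model.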

The other idea is to specify, in "swap.tiers.max", a percentage of
swap.max for each tier. That would take the place of
"swap.<tiername>.max":
zram:30  ssd:70
That means the zram max is "swap.max * 30%" and the ssd max is
"swap.max * 70%". The numbers do not need to add up to 100, but no
single number can be bigger than 100.
The sum can be bigger than 100.
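
The arithmetic can be sketched in a few lines of Python. This is only
an illustration of the brainstormed format: parse_tier_shares and
tier_limits are hypothetical names, the "name:percent" syntax is
assumed from the example above, and integer rounding down is an
arbitrary choice.

```python
from typing import Dict


def parse_tier_shares(spec: str) -> Dict[str, int]:
    """Parse "zram:30  ssd:70" into {"zram": 30, "ssd": 70}."""
    shares = {}
    for item in spec.split():
        name, sep, value = item.partition(":")
        if not sep or not name or not value.isdigit():
            raise ValueError(f"malformed entry: {item!r}")
        percent = int(value)
        if percent > 100:
            raise ValueError(f"{name}: {percent} is larger than 100")
        shares[name] = percent
    return shares


def tier_limits(swap_max: int, spec: str) -> Dict[str, int]:
    """Per-tier limit = swap.max * percent / 100, rounded down."""
    shares = parse_tier_shares(spec)
    return {name: swap_max * percent // 100 for name, percent in shares.items()}


print(tier_limits(1000, "zram:30  ssd:70"))
# The shares are independent caps, so their sum may exceed 100.
print(tier_limits(1000, "zram:80  ssd:80"))
```

Each entry is checked on its own, so a single value above 100 is
rejected while an over-committed total is accepted.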

Need more sleep on it.

Chris