linux-kernel - Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACePvbXAXbxqRi3_OoiSJKVs0dzuC-021AVaTkE3XOSx7FWvXQ@mail.gmail.com>
Date: Mon, 1 Sep 2025 15:40:51 -0700
From: Chris Li <chrisl@...nel.org>
To: YoungJun Park <youngjun.park@....com>
Cc: Michal Koutný <mkoutny@...e.com>, 
	akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org, 
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev, 
	shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com, 
	bhe@...hat.com, baohua@...nel.org, cgroups@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com, 
	iamjoonsoo.kim@....com, taejoon.song@....com, 
	Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>, Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority

On Mon, Sep 1, 2025 at 9:21 AM YoungJun Park <youngjun.park@....com> wrote:
>
> Overall, the alignment looks good. Among the three points you suggested,
> I agree with (3) cgroup inheritance. I would like to continue the
> discussion on (1) swap tier lifecycle and (2) allocation logic.

Sure.

>
> 1. swap tier lifecycle
> 2. allocation logic
> 3. cgroup inheritance
>
> > > This part relates to my earlier point on runtime modification. My
> > > intention was to only allow setting the tiers globally, and to align
> > > bitmask with priority ranges. For example, an input like:
> > >
> > >   ssd:100, hdd:50, network_swap
> > >
> > > would translate into ranges as 100+ (bit0), 50–99 (bit1), and 0–49
> > > (bit2).
> > >
> > > From your description, I understand you are considering allowing
> > > additive updates, insertions and letting bitmask differ from the range priority. Is
> > > that correct? In that case we probably need a way to distinguish
> >
> > That is right.
>
> Yes, I agree that add/remove semantics can be supported,
> But it was not fully clear whether there was agreement on the full set
> format, I wanted to state explicitly that my preference is to require
> the full set format for simplicity. That said, if staged insertion and
> removal are considered useful, one possible approach is:
>
> (side note! explanation of the interface was somewhat
> descriptive, which may not have been fully clear. If this explanation is
> sufficient to establish the general direction, I will aim to present it
> more concretely in the patch series. Otherwise, I can provide a more
> detailed explanation in a follow-up email.)
>
>   echo "add ssd:100,hdd:50,network_swap" >/sys/kernel/mm/swap/tiers
Yes, that works. I would skip the "add" keyword.
Also I notice that we can allow " " in place of "," as a separator as well.
Let's call it option 1).

>   echo "add new:80"  > /sys/kernel/mm/swap/tiers
>   echo "remove hdd" > /sys/kernel/mm/swap/tiers

Maybe instead of "remove hdd", just "-hdd" which is similar to how to
operate on swap.tiers.

>
> Alternatively, separate files for add, remove, and show could be used to
> represent staged operations:
>
>   echo "ssd:100,hdd:50,network_swap" >/sys/kernel/mm/swap/tiers/add
>   echo "new:80"  > /sys/kernel/mm/swap/tiers/add
>   echo "hdd" > /sys/kernel/mm/swap/tiers/remove

Let's call it option 2)
I feel that we don't need to have both add and remove interface files.
The above use just one file is simpler. I like the above option 1)
with modification: skip "add" keyboard and use "-" to replace "remove"
keyword. I can give more examples if needed.

Don't like option 2).

> When using the fullset approach:
>
>   ssd:100(bit0), hdd:50(bit1), network_device(bit2)

Oh, the fullset you mean you want to specify the bit for the tier name.
Why? The bit selection can happen automatically thus reducing the
change users give to an invalid bit value, e.g. that bit is already
used. Please educate me what usage case you do need to specify the
bits while auto bit selection is not good enough?

> If we remove the ssd layer and add a new tier:
>
>   echo new:80,hdd:50,network_device >/sys/kernel/mm/swap/tiers

Option 3), full set specification.

Oh, you mean the tier not listed in the above will be deleted.
I prefer the above option 1) then.

Notice there is race when you remove stuff, there will be newly added
the tier can be accidentally removed as well. Again, what is the usage
case you can't do with option 1)?

> The show output could display staged state (imaginary output for understanding):
>
>   ssd:100(bit0), new:80(bit3, in stage), hdd:50(bit1, removing), network_device(bit2)

I don't understand what is this "removing" and "in stage", that is
some extra complexity.
What is it trying to solve?

>
> After the hdd tier reference drops to zero:

Drop to zero, how? By swap off or expecting the app using the swap
exits or fault in all swapped pages in that tier?

>
>   ssd:100(bit0), new:80(bit3),  network_device(bit2)

For display we can also make each tier take one line at a time.

>
> > > between “add” and “reset”. Personally, I feel supporting only reset
> > > semantics would make the interface simpler, while still allowing add
> > > semantics when the full set is provided again.
> >
> > The counterpart of "add" is "remove". There are two possible ideas to explore:
> > 1) only allow removing a tier when all swap devices in that tier
> > range have been swapped off.
> > 2) Remove the tier by removing a midpoint from the range. The lower
> > tier automatically gets the range belonging to the tier that was
> > removed. Optionally, you can add another tier back in replacement
> > with different range boundaries. This effectively achieves replacement
> > as well. This approach does not require swapping off the swap device. I
> > like it better. If you want to avoid the race window where the
> > swap device temporarily belongs to the lower tier, you can always swap
> > off the device before performing 2). So 2) can be mixed with 1) as well.
>
> I have already explained this from the perspective of option 2 mixed
> with option 1. Let me clarify one point:
>
> If...
> ssd:100, hdd:50, network_device.
> Insertion above 100 becomes visible after ssd removal,

What do you mean by "visible"? Previous discussions haven't defined
what is visible vs invisible. If you use a new term, please define it.
Is visible available to add a new tier on?

> Insertion above 50 becomes visible after hdd removal,
> Insertion above 0 becomes visible after network_device removal.
>
> It means that as long as the tier exists, the referenced priority ranges
> cannot be overridden.
>
> And Regarding swap_tier object lifecycle:
>
> A swap_tier should not be deleted until all devices in the tier are
> swapped off (As you said, references are held). Therefore, cgroups that reference a
> tier should also hold a reference. Silently dropping a tier is problematic
> from a cgroup perspective.

Nope, that is too much reference to track. Each swap device belongs to
only one tier at a time. The swap device will have a pointer or bit
mask of that tier and bump up that tier's reference count when a swap
device swaps on.

> If we allow this, the implementation should behave as follows as I think:
> If a swap_tier is removed, the cgroup’s tier configuration could be
> marked invalid. This should trigger an event to the cgroup to notify
> user space.

Trigger event to notify user space? Who consumes the event and what
can that user space tool do? That seems to be the extra complexity we
should avoid. Just stick with the swap off behavior. If you remove the
swap tier. the range of that tier merges to the neighbour tier.  That
way you don't need to worry about the swap file already having an
entry in this tier you swap out.

Please don't keep proposing new interfaces for the sake of it. Try to
get the feature done with an absolutely minimal interface introduced.

> > > > > 2. Slow path allocation uses bitmask skipping; fast path uses per-cpu
> > > > >    tier cluster caches.
> > > > If the fast path fails, it will go through the slow path. So the slow
> > > > patch is actually a catch all.
> > >
> > > Do you mean that if the cluster does not belong to the desired tier in
> > > the fast path, it will skip and then fall back to the slow path? If so,
> >
> > I am describing the existing swap cluster allocator behavior. In my
> > mind, we are using the existing cluster swap allocator code, with
> > constraints that only allow swap entry to be allocated from the
> > affected tier bitmask.
> >
> > > the slow path would need to avoid inserting the cluster back into the
> > > cache, otherwise processes with a global swap view may end up using the
> > > wrong tier device(which must be referenced firstly assumed)
> > > Also cgroup which is tier set experience performance degradation
> > > because, there is possibility to try to alloc swap on slowpath most of the time.
> > > Wouldn’t this have performance implications?
> >
> > I think we are mixing two different concepts. There are swap tiers
> > which decide which swap device to use. Then there is the swap
> > allocator to allocate a swap from the allowed list.
> >
> > If we move to the swap tiers, the swap allocator needs to be swap
> > tiers aware. So it might move to per-cgroup cache list or disable the
> > cache for cgroups that haven't been allocating for a while. The
> > allocation logic should be in the allocator, not in the swap tier
> > layer.
> >
> > > I was thinking that maintaining per-tier per-cpu cluster caches would be
> > > simpler. Then each tier manages its own cluster cache, and we only need
> > > an array of per-cpu caches of size “max tiers”.
> >
> > Again, let's not jump to premature optimizations. Do it the simple way
> > first, then let the measurement numbers guide us.
> > It might be per swap file has a cache not necessary per CPU. per-cpu x
> > per-tier the combination is too big, I am worried about caching too
> > much swap clusters. Each cluster is 2M.
>
> You suggested maintaining per-swap-device cluster caches. As an
> alternative, I would like to suggest a per-device per-CPU cache
> approach, which could be simpler from an integration perspective. It

Each device belongs to one tier, that is the same number as per tier x
per cpu or even more. You might end up catching too many clusters.
Morden server has a very high cpu count. I would stay away per CPU if
I can afford it or find other solutions. Per cpu x another high number
is just a liability waiting for bad things to happen.

> would fit more naturally with the existing allocation logic, remove tier
> awareness from the allocator, and should not introduce functional

The current swap allocator is using a per CPU cluster cache. That
cached cluster is only going to belong to one swap device. When the
swap tier is introduced, the swap allocator behavior needs to be
changed anyway. Because we don't have to "fill this device full"
before moving to the next device.

> differences in behavior. Moreover, since SSD devices are likely to be
> concentrated in only a small number of tiers (with one being the "best"
> tier), the number of clusters actually cached at any time would not be
> large. I am not presenting this as the ultimate solution, but rather as
> a simple and reasonably practical approach to consider. I agree that we
> should revisit and evaluate this approach further.

Let's revisit it when we get more detail. This is internal
implementation detail anyway. The users of swap.tiers don't have to
know about it.

>
> > We DO want some parent level control. That is a real customer
> > requirement. The cons with your proposal is that, if you want to
> > change the whole set from top-level cgroup to child cgroups, you need
> > to traverse the hierarchical chain to set each child cgroup. While
> > walking the child tree, more sub-level cgroups may be added, and
> > you could miss newly created cgroups. It becomes a mess.
> >
> > It is much cleaner if we allow the child cgroup to have the default
> > "swap.tiers" empty. Then you just need to set one value at the top-level
> > parent cgroup, and all child cgroups inherit it automatically. A child
> > can overwrite it if desired; by default it inherits from its parent.
> >
> > The whole set of cgroups from top-level including children can map
> > into a Kubernetes pod. It is common to perform adjustments on the
> > whole set atomically. We should support it.
>
> Okay I will adopt default inheritance for pod-level and similar use cases. A
> child cgroup inherits the nearest ancestor’s mask upon creation. If it

You are missing the fact that you need to track two sets of the tier mask.
the "swap.ters" is a local tiername you need to track. Default is empty.

The runtime tier set is the local tier mask walk to the parent, and
collect all parent local bitmasks and aggregate into an effective set.
The effective tier bitmask is used as the swap allocator.

What you purpose still has the same problem at:
1) parent "- +ssd"
2) child "' # same as parent.

If we do what your proposal is. Child will have "ssd" on the bit mask
at creation
When parents remove the "ssd". the child will keep having the "ssd".

In  my original proposal, if a parent removes ssd then the child will
automatically get it as well.

I am not happy that you keep introducing change to my proposal and
introduce the same buggy behavior again and again. Please respect my
time, I am spending my long weekend writing an email to you. Please
don't introduce the same bug again just for the sake of changing
behavior. Be careful and think through what you are proposing.

Chris