Message-ID: <CACePvbVu7-s1BbXDD4Xk+vBk7my0hef5MBkecg1Vs6CBHMAm3g@mail.gmail.com>
Date: Sun, 31 Aug 2025 09:45:36 -0700
From: Chris Li <chrisl@...nel.org>
To: YoungJun Park <youngjun.park@....com>
Cc: Michal Koutný <mkoutny@...e.com>, 
	akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org, 
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev, 
	shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com, 
	bhe@...hat.com, baohua@...nel.org, cgroups@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com, 
	iamjoonsoo.kim@....com, taejoon.song@....com, 
	Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>, Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority

On Sun, Aug 31, 2025 at 6:53 AM YoungJun Park <youngjun.park@....com> wrote:
> > I think this will be one good question to ask feedback in the LPC MC
> > discussion.
>
> Great, looking forward to it at the LPC MC.

Ack

>
> > > At this point I feel the main directions are aligned, so I’ll proceed
> > > with an initial patch version. My current summary is:
> > >
> > > 1. Global interface to group swap priority ranges into tiers by name
> > >    (/sys/kernel/mm/swap/swaptier).
> > I suggest "/sys/kernel/mm/swap/tiers" just to make the file name look
>
> Yes, I also think "/sys/kernel/mm/swap/tiers" is a better fit.
>
> > different from the "swap.tiers" in the cgroup interface.
> > The former defines all tiers, giving each tier a name and a priority
> > range. The latter enrolls a subset of those tiers.
> > I think the tier bit location does not have to follow the priority
> > order. If we allow adding a new tier, the new tier gets the next
> > higher bit, but its priority range can be inserted into the middle,
> > splitting an existing tier's range. We do need to expose the tier
> > bits to user space, because madvise() setting tiers for a VMA will
> > use bitmasks, so it needs to know the name-to-bit mapping. I was
> > thinking /sys/kernel/mm/swap/tiers reads back one tier per line,
> > showing: name, bitmask bit, range low, range high.
>
> This part relates to my earlier point on runtime modification. My
> intention was to only allow setting the tiers globally, and to align
> the bitmask order with the priority ranges. For example, an input like:
>
>   ssd:100, hdd:50, network_swap
>
> would translate into ranges as 100+ (bit0), 50–99 (bit1), and 0–49
> (bit2).
>
> From your description, I understand you are considering allowing
> additive updates and insertions, and letting the bitmask order differ
> from the priority order. Is that correct? In that case we probably
> need a way to distinguish

That is right.
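
For concreteness, here is a hypothetical read-back of
/sys/kernel/mm/swap/tiers in the one-tier-per-line format above,
following the ssd/hdd/network example in this thread (the bit
assignments and range bounds are made up for illustration):

  # cat /sys/kernel/mm/swap/tiers
  ssd      0     100   32767
  hdd      1      50      99
  network  2  -32768      49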

> between “add” and “reset”. Personally, I feel supporting only reset
> semantics would make the interface simpler, while still allowing add
> semantics when the full set is provided again.

The counterpart of "add" is "remove". There are two possible ideas to explore:
1) Only allow removing a tier when all swap devices in that tier's
range have been swapped off.
2) Removing a tier removes a midpoint from the range, so the lower
tier automatically absorbs the range that belonged to the removed
tier. You can then optionally add another tier back with different
range boundaries, which effectively achieves replacement as well. This
approach does not require swapping off the swap devices. I like it
better, because if you don't want the race window where a swap device
temporarily belongs to the lower tier, you can always swap off the
device in question before performing 2). So 2) can actually be mixed
with 1) as well.
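
A minimal user-space sketch of idea 2), just to pin down the
semantics; the struct and function names are hypothetical, not actual
kernel code:

  #include <string.h>

  struct swap_tier {
          const char *name;
          int bit;        /* position in the tier bitmask */
          int prio_low;   /* inclusive priority range */
          int prio_high;
  };

  /*
   * Tiers are kept sorted from highest to lowest priority. Removing
   * tiers[idx] merges its range into the next lower tier, so no swap
   * device needs to be swapped off first.
   */
  static void tier_remove(struct swap_tier *tiers, int ntiers, int idx)
  {
          if (idx + 1 < ntiers)
                  tiers[idx + 1].prio_high = tiers[idx].prio_high;
          memmove(&tiers[idx], &tiers[idx + 1],
                  (ntiers - idx - 1) * sizeof(*tiers));
  }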

>
> > > 2. Slow path allocation uses bitmask skipping; fast path uses per-cpu
> > >    tier cluster caches.
> > If the fast path fails, it will go through the slow path. So the slow
> > path is actually a catch-all.
>
> Do you mean that if the cluster does not belong to the desired tier in
> the fast path, it will skip and then fall back to the slow path? If so,

I am describing the existing swap cluster allocator behavior. In my
mind, we are using the existing cluster swap allocator code, with the
constraint that swap entries can only be allocated from devices in the
allowed tier bitmask.
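
Something along these lines (a rough sketch with purely hypothetical
names; the real constraint would live inside the cluster allocator):

  #include <stddef.h>

  struct swap_dev {
          int tier_bit;   /* which tier this device belongs to */
          /* ... */
  };

  /*
   * Iterate devices in priority order, skipping any whose tier is not
   * in the cgroup's allowed mask.
   */
  static struct swap_dev *pick_device(struct swap_dev **devs, int n,
                                      unsigned long tier_mask)
  {
          for (int i = 0; i < n; i++)
                  if ((tier_mask >> devs[i]->tier_bit) & 1)
                          return devs[i];
          return NULL;    /* no allowed device available */
  }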

> the slow path would need to avoid inserting the cluster back into the
> per-cpu cache, otherwise processes with a global swap view may end up
> using a device from the wrong tier (assuming the cached cluster is
> tried first). Also, a cgroup with a tier set would see performance
> degradation, because it may end up allocating swap via the slow path
> most of the time. Wouldn't this have performance implications?

I think we are mixing two different concepts. There are swap tiers,
which decide which swap devices to use. Then there is the swap
allocator, which allocates a swap entry from the allowed list.

If we move to swap tiers, the swap allocator needs to be tier aware.
It might move to a per-cgroup cache list, or disable the cache for a
cgroup that hasn't been allocating for a while. The allocation logic
should live in the allocator, not in the swap tier layer.

> I was thinking that maintaining per-tier per-cpu cluster caches would be
> simpler. Then each tier manages its own cluster cache, and we only need
> an array of per-cpu caches of size “max tiers”.

Again, let's not jump to premature optimization. Do it the simple way
first, then let the measurement numbers guide us.
It might be that each swap file has a cache, not necessarily per CPU.
The per-cpu x per-tier combination is too big; I am worried about
caching too many swap clusters. Each cluster is 2MB.
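
To put a rough number on that (with hypothetical figures): on a 64-CPU
machine with 8 tiers, a per-cpu x per-tier cache holding one 2MB
cluster each could pin 64 * 8 * 2MB = 1GB of swap space in cluster
caches alone.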

>
> > > 3. Cgroup interface format modeled after cpuset.
> > I am not very familiar with the cpuset part of the interface. Maybe
> > you should explain that to the reader without using cpuset cgroup as a
> > reference.
>
> The similarity with cpuset is only in the text format. Just as
> cpuset.cpus uses a comma-separated list with dash ranges (e.g.
> "0-4,6,8-10"), the swap tier interface would use the same style but
> with tier names. For example:
>   echo ssd-network_device,some_device2 > swap.tiers
> This makes it easy for users to read and modify at runtime, and keeps
> the interface consistent with existing cgroup controls.
> (Reference: https://docs.kernel.org/admin-guide/cgroup-v2.html, Cpuset Interface Files)

Thanks for the explanation. That sounds fine to me.
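
For what it's worth, a small user-space sketch of parsing that
cpuset-style list with tier names instead of numbers (illustrative
only; the kernel side would differ):

  #include <stdio.h>
  #include <string.h>

  /*
   * Parse e.g. "ssd-network_device,some_device2": comma-separated
   * items, each either a single tier name or a dash range of names.
   */
  static void parse_tiers(char *buf)
  {
          char *item;

          while ((item = strsep(&buf, ","))) {
                  char *dash = strchr(item, '-');

                  if (dash) {
                          *dash = '\0';
                          printf("range: %s .. %s\n", item, dash + 1);
                  } else {
                          printf("tier: %s\n", item);
                  }
          }
  }

For the example above it would print the range ssd..network_device
followed by the single tier some_device2.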

>
> > > 4. No inheritance between parent and child cgroup as a perspective of QoS
> > In my original proposal of "swap.tiers", if it is not set on a
> > cgroup, it will look up the parents until the root memcg. ...
>
> My current thought is that it might be simpler to avoid inheritance
> entirely. Since this is a QoS interface rather than a resource limit
> mechanism, inheritance semantics may not be the best fit. I would prefer
> to always override based on what is explicitly set, and otherwise fall
> back to global swap. For example, input like:
>
>   swap.tiers = ssd,network_device,some_device2
>
> would always override the setting directly, without any parent lookup.

We DO want some parent-level control. That is a real customer
requirement. The con with your proposal is that, if you want to change
the whole set from the top-level cgroup down to the child cgroups, you
need to walk the hierarchical chain and set each of the child cgroups.
While you are walking the child tree, there are races with new
sub-level cgroups being added to the tree; you will end up missing the
newly created cgroups. It is a mess.

It is much cleaner if we allow a child cgroup's "swap.tiers" to
default to empty. Then you just need to set one value on the top-level
parent cgroup and all child cgroups get it without exception. A child
can override it if it wants; the default is to inherit from the
parents.

The whole set of cgroups, from the top level down through the
children, can map to a k8s pod. It is common to adjust the whole set
atomically. We should support that.
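
A minimal sketch of that inheritance rule, assuming a hypothetical
per-memcg field (none of these names are real kernel identifiers):

  #define TIER_MASK_ALL (~0UL)    /* unset at the root: all tiers */

  struct memcg {
          struct memcg *parent;
          int tiers_set;          /* has swap.tiers been written? */
          unsigned long tier_mask;
  };

  /*
   * An empty swap.tiers means "inherit": walk up until a cgroup with
   * an explicit setting, falling back to all tiers at the root.
   */
  static unsigned long effective_tier_mask(struct memcg *mc)
  {
          while (mc && !mc->tiers_set)
                  mc = mc->parent;
          return mc ? mc->tier_mask : TIER_MASK_ALL;
  }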

> > > 5. Runtime modification of tier settings allowed.
> > Need to clarify which tier setting? "swap.tiers" or /sys/kernel/mm/swap/tiers?
>
> My earlier comment was about allowing runtime modifications
> to the global /sys/kernel/mm/swap/tiers.

Ack.

> > > 6. Keep extensibility and broader use cases in mind.
> > >
> > > And some open points for further thought:
> > >
> > > 1. NUMA autobind
> > >    - Forbid tier if NUMA priorities exist, and vice versa?
> > >    - Should we create a dedicated NUMA tier?
> > >    - Other options?
> >
> > I want to verify and then remove the NUMA autobind from swap later.
> > That will make things simpler for swap. I think the reason NUMA swap
> > autobind was introduced no longer exists.
>
> Per your suggestion, the question of whether NUMA autobind
> is needed can be addressed in a dedicated discussion later.
> I look forward to it. :)

I was thinking of removing the NUMA autobind feature from the Linux
source code entirely, deleting the code if the measurement numbers
show that NUMA autobind no longer makes much of a difference. The
performance characteristics have changed dramatically with the new
cluster-based swap allocator. It is possible that NUMA autobind no
longer justifies the complexity of existing in the source code.

> For the NUMA autobind removal work, possible directions could be:
>
>   - a runtime toggle (default off),
>   - keep it default on, but gradually flip to default off and
>     eventually remove it entirely,
>   - remove it entirely now.
>
> Not a proposal, just a thought.
>
> In my current patch,
> tier and NUMA priorities are made mutually exclusive so they cannot be set together.

That is one more reason to remove NUMA priorities completely.

> > > 2. swap.tier.max
> > >    - percentage vs quantity, and clear use cases.
> > >    - sketch concrete real-world scenarios to clarify usage
> >
> > Just don't do that. Ignore it until there is a real use case request.
>
> Agreed. It is better to defer until we see a concrete use case.

Ack.

> > > 4. Arbitrary ordering
> > >    - Do we really need it?
> > >    - If so, maybe provide a separate cgroup interface to reorder tiers.
> >
> > No for now. We need to answer how to deal with the swap entry LRU
> > order inversion issue.
>
> Right, if we want to support this usage, your point about LRU order must
> definitely be addressed first.

Ack.

I think we are aligned.

Thanks

Chris
