Message-ID: <CACePvbW_Q6O2ppMG35gwj7OHCdbjja3qUCF1T7GFsm9VDr2e_g@mail.gmail.com>
Date: Sat, 30 Aug 2025 00:13:13 -0700
From: Chris Li <chrisl@...nel.org>
To: YoungJun Park <youngjun.park@....com>
Cc: Michal Koutný <mkoutny@...e.com>,
akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev,
shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com,
bhe@...hat.com, baohua@...nel.org, cgroups@...r.kernel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com,
iamjoonsoo.kim@....com, taejoon.song@....com,
Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>, Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
cgroup-based swap priority
On Fri, Aug 29, 2025 at 9:05 PM YoungJun Park <youngjun.park@....com> wrote:
>
> Hi Chris,
>
> Thanks for the detailed feedback, and sorry for the late reply.
Not a problem at all. I have been pretty busy this week and don't have
much time for it either.
> > I think you touch on a very important question that might trigger a
> > big design change. Do we want to have a per tier swap.max? It will
> > specify not only whether this cgroup will enroll into this tier or
> > not. It also controls how much swap it allows to do in this cgroup.
> > The swap.max will follow the straight contain relationship. I would
> > need to think more about the relationship between swap.max and
> > swap.tiers. Initial intuition is that, we might end up with both per
> > tier swap.max, which control resource limit, it has subset contain
> > relationship. At the same time the swap.tiers which control QoS, it
> > does not follow the subset contained.
> >
> > Need more sleep on that.
>
> When I first ideated on this, I also considered per-device max values,
> with 0 meaning exclusion, to implement cases like a cgroup using only
> network swap. At that time the idea was to give each device its own
> counter, so setting it to 0 would imply exclusion. But this approach
> would effectively require maintaining per-device page counters similar
> to the existing swap.max implementation, and the relationship between
> these per-device counters and the global swap.max would need to be
> carefully defined. That made the design significantly heavier than the
> functionality I was aiming for, so I decided to drop it. I read your
> point more as a QoS extension, and I see it as complementary rather
> than a counter argument.
Yes, I slept on it for a few days. I reached a similar conclusion.
I am happy to share my thoughts:
1) FACT: We don't have any support for moving data from one swap
device to another swap device today, and that will not happen
overnight. Reasoning about percentage allocations and maintaining
those percentages is super complicated. I wonder if I am getting ahead
of myself on this feature.
2) FACT: I don't know if any real customers want this kind of
sub-cgroup per-tier swap max adjustment. We should not write imaginary
code for imaginary customers; let's reserve the real coding for
real-world customers. Most of the customers I know, including our
company, care most about the top-level cgroup swap assignment. There
are cases that enable/disable swap devices per sub-cgroup, but in the
QoS sense, not in the swap max usage sense.
I think this will be one good question on which to ask for feedback in
the LPC MC discussion: does anyone care about per-tier max adjustment
in the cgroup? We should only consider it when we have real customers.
So I would shelve this per-tier max adjustment and not spend any more
time on it.
> > First of all, sorry about being pedantic: it should be "swap.tiers", just
> > to be consistent with the rest of the discussion.
> > Secondly, I just view names as aliases of the numbers. With "1-3" it is
> > hard to read what you want.
> > If we allow name as the alias, we can also do:
> > echo zram-hdd > memory.swap.tiers
> >
> > It is exactly the same thing but much more readable.
> >
> > > cg1/cg2: 2-4,6 > memory.swap.tiers (ssd,hdd,network device, somedevice 2, assuming non-subset is allowed)
> >
> > echo ssd-network_device,some_device2 > memory.swap.tiers
> >
> > See, same thing but much more readable what is your intention.
> >
> > BTW, we should disallow space in tier names.
>
> Ack, those spaces were only in my example; the implementation will reject
> spaces in tier names.
>
> I like the interface format you proposed, and I’ll move forward with an
> initial implementation using the name-based tier approach, dropping
> the numeric format.
I am glad you like it.
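To make the proposed syntax concrete, here is a rough user-space sketch
of how a parser for such a tier list might look. The tier names and their
priority order are made up for illustration; this is just a model of the
proposed "name or name-range, comma separated" format, not kernel code:

```python
# Hypothetical tier list, highest priority first; names are examples only.
TIERS = ["zram", "ssd", "hdd", "network_device", "some_device2"]

def parse_tiers(spec):
    """Parse a tier list like "zram-hdd" or "ssd-network_device,some_device2"
    into the set of enrolled tier names. Spaces are rejected outright,
    matching the "disallow space in tier names" rule; allowing '-' inside
    a tier name would similarly break the range syntax."""
    if " " in spec:
        raise ValueError("spaces are not allowed in tier names")
    enrolled = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-", 1)   # a contiguous range of tiers
            i, j = TIERS.index(lo), TIERS.index(hi)
            enrolled.update(TIERS[i:j + 1])
        elif part in TIERS:
            enrolled.add(part)
        else:
            raise ValueError("unknown tier: %s" % part)
    return enrolled
```

So "zram-hdd" enrolls zram, ssd, and hdd, which is the readability win
over "1-3".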
> > We do want to think about swap.tiers vs per tier swap.max. One idea
> > just brainstorming is that we can have an array of
> > "swap.<tiername>.max".
> > It is likely we need to have both kinds of interface. Because
> > "swap.<tiername>.max" specifies the inclusive child limit.
> > "swap.tiers" specifies this C group swap usage QoS. I might not use
> > hdd in this cgroup A, but the child cgroup B does. So A's hdd max
> > can't be zero.
> >
> > The other idea is to specify a percentage for each tier of the
> > swap.max in "swap.tiers.max": zram:30 ssd:70
> > That means zram max is "swap.max * 30%" and ssd max is "swap.max *
> > 70%". The numbers do not need to add up to 100. Each individual
> > number can't be bigger than 100, but the sum can be.
> >
> > Need more sleep on it.
>
> I don’t have additional ideas beyond what you suggested right now. Since swap.max
> is defined in terms of quantity, my intuition is that tier.max should
> probably also be quantity-based, not percentage. As I mentioned earlier,
> I had also considered per-device max in the early RFC stage. The design
> was to introduce per-device counters, but that added substantial overhead
> and complexity, especially in reconciling them with the global swap.max
> semantics. For that reason I abandoned the idea, though I agree your
> suggestion makes sense in the context of QoS extension.
We are in agreement here. We should not touch it until a real customer
asks for it.
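Purely for the record (since we agreed to shelve it), the percentage
idea above would translate to per-tier limits roughly like this; the
helper and numbers are made up, not a proposed implementation:

```python
def tier_limits(swap_max, percents):
    """Compute per-tier limits from "swap.tiers.max"-style percentages,
    e.g. {"zram": 30, "ssd": 70} applied against swap.max. Each value
    must be <= 100; the sum may exceed 100, since the per-tier limits
    are independent caps, not a partition of swap.max."""
    for name, pct in percents.items():
        if not 0 <= pct <= 100:
            raise ValueError("%s: percentage out of range" % name)
    return {name: swap_max * pct // 100 for name, pct in percents.items()}
```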
> At this point I feel the main directions are aligned, so I’ll proceed
> with an initial patch version. My current summary is:
>
> 1. Global interface to group swap priority ranges into tiers by name
> (/sys/kernel/mm/swap/swaptier).
I suggest "/sys/kernel/mm/swap/tiers" just to make the file name look
different from the "swap.tiers" in the cgroup interface.
This former defines all tiers, giving tiers a name and range. The
latter enroll a subset of the tiers.
I think the tier bit location does not have to follow the priority
order. If we allow adding a new tier, the new tier will get the next
higher bit. But the priority it split can insert into the middle thus
splitting an existing tier range. We do need to expose the tier bits
into the user space. Because for madvise() to set tiers for VMA, it
will use bitmasks. It needs to know the name of the bitmask mapping,
I was thinking the mm/swap/tiers read back as one tier a line. show:
name, bitmask bit, range low, range high
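As an illustration of that read-back format, the file might look like the
sample below (the names, bit numbers, and priority ranges are entirely
made up), and parsing it from user space is trivial:

```python
# Hypothetical contents of /sys/kernel/mm/swap/tiers, one tier per line:
# name, bitmask bit, priority range low, priority range high.
SAMPLE = """\
zram 0 100 32767
ssd 1 50 99
hdd 2 -1 49
"""

def parse_tier_table(text):
    """Return {name: (bit, range_low, range_high)} from the sample format.
    This is the mapping madvise() callers would need to build bitmasks."""
    table = {}
    for line in text.splitlines():
        name, bit, lo, hi = line.split()
        table[name] = (int(bit), int(lo), int(hi))
    return table
```

Note how the bit order (zram=0, ssd=1, hdd=2) can stay stable even if a
later tier's priority range is wedged between two existing ones.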
> 2. Slow path allocation uses bitmask skipping; fast path uses per-cpu
> tier cluster caches.
If the fast path fails, it will go through the slow path. So the slow
path is actually a catch-all.
> 3. Cgroup interface format modeled after cpuset.
I am not very familiar with the cpuset part of the interface. Maybe
you should explain that to the reader without using cpuset cgroup as a
reference.
> 4. No inheritance between parent and child cgroup as a perspective of QoS
In my original proposal of "swap.tiers", if the default is not set on
this tier, it will look up the parent until the root memcg. There are
two different tiers bitmask.
One is the local tier bitmask. The other is the effective bitmask.
If local tier bitmask sets the default, the effective tier bitmask ==
local tier bitmask
if local tier bitmask does not set default, The effective tier is
concatenation from parent to this memcg.
For example
a/swap.tiers: - +ssd # ssd only
a/b/swap.tiers: "" # effective "- +ssh", also ssd only.
a/b/c : + -hdd # effective "- +ssd + -hdd", simplify as "+ -hdd" The
'+' overwrite the default, anything before that can be ignored.
That way, if you are not setting anything in "swap.tiers" in the child
cgroup, that is the default behavior when you create a new cgroup.
Changing the parent can change all the child cgroup at the same time.
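The resolution rule above can be sketched in a few lines of Python. The
assumptions (a bare leading '+' or '-' token sets the default to all-on
or all-off and stops the upward walk; "+name"/"-name" add or remove one
tier; tier names are invented) come straight from the example, so treat
this as a model of the proposal, not its implementation:

```python
# Hypothetical full tier set; real names come from /sys/kernel/mm/swap/tiers.
ALL_TIERS = {"zram", "ssd", "hdd"}

def effective_tiers(chain):
    """chain: the local "swap.tiers" strings from the root memcg down to
    this memcg. Returns the effective set of enrolled tiers."""
    # Walk upward from this memcg, collecting settings until one sets a
    # default; anything above that point is overridden and ignored.
    relevant = []
    for setting in reversed(chain):
        tokens = setting.split()
        relevant.append(tokens)
        if tokens and tokens[0] in ("+", "-"):
            break
    enrolled = set()
    for tokens in reversed(relevant):   # apply parent first, child last
        for tok in tokens:
            if tok == "+":
                enrolled = set(ALL_TIERS)   # default: everything on
            elif tok == "-":
                enrolled = set()            # default: everything off
            elif tok.startswith("+"):
                enrolled.add(tok[1:])
            elif tok.startswith("-"):
                enrolled.discard(tok[1:])
    return enrolled
```

Running the a / a/b / a/b/c example through this gives {"ssd"}, {"ssd"},
and ALL_TIERS minus hdd, matching the walkthrough above.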
> 5. Runtime modification of tier settings allowed.
Need to clarify which tier setting? "swap.tiers" or /sys/kernel/mm/swap/tiers?
> 6. Keep extensibility and broader use cases in mind.
>
> And some open points for further thought:
>
> 1. NUMA autobind
> - Forbid tier if NUMA priorities exist, and vice versa?
> - Should we create a dedicated NUMA tier?
> - Other options?
I want to verify and then remove the NUMA autobind from swap later.
That will make things simpler for swap. I think the reason the NUMA
swap binding was introduced no longer exists.
> 2. swap.tier.max
> - percentage vs quantity, and clear use cases.
> - sketch concrete real-world scenarios to clarify usage
Just don't do that. Ignore it until there is a real use case request.
> 3. Possible future extensions to VMA-based tier usage.
madvise(). That can be introduced earlier. One use case I know of for
that is Android: Android does not set up every app as a cgroup. I
haven't checked for a while whether that is still true.
> 4. Arbitrary ordering
> - Do we really need it?
> - If so, maybe provide a separate cgroup interface to reorder tiers.
No for now. We would first need to answer how to deal with the swap
entry LRU order inversion issue.
Chris