linux-kernel - Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAF8kJuPuOWUEMg6C9AnAA-mddgHRjuMVqURrbk6bUHxAmEvgFQ@mail.gmail.com>
Date: Fri, 5 Sep 2025 16:45:48 -0700
From: Chris Li <chrisl@...nel.org>
To: YoungJun Park <youngjun.park@....com>
Cc: Michal Koutný <mkoutny@...e.com>, 
	akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org, 
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev, 
	shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com, 
	bhe@...hat.com, baohua@...nel.org, cgroups@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com, 
	iamjoonsoo.kim@....com, taejoon.song@....com, 
	Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>, Kairui Song <ryncsn@...il.com>, 
	Wei Xu <weixugc@...gle.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority

Sorry I was busy with some work related tasks and haven't got a chance
to reply to your last email.

On Thu, Sep 4, 2025 at 11:30 PM YoungJun Park <youngjun.park@....com> wrote:
>
> > Yes, that works. I would skip the "add" keyword.
> > Also I notice that we can allow " " in place of "," as a separator as well.
>
> Yes, supporting both " " and "," sounds convenient.

Ack.

>
> > Maybe instead of "remove hdd", just "-hdd" which is similar to how to
> > operate on swap.tiers.
>
> Agreed, "+" for add and "-" for remove is simpler.

Ack.

>
> > Oh, you mean the tier not listed in the above will be deleted.
> > I prefer the above option 1) then.
>
> That makes sense. Option 1) looks simplest overall.

Ack.

>
> > I don't understand what is this "removing" and "in stage"...
> > What is it trying to solve?
>
> That came from an idea to pre-add a new tier before removing another.
> But I now think returning an error on overlap is simpler, so staging is
> not needed.

Ack.

> > What do you mean by "visible"? Previous discussions haven't defined
> > what is visible vs invisible.
>
> By “visible” I meant a staged state becoming active. I realize the term
> was confusing. and it is not needed as I already explained.

Ack.

> > Trigger event to notify user space? Who consumes the event and what
> > can that user space tool do?
>
> I agree, sending user events is unnecessary. It is simpler to let tiers merge or
> be recreated and let the allocator handle it.

Ack.

> > If you remove the
> > swap tier. the range of that tier merges to the neighbour tier.  That
> > way you don't need to worry about the swap file already having an
> > entry in this tier you swap out.
>
> Should the configured mask simply be left as-is,
> even if (a) the same key is later reintroduced with a different order (e.g.,
> first → third), or (b) a merge causes the cgroup to use a lower tier it did not
Can you elaborate a)? I am not sure I understand it correctly.

Let's say we have 4 tier bit, firs two is default on/off.

bit 1: overwrite default on. bit 2: overwrite default off, bit3: zram
on, bit4, zram off

In this example, if we delete zram  how does the mask leave as it is
and follow a)?
I suspect you create a hole in the priority range and make swap
devices in that range not selectable.

> explicitly select?
> I infer that leaving the mask unchanged is acceptable and this concern
> may be unnecessary. if you consider this unnecessary, I am fine
> to follow the simpler direction you suggested.

We can leave them unassign as well. So after delete we have a hole in
the priority range. The swap device in that hole will not get more
swap out written to it, because it is not selectable from the
"swap.tiers". However the swap in can still happen on that device
until the device is swapped off. That sounds fine to me. I actually
like that better. Thank you for the suggestion.

> > If the fast path fails, it will go through the slow path. So the slow
> > path is actually a catch all.
>
> I think my intention may not have come across clearly. I was not trying
> to propose a new optimization, but to describe a direction that requires
> almost no changes from the current behavior. Looking back, I realize the
> ideas I presented may not have looked like small adjustments, even
> though that was my intent.

I see. We don't need to jump to implementation details yet. I can help
you resolve the swap allocator internals as well. Let's have the
user's visible behavior settle down first.

> As a simple approach I had in mind:
> - Fastpath can just skip clusters outside the selected tier.

That part will need to hack the current swap allocator regarding per
cpu cached swap.

> - Slowpath naturally respects the tier bitmask.

Ack.

> - The open point is how to treat the per-CPU cache.

That is what I consider the fast path modification in the swap cluster
based allocator. Let me think about it a bit. On the other hand, we
don't need to worry too much about the internals right now. We get the
effective tier selection bitmask and the swap cluster allocator just
honours it.

>
> If we insert clusters back, tiered and non-tiered cgroups may see
> low-priority clusters. If we skip insertion, tiered cgroups may lose
> caching benefits.

That is why I think the cluster cache which holds the current
allocation offset within a cluster might need to move to per swap tier
level or swap device level. Might be both.

> Chris, do you have another workable approach in mind here, or is this
> close to what you were also thinking?

I think this is close to what I am thinking, just some details haven't
ironed out.

> > In my original proposal, if a parent removes ssd then the child will
> > automatically get it as well.
>
> I now see you mean the effective mask is built by walking parents with local
> settings taking precedence, top to bottom, preferring the nearest local
> setting. Conceptually this yields two data structures: a local-setting mask and
> a runtime/effective mask. Does the above capture your intention, or is there
> anything else I should mention?

I talked to Wei about swap tiers a bit. Add him to the CC as well. He
made me realize that we need two level things in the cgroup
"swap.tiers".
What we have discussed is the  "+ssd" with possible parent lookup
behavior is a kind of operation. The end result of the operation is a
tier bit set to indicate which tier is allowed in the swap out. The
end result tier netmask doenot need to have default, default is part
of the local operation selection. For the operation we also need two
bits per tier so we can specify on/off/missing.

For the operation, each tier will need two bits, including the
default. One bit select this timer off, one bit select this tier on.
e.g. we have 16 tiers including the default, then all 16 tiers take up 32 bits.

For example:
a/swap.tiers: "- +ssd" # default off, only allow ssd.
a/b/swap.tiers: "-ssd" # don't want ssd. the rest inherit the parent.

> A few thoughts aligned with the above:
> - There is no separate “default setting” knob to control inheritance.

If there is no default knob. Every swap.tiers operation will need to
walk from the root to the current cgroup.
With default knob, the walk can terminate at the closest the parent
with the default overridden.

Also without default, how do you start with no swap then each child
has the option to add their own swap tiers?
That will force the parent to list and turn off each tier at the top
level memcg.

> - If unset locally, the effective value is derived by walking the cgroup
>   hierarchy from top to bottom.
Ack. In my mind, with or without local operation value, it needs to
walk the parent unless local has default override. Any local default
override will terminate the parent walking chain.

> - Once set locally, the local setting overrides everything inherited.
I need to think about that a bit. This is different from the original
inheritage appended from the parent operation syntax.
e.g. previously parent has "- +ssd" (nothing but ssd). child has
"+hdd", in the original proposal the effective is "- +ssd +hdd",
If you override the parent, then the child will become "- +hdd" (hdd
only). If the child wants to inherit the parent ssd behavior, it will
need to walk up the parent chain and set "- +ssd +hdd" instead of just
"+hdd".

It seems the append operation has the most flexible inherited
behavior, also more consistent. Need more discussion here.

> - There is no special “default tier” when tiers are absent.
If there is no default tier. Nothing is set in any of the
"swap.tiers", what is the behavior? Are all swap devices usable?
It seems we need the default in the swap.tiers operation. But the
final tier selection bitmask does not need default.

> - If nothing is set anywhere in the hierarchy, the initial mask is treated as
>   fully set at configuration time (selecting all tiers; global swap behavior).
>   However, reading the local file should return an empty value to indicate
>   “not set”.

Seems a bit arbitrary. Global "" is all and local "" is nothing.

> One idea is to precompute the effective mask at interface write time, since
> writes are rarer than swap I/O. You may have intended runtime recomputation
> instead—which approach do you prefer? This implies two masks: a local
> configuration mask and a computed effective mask.

We can have more discussion and give more examples here. I still like
the "- +ssd -zswap" as well as "+ -ssd -hdd", that kind of operation
syntax, which seems to have more consistent inherage story, more
flexible, and less make up rules. It can support incremental add as
well as incremental subtract. I can be convinced otherwise as well. We
just need to allow each tier to have two bits, one bit for off and one
bit for one.

Wei also raises one very important point. Because zswap is not tight
to a swap device. We might want a predefined tier bit to describe
zswap. e.g. the first tier bit is always given to zswap and the tier
name is always zswap, the priority range can be assigned from
mm/swap/tiers interface.

> And below is a spec summary I drafted, based on our discussion so far for
> note and alignment.
> (Some points in this reply remain unresolved, and there are additional TBD items.)
>
> * **Tier specification**
>   - Priority >= 0 range is divided into intervals, each identified by a
>     tier name. The full 0+ range must be covered.
Not necessarily true if we allow removal of the tier and generate
holes removed range as we discussed above. Unless I understand the
previous hole idea incorrectly.

>   - NUMA autobind and tiering are mutually exclusive.
For now. Ack.

>   - Max number of tiers = MAX_SWAPFILES (single swap device can also be
>     assigned as a tier).
That will make it a bit less than 32 tiers I think.

>   - A tier holds references when swap devices are assigned to its
>     priority range. Removal is only possible after swapoff clears the
>     references.

Ack.

>   - Cgroups referencing a tier do not hold references. If the tier is
>     removed, the cgroup’s configured mask is dropped. (TBD)

Ack the TBD part. That also means removing the tier from global tiers
list we will need to walk all cgroups and clear the tier bits to get
used. Because removing is a rare operation, maybe that is fine.

>   - Each tier has an order (tier1 is highest priority) and an internal
The order you mean swap device priority order or the tier internal bit order?
>     bit for allocation logic.
Ack the internal bit.

>   - Until it is set, there is no default tier.

Does that mean you can't do incremental add or incremental subtract tiers?

>     (may internally conceptually used? but not exported)

My suggestion now is "swap.tiers" is an operation rather than a
bitmask. It can include "default", Each tier can select on or off or
missing so 3 operation states. "default" tier has no name, if
specified, must be listed as the first in "swap.tiers"


> * **/sys/kernel/mm/swap/tiers**
>   - Read/write interface. Multiple entries allowed, delimiters: space or
>     comma.
>   - Format:
>       + "tier name":priority  → add (priority and above)
>       - "tier name"           → remove
Ack.

>     Note: a space must follow "+" or "-" before the tier name.
>   - Edge cases:
>       * If not all ranges are specified: input is accepted, but cgroups
>         cannot use incomplete ranges. (TBD)
>         e.g) echo "hdd:50" > /sys/kernel/mm/swap/tiers. (0~49 not specifeid)
Because removing the tier will generate holes in the priority range
anyway. 0-49 is not specified in the same as if a tier is there
previously then gets removed.

>       * Overlap with existing range: removal fails until all swap
>         devices in that range are swapped off.
Specifically a new tier priority landing in the middle of a tier
range, the new tier will split the range with the existing one.

>   - Output is sorted, showing tier order along with name, bit, and
>     priority range. (It may be more user-friendly to explicitly show
>     tier order. (TBD))
Ack.

>
> * **Cgroup interface**
>   - New files (under memcg): memory.swap.tier, memory.swap.tier.effective

I don't think we need two interface files. We can live with one just
"memory.swap.tiers"
We can list the local one first then effective one on the second line
or separate with "#"

cat a/memory.swap.tiers # will show:
- +ssd # ssd
cat a/b/memory.swap.tiers  will show:
+hdd # ssd hdd.  The ssd is from the parent a/memory.swap.tiers.

>     * Read/write: memory.swap.tier returns the local named set exactly
>       as configured (cpuset-like "+/-" tokens; space/comma preserved).
Ack.

>     * Read-only: memory.swap.tier.effective is computed from the cgroup
>       hierarchy, with the nearest local setting taking precedence
>       (similar to cpuset.effective). (TBD)
I suggest that as the second line or after the "#" separator.

>     * Example (named-set display, cpuset-like style)
>
>       Suppose tier order:
>         ssd (tier1), hdd (tier2), hdd2 (tier3), net (tier4)
>
>       Input:
>         echo "ssd-hdd, net" > memory.swap.tier

Ah, that will only allow incremental add, how about incremental subtract?

>       Readback:
>         cat memory.swap.tier
>           ssd-hdd, net     # exactly as configured (named set)

I think it is hard to store the range. I was thinking about the locals
as each tier takes 2 bits, on/off.
In that case if we want the exactly as configured, we need to store
the original expression as well.
Can we show the +ssd -hdd kind of evaluated tier operation there.
Store the original range expression will have complications when
removing the tier. It is complicate to clear the tier from the
original range.
effectively split the range. I was hoping the local internal storage
is the already processed bits of on/off.
Need more discussion here.
>
>         cat memory.swap.tier.effective
>           ssd-hdd, net     # same format; inherited/effective result
You might just spell out all the tier by name. The inherited behavior
will make it different from local, then the same set will have
different expressions to output it, using range or not. Just spell out
the tiers set might be simpler.
This is for debugging only anyway.
Consider joining this with the "swap.iter" output with a separation by
new line or "#".

>   - Inheritance: effective mask built by walking from parent to child,
>     with local settings taking precedence.

I was thinking that the parent "swap.tiers" would be appended with the
child "swap.tiers". Only override when the child uses default on/off.
Need more discussion here.

>   - Mask computation: precompute at interface write-time vs runtime
>     recomputation. (TBD; preference?)

Let's start with runtime. We can have a runtime and cached with
generation numbers on the toplevel. Any change will reset the top
level general number then the next lookup will drop the cache value
and re-evaluate.

>   - Syntax modeled after cpuset:
>       echo "ssd-hdd,net" > memory.swap.tier
>     Here “-” specifies a range and must respect tier order. Items
>     separated by “,” do not need to follow order and may overlap; they
>     are handled appropriately (similar to cpuset semantics).

Need discussion for incremental subtract. That syntax is incremental add.


> * **Swap allocation**
>   - Simple, workable implementation (TBD; to be revisited with
>     measurements).

Ack.

>
> I tried to summarize the discussion and my inline responses as clearly as
> possible. If anything is unclear or I misinterpreted something, please
> tell me and I’ll follow up promptly to clarify. If you have comments, I
> will be happy to continue the discussion. Hopefully this time our
> alignment will be clearer.

Yes, I leave comments on the part I think needs more discussion.

Thanks for the great summary.

Chris