[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAF8kJuPnaJi=aKFwEknoh-eNgUPoje29EiKApmaWur+GqBGc0g@mail.gmail.com>
Date: Tue, 9 Sep 2025 17:14:51 -0700
From: Chris Li <chrisl@...nel.org>
To: YoungJun Park <youngjun.park@....com>
Cc: Michal Koutný <mkoutny@...e.com>, 
	akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org, 
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev, 
	shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com, 
	bhe@...hat.com, baohua@...nel.org, cgroups@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com, 
	iamjoonsoo.kim@....com, taejoon.song@....com, 
	Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>, Kairui Song <ryncsn@...il.com>, 
	Wei Xu <weixugc@...gle.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority
On Sun, Sep 7, 2025 at 10:40 AM YoungJun Park <youngjun.park@....com> wrote:
>
> Hi, Chris Li
>
> Thank you for your thoughtful and quick feedback.
>
> > > If you remove the
> > > swap tier. the range of that tier merges to the neighbour tier.  That
> > > way you don't need to worry about the swap file already having an
> > > entry in this tier you swap out.
> >
> > Should the configured mask simply be left as-is,
> > even if (a) the same key is later reintroduced with a different order (e.g.,
> > first → third), or (b) a merge causes the cgroup to use a lower tier it did not
>
> Let me clarify my concern with a concrete example.
> Suppose:
> 1. SSD → tier "A" (31–40), HDD → "B" (21–30), HDD2 → "C" (10–20), HDD3 → "D" (0–9)
> 2. A cgroup uses tier "A"
> 3. SSD is swapped off → tier "A" becomes a hole
There is just no swap device in A. A still (31-40).
> 4. Tier "D" is removed
> 5. Tier "A" is reassigned to range (0–9)
> 6. Then a cgroup configured with "A" cannot actually use "A" (0~9)
That would require each cgroup to hold a reference count of A.
So A can't be re-assign while having a cgroup using A. in 5.
> 7. Later a new tier "E" is added and assigned (31–40)
> 8. A cgroup now configured with "E" refers to the same numeric range (31–40),
>    but the meaning has changed compared to when it used "A".
I consider the user reassigning the same tier name with different
ranges is a user error. They want to shoot themself in the foot, we
can't stop them. Maybe we shouldn't even try to stop them. It does not
make sense to complicate things just to prevent users from doing
nonsense things. It has no additional complexity cost, sure.
>
> This feels unintuitive. I would prefer invalidating the mask if the referenced
> tier is removed, so stale references don’t silently point to a different tier.
If there is a life cycle of the invalidation? Forever does not seem to
be good either. It will prevent user reuse the tier range even there
is no cgroup referencing that before.
If you want this kind of invalidation, I suggest just make a reference
count on the "A", each cgroup that references "A" holds a reference
count. It will be tricky to reference count the default on case
though, basically every tier is reference counted.
> > I think my intention may not have come across clearly. I was not trying
> > to propose a new optimization, but to describe a direction that requires
> > almost no changes from the current behavior. Looking back, I realize the
> > ideas I presented may not have looked like small adjustments, even
> > though that was my intent.
> >
> > I see. We don't need to jump to implementation details yet. I can help
> > you resolve the swap allocator internals as well. Let's have the
> > user's visible behavior settle down first.
>
> Ack. Let’s settle user-visible semantics first and defer allocator internals.
> We can revisit per-CPU cluster cache handling as a lower-priority topic when we
> move to patchwork.
Ack.
>
> > I talked to Wei about swap tiers a bit. Add him to the CC as well. He
> > made me realize that we need two level things in the cgroup
> > "swap.tiers".
> > ...
> > For the operation, each tier will need two bits, including the
> > default. One bit select this timer off, one bit select this tier on.
> > e.g. we have 16 tiers including the default, then all 16 tiers take up 32 bits.
>
> My understanding is:
>
> Per tier (2-bit state)
> - `+` → always on (bit 10)
> - `-` → always off (bit 01)
> - missing → inherit from parent (bit 00)
> - `11` is invalid
Right.
>
> Default tier
> - `+` means inherit parent as the base
> - `-` means start from zero (ignore parent)
> - missing means (this is the part I want to confirm) nothing?
+ means override default to "on" for all, allow every tier. (starting
for every tier)
- means override default to "off" for all. disallow every tier.
(starting from zero)
> So in my view “default” is an **inheritance control knob**, whereas in your
> explanation “default” is also a **special tier** with its own 2-bit state.
> Is that the right reading?
It is a knob to override all. Yes it is a special tier wild cast. If
the tier was not mentioned using +tier_name or -tier_name, the tier
uses the default on/off value. If a tier has more than one on/off
operation, the last write wins (closer to the leaf node cgroup) wins.
Therefore, if the cgroup has default override was set, there is no
need to lookup the parent any more, it overrides every tier already.
That provides a way for the child cgroup to overwrite all parent swap
tiers selection.
>
> If my understanding is correct, I’m also happy to adopt the interface format
> you proposed.
>
> Over the weekend I kept thinking about it, and your proposal looks like a
> more flexible interface. It also has clear similarities to how cgroup
> controllers are added, so the format seems acceptable to me.
It is more flexible and I have a simple way to perform the parent
lookup on/off evaluation with short cuts. I send that out in the other
email.
> I have one remaining concern about cgroup semantics.
> The inheritance and resource model we’re discussing seems to diverge
> somewhat from existing cgroup v2 conventions. Since we’ve aligned that
> this effectively acts as QoS control, it also makes me wonder whether we
> should proactively propose a doc update to the “Resource Distribution
> Models” section so the concept is explicitly covered. This may be me
> being overcautious, so I’d appreciate your view.
More documents is better. Yes it diverges from the existing V2
convention as the parent contains the child. QoS is a policy, it is
relative indpendent of parents, unlike the containing relationship.
> > Wei also raises one very important point. Because zswap is not tight
> > to a swap device. We might want a predefined tier bit to describe
> > zswap. e.g. the first tier bit is always given to zswap and the tier
> > name is always zswap, the priority range can be assigned from
> > mm/swap/tiers interface.
>
> Ack. Reserving a predefined tier bit for zswap makes sense.
>
> As a passing thought (not a strong proposal): a few common tiers (e.g., zswap,
> ssd, hdd, remote) could be predefined and non-removable, with users inserting
> custom ranges between or apart from them. For example, if an SSD tier is
> predefined, `swapon` for SSD devices could be limited to that tier—this would
> align with grouping by service speed and nudge users toward sensible configs.
Then we will need to assign a fixed default range for them thus might
make this more complex when people actually define their own tiers.
I think the kernel should avoid making any default scheme on the user
space swap tiers. Just like the software that manages the cgroup
controls it.
There are other complications. e.g. I have priority 3: first SSD drive
1, then priority 2: HDD, then priority 1: SSD driver 2.
Pre-configured SSD tier will not be able to describe this, the first
and third drive in the list is SSD. It create conflict in ordering. I
think it is best to avoid defining any customer definable tier.
> > > * **Tier specification**
> > >   - Priority >= 0 range is divided into intervals, each identified by a
> > >     tier name. The full 0+ range must be covered.
> >
> > Not necessarily true if we allow removal of the tier and generate
> > holes removed range as we discussed above. Unless I understand the
> > previous hole idea incorrectly.
>
> Ack. I prefer allowing holes, so we don’t need to enforce covering the full
> range simply.
> (I had considered usage making full-range coverage coexist with holes,
> but on reflection that doesn’t seem necessary. complicated)
Ack.
>
> > >   - Each tier has an order (tier1 is highest priority) and an internal
> >
> > The order you mean swap device priority order or the tier internal bit order?
>
> I meant the order implied by the priority ranges. In the interface I suggested,
> the `-` operator specifies ordered ranges, so a notion of tier order matters.
> With your format this may not be needed or not that important.
I see. the '-' you mean the '-' between two tiers. The leading '-'
will mean "off".
> > >   - Until it is set, there is no default tier.
> >
> > Does that mean you can't do incremental add or incremental subtract tiers?
>
> > >     (may internally conceptually used? but not exported)
> >
> > My suggestion now is "swap.tiers" is an operation rather than a
> > bitmask. It can include "default", Each tier can select on or off or
> > missing so 3 operation states. "default" tier has no name, if
> > specified, must be listed as the first in "swap.tiers"
>
> When I said “default tier,” I meant a conceptual tier that covers the full
> priority range when nothing is specified. From your reply, your “default”
> sounds closer to a *default value* (inheritance control) rather than a
> standalone tier. Did I get that right?
It is the wild cast tier that controls on/off  for tier names that
haven't been mentioned with "+/-" "on/off" operation.
e.g.:
"- +ssd -hdd", that means default is off, turn on ssd and turn off hdd
(don't need to say that, default is already off).
But if there is also a zswap tier, it is not in the on/off operation
list. It is default to off because the leading "-".
> > >     Note: a space must follow "+" or "-" before the tier name.
> > >   - Edge cases:
> > >       * If not all ranges are specified: input is accepted, but cgroups
> > >         cannot use incomplete ranges. (TBD)
> > >         e.g) echo "hdd:50" > /sys/kernel/mm/swap/tiers. (0~49 not specifeid)
> >
> > Because removing the tier will generate holes in the priority range
> > anyway. 0-49 is not specified in the same as if a tier is there
> > previously then gets removed.
>
> As discussed above, we’re allowing holes, so we can accept inputs that don’t
> cover the full range.
>
> > >       * Overlap with existing range: removal fails until all swap
> > >         devices in that range are swapped off.
> >
> > Specifically a new tier priority landing in the middle of a tier
> > range, the new tier will split the range with the existing one.
>
> If swapoff is complete but removal has not occurred and a new tier comes in,
> we can allow splitting. If a tier reference is still held, splitting should not
> be allowed. A corner case: a tier spans 50–100 but only priorities 55 and 60
> have active swap; inserting a split at 70 (no active refs) — to keep rules
> simple, I’d still **not** accept the split while any references exist anywhere
> in the original range.
In that case you want refcount the tiers by cgroup that reference it.
I am open to suggestion, I haven't give this too deep thought.
>
> > > * **Cgroup interface**
> > >   - New files (under memcg): memory.swap.tier, memory.swap.tier.effective
> >
> > I don't think we need two interface files. We can live with one just
> > "memory.swap.tiers"
> > We can list the local one first then effective one on the second line
> > or separate with "#"
>
> Ack. One file is simpler; show local then effective. For now, a newline
> separator looks clearer than “#”.
Ack.
>
> > >   - Syntax modeled after cpuset:
> > >       echo "ssd-hdd,net" > memory.swap.tier
> >
> > Need discussion for incremental subtract. That syntax is incremental add.
>
> I think the format you suggest (+, -)
> is appropriate from a flexibility perspective.
Ack, thank you.
>
> > > * **Swap allocation**
> > >   - Simple, workable implementation (TBD; to be revisited with
> > >     measurements).
>
> Ack.
Great.
More agreement now.
Chris
Powered by blists - more mailing lists
 
