[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aMQ+pzsINPPE8EQ6@yjaykim-PowerEdge-T330>
Date: Sat, 13 Sep 2025 00:39:19 +0900
From: YoungJun Park <youngjun.park@....com>
To: Chris Li <chrisl@...nel.org>
Cc: Michal Koutný <mkoutny@...e.com>,
akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
muchun.song@...ux.dev, shikemeng@...weicloud.com,
kasong@...cent.com, nphamcs@...il.com, bhe@...hat.com,
baohua@...nel.org, cgroups@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, gunho.lee@....com,
iamjoonsoo.kim@....com, taejoon.song@....com,
Matthew Wilcox <willy@...radead.org>,
David Hildenbrand <david@...hat.com>,
Kairui Song <ryncsn@...il.com>, Wei Xu <weixugc@...gle.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
cgroup-based swap priority
Hello Chris Li :D
> On Sun, Sep 7, 2025 at 10:40 AM YoungJun Park <youngjun.park@....com> wrote:
> >
> > Hi, Chris Li
> >
> > Thank you for your thoughtful and quick feedback.
> >
> > > > If you remove the
> > > > swap tier. the range of that tier merges to the neighbour tier. That
> > > > way you don't need to worry about the swap file already having an
> > > > entry in this tier you swap out.
> > >
> > > Should the configured mask simply be left as-is,
> > > even if (a) the same key is later reintroduced with a different order (e.g.,
> > > first → third), or (b) a merge causes the cgroup to use a lower tier it did not
> >
> > Let me clarify my concern with a concrete example.
> > Suppose:
> > 1. SSD → tier "A" (31–40), HDD → "B" (21–30), HDD2 → "C" (10–20), HDD3 → "D" (0–9)
> > 2. A cgroup uses tier "A"
> > 3. SSD is swapped off → tier "A" becomes a hole
>
> There is just no swap device in A. A still (31-40).
I implicitly meant that A would be removed. After step 3 SSD is swapped off,
tier A would be REMOVED via the interface.
>
> > 4. Tier "D" is removed
> > 5. Tier "A" is reassigned to range (0–9)
> > 6. Then a cgroup configured with "A" cannot actually use "A" (0~9)
>
> That would require each cgroup to hold a reference count of A.
> So A can't be re-assign while having a cgroup using A. in 5.
Yes, exactly.
I also think cgroups should hold references.
If we only rely on swap devices for references, complexity decreases, but
the issue I mentioned could still occur.
> > 7. Later a new tier "E" is added and assigned (31–40)
> > 8. A cgroup now configured with "E" refers to the same numeric range (31–40),
> > but the meaning has changed compared to when it used "A".
>
> I consider the user reassigning the same tier name with different
> ranges is a user error. They want to shoot themself in the foot, we
> can't stop them. Maybe we shouldn't even try to stop them. It does not
> make sense to complicate things just to prevent users from doing
> nonsense things. It has no additional complexity cost, sure.
I also disagree with reassigning the same name.
Assuming no cgroup tier references exist, I was referring to a situation where
"A" is removed and then reassigned.
In that case, the cgroup that originally specified "A" would be unable to
reference its intended tier range.
If we align on holding cgroup references, the concern I raised would be
cleanly resolved.
>
> >
> > This feels unintuitive. I would prefer invalidating the mask if the referenced
> > tier is removed, so stale references don't silently point to a different tier.
>
> If there is a life cycle of the invalidation? Forever does not seem to
> be good either. It will prevent user reuse the tier range even there
> is no cgroup referencing that before.
>
> If you want this kind of invalidation, I suggest just make a reference
> count on the "A", each cgroup that references "A" holds a reference
> count. It will be tricky to reference count the default on case
> though, basically every tier is reference counted.
Yes, I agree!
> > > I talked to Wei about swap tiers a bit. Add him to the CC as well. He
> > > made me realize that we need two level things in the cgroup
> > > "swap.tiers".
> > > ...
> > > For the operation, each tier will need two bits, including the
> > > default. One bit select this timer off, one bit select this tier on.
> > > e.g. we have 16 tiers including the default, then all 16 tiers take up 32 bits.
> >
> > My understanding is:
> >
> > Per tier (2-bit state)
> > - `+` → always on (bit 10)
> > - `-` → always off (bit 01)
> > - missing → inherit from parent (bit 00)
> > - `11` is invalid
>
> Right.
>
> >
> > Default tier
> > - `+` means inherit parent as the base
> > - `-` means start from zero (ignore parent)
> > - missing means (this is the part I want to confirm) nothing?
>
> + means override default to "on" for all, allow every tier. (starting
> for every tier)
> - means override default to "off" for all. disallow every tier.
> (starting from zero)
Ack.
So for the missing case, is it "nothing"?
Isn't it that we inherit from the parent and then incrementally add or
remove from there?
> > So in my view "default" is an **inheritance control knob**, whereas in your
> > explanation "default" is also a **special tier** with its own 2-bit state.
> > Is that the right reading?
>
> It is a knob to override all. Yes it is a special tier wild cast. If
> the tier was not mentioned using +tier_name or -tier_name, the tier
> uses the default on/off value. If a tier has more than one on/off
> operation, the last write wins (closer to the leaf node cgroup) wins.
>
> Therefore, if the cgroup has default override was set, there is no
> need to lookup the parent any more, it overrides every tier already.
> That provides a way for the child cgroup to overwrite all parent swap
> tiers selection.
>
> >
> > If my understanding is correct, I'm also happy to adopt the interface format
> > you proposed.
> >
> > Over the weekend I kept thinking about it, and your proposal looks like a
> > more flexible interface. It also has clear similarities to how cgroup
> > controllers are added, so the format seems acceptable to me.
>
> It is more flexible and I have a simple way to perform the parent
> lookup on/off evaluation with short cuts. I send that out in the other
> email.
Ack. Thank you once again for the detailed pseudo-code proposal.
> > I have one remaining concern about cgroup semantics.
> > The inheritance and resource model we're discussing seems to diverge
> > somewhat from existing cgroup v2 conventions. Since we've aligned that
> > this effectively acts as QoS control, it also makes me wonder whether we
> > should proactively propose a doc update to the "Resource Distribution
> > Models" section so the concept is explicitly covered. This may be me
> > being overcautious, so I'd appreciate your view.
>
> More documents is better. Yes it diverges from the existing V2
> convention as the parent contains the child. QoS is a policy, it is
> relative indpendent of parents, unlike the containing relationship.
Yes, I'll think about this more and share if I have something clearer to
discuss.
> > > Wei also raises one very important point. Because zswap is not tight
> > > to a swap device. We might want a predefined tier bit to describe
> > > zswap. e.g. the first tier bit is always given to zswap and the tier
> > > name is always zswap, the priority range can be assigned from
> > > mm/swap/tiers interface.
> >
> > Ack. Reserving a predefined tier bit for zswap makes sense.
> >
> > As a passing thought (not a strong proposal): a few common tiers (e.g., zswap,
> > ssd, hdd, remote) could be predefined and non-removable, with users inserting
> > custom ranges between or apart from them. For example, if an SSD tier is
> > predefined, `swapon` for SSD devices could be limited to that tier—this would
> > align with grouping by service speed and nudge users toward sensible configs.
>
> Then we will need to assign a fixed default range for them thus might
> make this more complex when people actually define their own tiers.
> I think the kernel should avoid making any default scheme on the user
> space swap tiers. Just like the software that manages the cgroup
> controls it.
>
> There are other complications. e.g. I have priority 3: first SSD drive
> 1, then priority 2: HDD, then priority 1: SSD driver 2.
> Pre-configured SSD tier will not be able to describe this, the first
> and third drive in the list is SSD. It create conflict in ordering. I
> think it is best to avoid defining any customer definable tier.
Ack. Thank you for your thoughts.
> > > > * **Tier specification**
> > > > - Priority >= 0 range is divided into intervals, each identified by a
> > > > tier name. The full 0+ range must be covered.
> > >
> > > Not necessarily true if we allow removal of the tier and generate
> > > holes removed range as we discussed above. Unless I understand the
> > > previous hole idea incorrectly.
> >
> > Ack. I prefer allowing holes, so we don't need to enforce covering the full
> > range simply.
> > (I had considered usage making full-range coverage coexist with holes,
> > but on reflection that doesn't seem necessary. complicated)
>
> Ack.
I'm reconsidering this part.
Previously, I thought about holes because we weren't holding cgroup's tier
references.
If there's no swap device but a cgroup references a tier, and we remove
that tier and let it merge, it would be very confusing for the cgroup
holding the reference. They implicitly uses other tier's the swap devices
(That's why I proposed cgroup mask invalidation + holes.)
If cgroups hold tier references, suppose a specific assigned middle tier
disappears.
Then, merging this tier seems more natural.
Since no swap device or cgroup is using the disappearing tier, merging
the tier's priority itself shouldn't be problematic.
>From the perspective of the tier being merged into, since there are no
swap devices in the merging tier, it won't end up using unintended swap
devices, so it shouldn't be an issue.
So I'm thinking maybe we should go back to full range coverage + tier
merge.
>
> Great.
>
> More agreement now.
Yes, we seem to agree on most things.
Thank you again for the review!
Best Regards
Youngjun Park
Powered by blists - more mailing lists