Message-ID: <aK2vIdU0szcu7smP@yjaykim-PowerEdge-T330>
Date: Tue, 26 Aug 2025 21:57:05 +0900
From: YoungJun Park <youngjun.park@....com>
To: Chris Li <chrisl@...nel.org>
Cc: Michal Koutný <mkoutny@...e.com>,
akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
muchun.song@...ux.dev, shikemeng@...weicloud.com,
kasong@...cent.com, nphamcs@...il.com, bhe@...hat.com,
baohua@...nel.org, cgroups@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, gunho.lee@....com,
iamjoonsoo.kim@....com, taejoon.song@....com,
Matthew Wilcox <willy@...radead.org>,
David Hildenbrand <david@...hat.com>,
Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
cgroup-based swap priority
> > Therefore, my current thinking is:
> > * The global swap setting itself is tier 1 (if nothing is configured).
> > * If a cgroup has no setting:
> > - Top-level cgroups follow the global swap.
> > - Child cgroups follow their parent’s setting.
> > * If a cgroup has its own setting, that setting is applied.
> > (child cgroups can only select tiers that the parent has allowed.)
>
> That is too restrictive. The most common case is just the parent
> cgroup matters, the child uses the exact same setting as the parent.
> However, if you want the child to be different from the parent, there
> are two cases depending on your intention. Both can make sense.
> 1) The parent is more latency sensitive than the child. That way the
> child will be more (slower) tiered than the parent. Using more tiers is
> slower, that is the inverted relationship. Your proposal does not
> allow this?
> 2) The parent is latency tolerant and the child is latency sensitive.
> In this case, the child will remove some swap files from the parent.
> This is also a valid case, e.g. the parent is just a wrapper daemon
> invoking the real worker as a child. The wrapper just does log
> rotation and restarting the child group with a watchdog, it does not
> need to be very latency sensitive, let say the watchdog is 1 hours.
> The child is the heavy lifter and requires fast response.
>
> I think both cases are possible, I don't see a strong reason to limit
> the flexibility when there is no additional cost. I expect the
> restriction approach having similar complexity.
In my use case, I think a restrictive inheritance model could
be sufficient. My argument was mainly based on the fact that most cgroup
resource distribution mechanisms usually follow a parent→child restrictive
pattern. Through the review, I came to the view that I should adhere to the
common behavior whenever possible.
In the RFC, I initially supported allowing parent/child inconsistency
for flexibility, so I actually agree with your view regarding flexibility.
For the examples you mentioned, I have no disagreement. I think my final
understanding is aligned with yours.
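To make the two inheritance models concrete, here is a rough sketch in C
of how an effective tier set could be resolved. It is not taken from the
patch: struct cg_node, effective_tiers(), and the "bit n-1 means tier n"
encoding are hypothetical names and assumptions used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-cgroup state; bit (n - 1) set means tier n is selected. */
struct cg_node {
    const struct cg_node *parent; /* NULL for a top-level cgroup */
    bool has_setting;             /* false: nothing was written for this cgroup */
    uint32_t tier_mask;           /* only meaningful when has_setting is true */
};

/*
 * Resolve the tiers a cgroup may use.
 *  - No setting: follow the parent (top-level cgroups follow the global set).
 *  - Own setting, restrictive model: limited to what the parent allows.
 *  - Own setting, flexible model: the cgroup's own setting is used as-is.
 */
static uint32_t effective_tiers(const struct cg_node *cg,
                                uint32_t global_mask, bool restrictive)
{
    uint32_t inherited, result;

    if (cg == NULL)
        return global_mask;

    inherited = effective_tiers(cg->parent, global_mask, restrictive);
    if (!cg->has_setting)
        return inherited;

    result = restrictive ? (cg->tier_mask & inherited) : cg->tier_mask;
    /* In the restrictive model a child can never gain a tier. */
    assert(!restrictive || (result & ~inherited) == 0);
    return result;
}
```

With a parent allowing tiers 1-3 and a child asking for tiers 2-4 and 6,
the restrictive model leaves the child with tiers 2-3 only, while the
flexible model gives it exactly what it asked for.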
> Can you clarify what I need to reconsider? I have the very similar
> bitmask idea as you describe now.
> I am not a dictator. I just provide feedback to your usage case with
> my reasoning.
>
Oh! I think you are a good reviewer :D
Okay then, let me explain my preference for numeric tiers in more detail.
It seems we are aligned on the implementation strategy with bitmask,
but I think our difference lies in the interface style: 'name' vs.
'numeric increase'.
1. A simple numeric interface makes the usage more straightforward.
Instead of '+/-' semantics, directly listing the numeric range feels
clearer and easier to use. For example:
tier 1 (ram)
tier 2 (ssd)
tier 3 (hdd)
tier 4 (network device)
tier 5 (some device)
tier 6 (some device2)
cg1: echo 1-3 > memory.swap.tier (ram,ssd,hdd)
cg1/cg2: echo 2-4,6 > memory.swap.tier (ssd,hdd,network device,some device2, assuming non-subset is allowed)
Tier specification can also be expressed simply as arrays of priority
ranges, which feels easy to understand.
2. Since tiers are inherently ordered, numbering fits naturally and is
easier for users to accept.
In my view, assigning a name is mainly useful to distinguish between
otherwise 'indistinguishable' groups, but in this case, there is already
a clear distinction given by the different priorities, which can simply be
characterized by increasing numbers.
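As a rough illustration of how a numeric list such as "1-3" or "2-4,6"
could map onto the bitmask implementation, here is a small userspace
sketch in C. This is only an assumption about one possible parsing, not
code from the patch: parse_tier_spec(), MAX_TIERS, and the "bit n-1 means
tier n" encoding are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_TIERS 32 /* hypothetical limit: one bit per tier in a u32 */

/*
 * Parse a comma-separated list of tiers and tier ranges ("2-4,6") into a
 * bitmask where bit (n - 1) stands for tier n. A single trailing newline
 * (as written by echo) is accepted. Returns 0 on success, -EINVAL otherwise.
 */
static int parse_tier_spec(const char *spec, uint32_t *mask_out)
{
    const char *p = spec;
    uint32_t mask = 0;

    assert(spec != NULL && mask_out != NULL);

    for (;;) {
        unsigned long lo, hi;
        char *end;

        if (*p < '0' || *p > '9')
            return -EINVAL;
        lo = hi = strtoul(p, &end, 10);
        p = end;

        if (*p == '-') {          /* "lo-hi" range */
            p++;
            if (*p < '0' || *p > '9')
                return -EINVAL;
            hi = strtoul(p, &end, 10);
            p = end;
        }

        if (lo < 1 || hi > MAX_TIERS || lo > hi)
            return -EINVAL;
        for (unsigned long t = lo; t <= hi; t++)
            mask |= UINT32_C(1) << (t - 1);

        if (*p == '\0' || (*p == '\n' && p[1] == '\0'))
            break;
        if (*p != ',')
            return -EINVAL;
        p++;
    }

    *mask_out = mask;
    return 0;
}
```

With this encoding, "1-3" becomes 0x07 and "2-4,6" becomes 0x2e, so a
subset check between parent and child (if we wanted one) is a single
mask operation.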
I understand your point that tier names may be more convenient for
administrators, and I see the value in that. That was why I used the word
"reconsider" — your feedback makes sense as well.
I do not have a strong preference. It would be good to align after
considering the pros and cons. I look forward to your thoughts.
> > There seem to be two possible choices:
> >
> > 1. Once a cgroup references a tier, modifying that tier should be
> > disallowed.
>
> Even modify a tier to cover more priority range but no swap device
> falls in that additional range yet?
> I think we should make the change follow the swap on/swap off
> behavior. Once the swap device is swapped on, it can't change its tier
> until it is swapped off again. when it is swapped off, there is no
> cgroup on it. Notice the swap file belongs to which tier is not the
> same as the priority range of the tier. You can modify the range and
> reorder swap tiers as long as it is not causing swap on device jump to
> a different tier.
>
> > 2. Allow tier re-definition even if cgroups are already referencing
> > it.
>
> You can still swap off even if cgroup is still using it.
>
> > Personally, I prefer option (1), since it avoids unexpected changes
> > for cgroups that already rely on a particular tier definition.
>
> Swap off and on already have similar problems. We can't change the
> priority when the swap device is swapon already. We can go through a
> swap off to change it.
I see your point. In practice, when tiers are already being referenced
by cgroups, swap devices may come and go within those tiers. I think
this can be considered a "natural" behavior, as swap management is
usually performed explicitly by the administrator.
From that perspective, I expect that unintended behavior is very
unlikely to occur in real scenarios. So I am comfortable assuming this
implicit behavior when reasoning about tier modifications.
Thanks again for the clarification. With this, the overall picture
feels much clearer. Once we reach alignment on the "named" vs. "numeric"
tier interface, I plan to move forward with the patch work.
Best Regards
Youngjun Park