Message-ID: <aKROKZ9+z2oGUJ7K@yjaykim-PowerEdge-T330>
Date: Tue, 19 Aug 2025 19:12:57 +0900
From: YoungJun Park <youngjun.park@....com>
To: Chris Li <chrisl@...nel.org>
Cc: Michal Koutný <mkoutny@...e.com>,
akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
muchun.song@...ux.dev, shikemeng@...weicloud.com,
kasong@...cent.com, nphamcs@...il.com, bhe@...hat.com,
baohua@...nel.org, cgroups@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, gunho.lee@....com,
iamjoonsoo.kim@....com, taejoon.song@....com,
Matthew Wilcox <willy@...radead.org>,
David Hildenbrand <david@...hat.com>,
Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
cgroup-based swap priority
On Sat, Aug 16, 2025 at 12:15:43PM -0700, Chris Li wrote:
First of all, thank you for the detailed and fast feedback!
> I have not questioned the approach you can achieve with your goal. The
> real question is, is this the best approach to consider to merge into
Yes, I believe this could be the best approach.
I have compared several possible approaches before making this proposal. These
are the alternatives I reviewed in the RFC:
(https://lore.kernel.org/linux-mm/20250612103743.3385842-1-youngjun.park@lge.com/)
The relevant part is quoted below:
> Evaluated Alternatives
> ======================
> 1. **Per-cgroup dedicated swap devices**
> - Previously proposed upstream [1]
> - Challenges in managing global vs per-cgroup swap state
> - Difficult to integrate with existing memory.limit / swap.max semantics
> 2. **Multi-backend swap device with cgroup-aware routing**
> - Considered sort of layering violation (block device cgroup awareness)
> - Swap devices are commonly meant to be physical block devices.
> - Similar idea mentioned in [2]
> 3. **Per-cgroup swap device enable/disable with swap usage control**
> - Expand swap.max with zswap.writeback usage
> - Discussed in context of zswap writeback [3]
> - Cannot express arbitrary priority orderings
> (e.g. swap priority A-B-C on cgroup C-A-B impossible)
> - Less flexible than per-device priority approach
> 4. **Per-namespace swap priority configuration**
> - In short, make swap namespace for swap device priority
> - Overly complex for our use case
> - Cgroups are the natural scope for this mechanism
In my view, the `swap.tier` proposal aligns quite well with alternative (3) that
I reviewed. That approach keeps the global priority assignment while adding
inclusion/exclusion semantics at the cgroup level. I decided not to go with it
because it lacks flexibility: as noted above, it cannot express arbitrary
orderings, which is why I chose a per-device priority strategy instead.
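As an illustration of the "A-B-C on cgroup C-A-B" case above, here is a minimal sketch (not code from the patch series; the function names are hypothetical). It assumes three devices with a fixed global order and enumerates the orders a cgroup can observe under each model: filtering the global order (include/exclude only) versus assigning its own per-device priorities.

```python
from itertools import permutations

# Hypothetical global priority order of three swap devices (highest first).
GLOBAL_ORDER = ["A", "B", "C"]

def orders_by_filtering(global_order):
    """Orders reachable when a cgroup may only include or exclude devices.

    The relative order of the remaining devices is always the global one.
    """
    result = set()
    n = len(global_order)
    for mask in range(1, 2 ** n):
        subset = tuple(d for i, d in enumerate(global_order) if mask & (1 << i))
        result.add(subset)
    return result

def orders_by_priority(devices):
    """Orders reachable when a cgroup assigns its own per-device priorities."""
    result = set()
    for k in range(1, len(devices) + 1):
        result.update(permutations(devices, k))
    return result

filtered = orders_by_filtering(GLOBAL_ORDER)
prioritized = orders_by_priority(GLOBAL_ORDER)

print(("C", "A", "B") in filtered)     # False
print(("C", "A", "B") in prioritized)  # True
```

Under these assumptions every filtered order is also reachable with per-cgroup priorities, but not the other way around.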
> the main line Linux kernel. Merging into the main line kernel has a
> very high bar. How is it compared to other alternative approaches in
> terms of technical merit and complexity trade offs.
Since you seem most concerned about complexity, I have been thinking more about
this point.
1. **Conceptual complexity**
The idea is simply to add a swap priority list per cgroup. This is
straightforward to understand. The more complicated part is NUMA priority
handling — but if that turns out to be too complex, we can drop it entirely
or adjust its semantics to reduce the complexity.
2. **Implementation complexity**
Could you clarify from which perspective you see implementation complexity as
problematic? I would like to know more specifically what part worries you.
The `swap.tier` concept also requires mapping priorities to tiers, creating
per-cgroup tier objects, and so forth. That means a number of supporting
structures are needed as well. While I agree it is conceptually well-defined,
I don’t necessarily find it simpler than the per-device priority model.
> Why would I trade a cleaner less complex approach for a more complex
> approach with technical deficiency not able to address (inverting swap
> entry LRU ordering)?
Could you elaborate on what exactly you mean by “inverting swap entry LRU order”?
Do you mean that because of per-cgroup priority differences, entries on the
global swap LRU list could become inconsistent when viewed from different
cgroups? If that is the case, could you explain more concretely what problems
such inconsistencies would cause? That would help me understand the concern
better.
> Let me clarify. LPC is not required to get your series merged. Giving
> a talk in LPC usually is an honor. It does not guarantee your series
> gets merged either. It certainly helps your idea get more exposure and
> discussion. You might be able to meet some maintainers in person. For
> me, it is nice to meet the person to whom I have been communicating by
> email. I was making the suggestion because it can be a good topic for
> LPC, and just in case you might enjoy LPC. It is totally for your
> benefit. Up to your decision, please don't make it a burden. It is
> not.
>
> If after your consideration, you do want to submit a proposal in LPC,
> you need to hurry though. The deadline is closing soon.
I see, thank you for the suggestion. I also think having the chance to discuss
this at LPC would be very beneficial for me. I will not see it as a burden. If
I decide to go forward, I will let you know right away (by the end of this week).
> From the swap file point of view, when it needs to flush some data to
> the lower tiers, it is very hard if possible for swap file to maintain
> per cgroup LRU order within a swap file.
Could you explain in more detail why the flush operation is difficult in that
case? I would like to understand what the concrete difficulty is.
> It is much easier if all the swap entries in a swap file are in the
> same LRU order tier.
This is related to the same question above — I would appreciate a more
detailed explanation because it is not yet clear to me. Why is it easy?
> The swap.tiers idea is not a compromise, it is a straight win. Can you
> describe what per cgroup per swap file can do while swap.tiers can
> not?
As I mentioned earlier in this mail, what swap tiers cannot do is arbitrary
ordering. If ordering is fixed globally by tiers, some workloads that want to
consume slower swap devices first (and reserve faster devices as a safety
backend to minimize swap failures) cannot be expressed. This kind of policy
requires arbitrary ordering flexibility, which is possible with per-device
priorities but not with fixed tiers.
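The "slow device first, fast device as safety backend" policy can be sketched as follows. This is only an illustration under assumed names (`pick_device`, the `fast`/`slow` devices, and the slot counts are hypothetical), not how the kernel allocator is implemented.

```python
def pick_device(order, free_slots):
    """Return the first device in `order` that still has a free slot.

    Decrements the slot count of the chosen device; returns None when
    every device in the cgroup's order is full (swap failure).
    """
    for dev in order:
        if free_slots.get(dev, 0) > 0:
            free_slots[dev] -= 1
            return dev
    return None

# Hypothetical capacities: the fast device is globally preferred,
# but this cgroup wants to consume the slow device first.
free = {"fast": 2, "slow": 1}
cgroup_order = ["slow", "fast"]

picks = [pick_device(cgroup_order, free) for _ in range(4)]
print(picks)  # ['slow', 'fast', 'fast', None]
```

The fast device is touched only after the slow one is exhausted, which is an order that a globally fixed fast-to-slow tiering would not produce for this cgroup.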
There is also the possible use of vswap: if we must consider vswap (assuming it can
be selected like an individual swap device), where should it be mapped in the tier model?
(see https://lore.kernel.org/linux-mm/CAMgjq7BA_2-5iCvS-vp9ZEoG=1DwHWYuVZOuH8DWH9wzdoC00g@mail.gmail.com/)
In my opinion, it cannot be mapped purely by service speed.
There are indeed situations where tiering by service speed is beneficial,
but I also believe priority-based ordering can capture the same intention
while also covering exceptional use cases.
So, I see the per-device priority approach as more general: it can represent
tier-based usage, but also more flexible policies that tiers alone cannot cover.
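To make the "priorities can represent tiers" claim concrete, here is a small sketch under stated assumptions: the device names and both helper functions are hypothetical, and devices sharing a priority are treated as one group, in the same way that equal-priority swap areas are used round-robin today.

```python
def tiers_to_priorities(tiers):
    """Map an ordered list of tiers (fastest first) to per-device priorities."""
    prio = {}
    for rank, tier in enumerate(tiers):
        for dev in tier:
            prio[dev] = len(tiers) - rank  # higher number = used earlier
    return prio

def usage_order(prio):
    """Group devices by priority, highest priority group first."""
    levels = sorted(set(prio.values()), reverse=True)
    return [sorted(d for d, p in prio.items() if p == lvl) for lvl in levels]

# Hypothetical tier layout, fastest tier first.
tiers = [["zram0"], ["ssd0", "ssd1"], ["hdd0"]]
prio = tiers_to_priorities(tiers)

print(prio)               # {'zram0': 3, 'ssd0': 2, 'ssd1': 2, 'hdd0': 1}
print(usage_order(prio))  # [['zram0'], ['ssd0', 'ssd1'], ['hdd0']]
```

Round-tripping the tier layout through priorities recovers the same grouping, while a priority table can additionally hold orders that do not follow service speed.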
> It obviously will introduce new complexity. I want to understand the
> reason to justify the additional complexity before I consider such an
> approach.
I think that any new concept adds complexity, whether it is “swap.tier” or
per-device priority. If you could clarify more precisely what kind of
complexity you are most concerned about, I would be happy to give my detailed
thoughts in that direction.
Thank you again for your prompt and thoughtful feedback :). I will continue
thinking about this further while awaiting your reply.
Best regards,
Youngjun Park