linux-kernel - Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority

Message-ID: <CAF8kJuM4f2W6w29VcHY5mgXVMYmTF4yORKaFky6bCjS1xRek9Q@mail.gmail.com>
Date: Thu, 21 Aug 2025 13:39:51 -0700
From: Chris Li <chrisl@...nel.org>
To: YoungJun Park <youngjun.park@....com>
Cc: Michal Koutný <mkoutny@...e.com>, 
	akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org, 
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev, 
	shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com, 
	bhe@...hat.com, baohua@...nel.org, cgroups@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com, 
	iamjoonsoo.kim@....com, taejoon.song@....com, 
	Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>, Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority

On Wed, Aug 20, 2025 at 7:39 AM YoungJun Park <youngjun.park@....com> wrote:
>
> > > inclusion/exclusion semantics at the cgroup level. The reason I decided not to
> > > go with it is because it lacks flexibility — it cannot express arbitrary
> > > ordering. As noted above, it is impossible to represent arbitrary orderings,
> > > which is why I chose a per-device priority strategy instead.
> >
> > As said, arbitrary orders violate the swap entry LRU orders. You still
> > haven't given me a detailed technical reason why you need arbitrary
> > orders other than "I want a pony".
>
> I believe the examples I provided for arbitrary ordering can be considered
> a detailed technical reason.
> (You responded with Option 1 and Option 2.)

You still have not provided the detailed reason. I understand that you
want arbitrary per-cgroup swap device ordering, but that is a solution,
not the root cause. I want to go one level deeper: why do you want
per-cgroup swap device ordering? What considerations led you to a
per-cgroup ordered list of swap devices rather than another approach?
For example: "I want to preserve the fast swap device mostly for jobs
requiring fast response, and I don't want to fill the fast swap device
with slow jobs' data." That is one of my guesses. Please provide the
background use case and the thinking process that led you to that
conclusion. Right now I am just guessing in the dark. You jumped too
quickly to the conclusion that arbitrary cgroup swap device order is
the only solution.

> > > The `swap.tier` concept also requires mapping priorities to tiers, creating
> > > per-cgroup tier objects, and so forth. That means a number of supporting
> > > structures are needed as well. While I agree it is conceptually well-defined,
> > > I don’t necessarily find it simpler than the per-device priority model.
> >
> > You haven't embraced the swap.tiers ideas to the full extent. I do see
> > it can be simpler if you follow my suggestion. You are imagining a
> > version using swap file priority data struct to implement the swap
> > tiers.
>
> Thank you for the detailed explanation. I think I understood the core points of this concept.
> What I wrote was simply my interpretation — that it can be
> viewed as a well-defined extension of maintaining equal priority dependency
> together with inclusion/exclusion semantics. Nothing more and nothing less.

Good.


> > That is not what I have in mind. The tiers can be just one
> > integer to represent the set of tiers it enrolls and the default. If
> > you follow my suggestion and the design you will have a simpler series
> > in the end.
>
> Through this discussion my intention is to arrive at the best solution,

Ack.

> and I appreciate that you pointed out areas I should reconsider. If you,
> and other reviewers (if somebody gives opinions on it, that will be helpful)
> generally conclude that the tier concept is the right path,

That is why we should make it a more formal proposal and list out the
details to solicit feedback.

> I have a clear willingness to re-propose an RFC and patches
> based on your idea. In that case, since arbitrary ordering would not be
> allowed, I fully agree that the main swap selection logic would become
> simpler than my current implementation.

Thank you. If you can integrate the swap.tiers into your next series,
that would be great. I am worried that I might not have enough time to
implement it myself. I can certainly reason about it and point you in
the right direction as best as I can.

> > The problem is that you pollute your fast tier with very cold swap
> > entry data, that is to your disadvantage, because you will need to
> > swap back more from the slower tier.
> >
> > e.g. you have two pages. Swap entry A will get 2 swap faults, the swap
> > entry B will get 20 swap faults in the next 2 hours. B is hotter than
> > A. Let's say you have to store them one in zswap and the other in hdd.
> > Which one should you store in faster zswap? Obviously swap entry B.
> >
> > It will cause more problems when you flush the data to the lower tier.
> > You want to flush the coldest data first. Please read about the
> > history of zswap writeback and the LRU problem it encountered. The
> > most recent series on the mailing list, zswap storing incompressible
> > pages, is driven precisely by preserving the swap entry LRU order.
> >
> > You really should consider the effect on swap entry LRU ordering
> > before you design the per cgroup swap priority.
>
> Then I would like to ask a fundamental question about priority. Priority is
> a user interface, and the user has the choice. From the beginning, when the
> user sets priorities, there could be a scenario where the slower swap is

Priority is just the global swap file ordering: a swap device with
higher priority is used first.

> given a higher priority and the faster swap is given a lower one. That is
> possible. For example, if the faster device has a short lifetime, a real
> use case might be to consume the slower swap first for endurance, and only
> use the faster swap when unavoidable.

The idea of matching the faster swap with higher priority is just a
strategy to get better performance. It does not mean that priority ==
device speed.
If the user wants to choose another priority strategy, perhaps with
slower performance, that is OK. They will get what they ask for.
We as kernel developers design the system to be as simple as possible
while achieving good performance. Basically, we allow the good strategy
to happen easily. I wouldn't go overboard to change the meaning of
priority.
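
To illustrate the point that priority is only an ordering, here is a
minimal Python sketch. It is not kernel code: the `SwapDevice` class,
the `pick_device` helper, and the device names are hypothetical, and
round-robin among equal-priority devices is left out.

```python
from dataclasses import dataclass


@dataclass
class SwapDevice:
    name: str
    priority: int    # higher value is tried first
    free_slots: int


def pick_device(devices):
    """Return the highest-priority device that still has free space."""
    for dev in sorted(devices, key=lambda d: d.priority, reverse=True):
        if dev.free_slots > 0:
            return dev
    return None


# Performance-oriented setup: the fast device gets the higher priority.
perf = [SwapDevice("zram0", 10, 2), SwapDevice("hdd", 5, 100)]

# Endurance-oriented setup: the slow device is deliberately ranked first.
endurance = [SwapDevice("ssd", 5, 2), SwapDevice("hdd", 10, 100)]
```

The selection logic is the same in both setups; only the numbers the
user assigned differ.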

> In this case, logically from the LRU perspective there is no inversion of
> priority order, but in practice the slower device is filled first. That
> looks like degradation from a performance perspective — but it is exactly
> what the user intended.

You touch on a very good point: how to mix the global order and the
per-memcg order.

> The swap tier concept appears to map priority semantics directly to service
> speed, so that higher priority always means faster service. This looks like
> it enforces the choice on the user (but it is open).

Yes and no. We should allow the better-performing strategy to happen
easily while keeping code complexity low. That is what I am trying to
do here.

> Even with swap tiers, under the semantics you suggested, it is possible for
> a given cgroup to use only the slower tier. From that cgroup’s view there
> is no LRU inversion, but since the fast swap exists and is left unused, it
> could still be seen as an "inverse" in terms of usage.

Yes, if you put all the fast tiers in one group. It needs to be
discussed case by case. That is exactly what I am asking for: what use
case do you have in mind that demands per-cgroup priority? We can
analyze the use case and come up with creative solutions before we
jump to a conclusion. You can, for example, divide the swap space into
two groups. A1 and A2 are both fast tiers; B1 and B2 are both slow
tiers. A cgroup that always fills in A-then-B order uses group 1 (A1
and B1). A cgroup that wants to fill B first and then A uses group 2
(A2 and B2). Groups 1 and 2 never mix. You then still maintain LRU
order: when B2 fills up and the cgroup starts to use A2, it does not
upset the A1 LRU, because they are different swap devices in different
groups.
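
As a rough illustration of the two-group idea, here is a minimal Python
sketch. It is a toy model, not kernel code: the `Device` class, the
`swap_out` helper, and the capacities are hypothetical, and each
device's `entries` list only stands in for that device's LRU.

```python
class Device:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.entries = []    # oldest first; stands in for this device's LRU

    def full(self):
        return len(self.entries) >= self.capacity


def swap_out(fill_order, entry):
    """Store entry on the first device in fill_order that has room."""
    for dev in fill_order:
        if not dev.full():
            dev.entries.append(entry)
            return dev.name
    return None


A1, B1 = Device("A1", 2), Device("B1", 2)    # group 1: fast, slow
A2, B2 = Device("A2", 2), Device("B2", 2)    # group 2: fast, slow

group1 = [A1, B1]    # fill fast first, then slow
group2 = [B2, A2]    # fill slow first, then fast

placed1 = [swap_out(group1, f"g1-{i}") for i in range(3)]
placed2 = [swap_out(group2, f"g2-{i}") for i in range(3)]
```

In this model, group 2 overflowing from B2 into A2 never touches
`A1.entries`, so the ordering kept for group 1's fast device is
undisturbed.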

If you describe your usage situation and the challenges it faces in
more detail, I can give a more detailed solution comparing per-cgroup
priority with swap.tiers. That is why your use case and reasoning are
important.

> In summary, what I struggle to understand is that if the major assumption
> is that swap operation must always align with service speed, then even swap
> tiers can contradict it (since users may deliberately prefer the lower
> tier). In that case, wouldn’t the whole concept of letting users select swap
> devices by priority itself also become a problem?

Yes, if you keep them in one group and mix them. See the above 1 & 2
group option.

>
> > > I mentioned already on this mail: what swap tiers cannot do is arbitrary
> > > ordering. If ordering is fixed globally by tiers, some workloads that want to
> > > consume slower swap devices first (and reserve faster devices as a safety
> > > backend to minimize swap failures) cannot be expressed. This kind of policy
> > > requires arbitrary ordering flexibility, which is possible with per-device
> > > priorities but not with fixed tiers.
> >
> > Let's say you have fast tier A and slow tier B.
> >
> > Option 1) All swap entries go through the fast tier A first. As time
> > goes on, the colder swap entry will move to the end of the tier A LRU,
> > because there is no swap fault happening to those colder entries. If
> > you run out of space in A, then you flush the end of the A LRU to B.
> > If the swap fault does happen within a relatively short period of
> > time, it will be served by the faster tier A.
> >
> > That is a win compared to your proposal of going directly to B, where
> > more swap faults will be served by B than in option 1).
> >
> > option 2) Just disable fast tier A in the beginning, only use B until
> > B is full. At some point B is full, you want to enable fast tier A.
> > Then it should move the head LRU from B into A. That way it still
> > maintains the LRU order.
> >
> > option 1) seems better than 2) because it serves more swap faults from
> > faster tier A.
>
> Option 1 does not really align with the usage scenario I had in mind,
> since it starts from the fast swap. Option 2 fits partially, but requires
> controlling when to enable the fast tier once full, and handling LRU
> movement — which adds complexity.

Why do you want to fill up the slower device first? You haven't
answered that question in detail. You are asking for a behavior
because you already determined you want this behavior. You need to go
deeper to the root cause why you want this behavior. What is your
ultimate goal? There might be other solutions addressing your ultimate
goal without using the behavior you choose.

> Your final suggestion of Option 1 seems consistent with your original
> objection: that the system design should fundamentally aim at performance
> improvement by making use of the fast swap first.

You did not give me a reason why option 1) violates your goal. I feel
that your goal is already fixated on the swap order. That is only the
outcome of your thought process. You haven't shown us how you came to
that conclusion.

> > > And vswap possible usage: if we must consider vswap (assume we can select it
> > > like an individual swap device), where should it be mapped in the tier model?
> > > (see https://lore.kernel.org/linux-mm/CAMgjq7BA_2-5iCvS-vp9ZEoG=1DwHWYuVZOuH8DWH9wzdoC00g@mail.gmail.com/)
> >
> > The swap tiers do not depend on vswap, you don't need to worry about that now.
>
> I initially understood vswap could also be treated as an
> identity selectable in the unified swap framework. If that were the case, I
> thought it would be hard to map vswap into the tier concept. Was that my
> misinterpretation?

Your series, assuming it adopts swap.tiers, is likely to get in before
vswap does. If that is the case, that problem is for vswap to solve.
Let's work on this incrementally, one step at a time.

> > The per cgroup swap tiers integer bitmask is simpler than maintaining
> > a per cgroup order list. It might be the same complexity in your mind,
> > I do see swap tiers as the simpler one.
>
> I agree that from the perspective of implementing the main swap selection
> logic, tiers are simpler. Since arbitrary ordering is not allowed, a large
> part of the implementation complexity can indeed be reduced.

Exactly. We can start with this simple case and address the main
problem. If there is a special case where we need a different order, we
can add it later. It makes sense to have a simple and clean solution
that addresses the majority of use cases first. The most common usage I
see is to let latency-sensitive jobs use the faster tiers and overflow
to a slower tier if necessary. Latency-insensitive jobs just use the
slower tiers.
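
To make the common case concrete, here is a minimal Python sketch of a
per-cgroup tier bitmask. It is only an illustration under assumptions:
the `TIER_FAST` and `TIER_SLOW` bits, the device table, and the `pick`
helper are hypothetical names, not an existing kernel interface.

```python
TIER_FAST = 1 << 0
TIER_SLOW = 1 << 1

# Hypothetical device table: one tier bit and one global priority each.
devices = [
    {"name": "zram0", "tier": TIER_FAST, "prio": 10, "free": 1},
    {"name": "hdd", "tier": TIER_SLOW, "prio": 5, "free": 10},
]


def pick(devices, tier_mask):
    """Walk devices in global priority order, skipping tiers not in tier_mask."""
    for dev in sorted(devices, key=lambda d: d["prio"], reverse=True):
        if dev["tier"] & tier_mask and dev["free"] > 0:
            dev["free"] -= 1
            return dev["name"]
    return None


# A latency-sensitive cgroup may use both tiers and overflows to the slow one.
latency_sensitive = TIER_FAST | TIER_SLOW
# A latency-insensitive cgroup is limited to the slow tier.
latency_insensitive = TIER_SLOW
```

The global order is never rearranged per cgroup here; the mask only
removes tiers from consideration, which is why a single integer per
cgroup is enough in this model.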

> Once again, thank you for your thoughtful comments and constructive feedback.

You are most welcome.


Chris