Message-ID: <CAF8kJuPUouN4c6V-CaG7_WQUAvRxBg02WRxsMtL56_YTdTh1Jg@mail.gmail.com>
Date: Tue, 19 Aug 2025 17:52:39 -0700
From: Chris Li <chrisl@...nel.org>
To: YoungJun Park <youngjun.park@....com>
Cc: Michal Koutný <mkoutny@...e.com>,
akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev,
shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com,
bhe@...hat.com, baohua@...nel.org, cgroups@...r.kernel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com,
iamjoonsoo.kim@....com, taejoon.song@....com,
Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>, Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
cgroup-based swap priority
On Tue, Aug 19, 2025 at 3:13 AM YoungJun Park <youngjun.park@....com> wrote:
>
> On Sat, Aug 16, 2025 at 12:15:43PM -0700, Chris Li wrote:
>
> First of all, thank you for the detailed and fast feedback!
>
> > I have not questioned whether you can achieve your goal with this
> > approach. The real question is, is this the best approach to merge into
>
> Yes, I believe this could be the best approach.
> I have compared several possible approaches before making this proposal. These
> are the alternatives I reviewed in the RFC:
> (https://lore.kernel.org/linux-mm/20250612103743.3385842-1-youngjun.park@lge.com/)
> The part I mentioned is quoted below:
>
> > Evaluated Alternatives
> > ======================
> > 1. **Per-cgroup dedicated swap devices**
> > - Previously proposed upstream [1]
> > - Challenges in managing global vs per-cgroup swap state
> > - Difficult to integrate with existing memory.limit / swap.max semantics
> > 2. **Multi-backend swap device with cgroup-aware routing**
> >    - Considered a sort of layering violation (block device cgroup awareness)
> > - Swap devices are commonly meant to be physical block devices.
> > - Similar idea mentioned in [2]
> > 3. **Per-cgroup swap device enable/disable with swap usage control**
> > - Expand swap.max with zswap.writeback usage
> > - Discussed in context of zswap writeback [3]
> > - Cannot express arbitrary priority orderings
> > (e.g. swap priority A-B-C on cgroup C-A-B impossible)
> > - Less flexible than per-device priority approach
> > 4. **Per-namespace swap priority configuration**
> > - In short, make swap namespace for swap device priority
> > - Overly complex for our use case
> > - Cgroups are the natural scope for this mechanism
>
> In my view, the `swap.tier` proposal aligns quite well with alternative (3) that
> I reviewed. That approach keeps the global priority assignment while adding
Not the same as option 3. swap.tiers has one level of indirection for
the tier class. It does not directly operate on swap files. That level
of indirection allows swap files to rotate within the same tier. I
expect there to be very few tiers, so all the swap tiers can fit in a
simple bitmask, e.g. one 32-bit integer per cgroup is good enough.
Assume we allow 31 tiers. Since we can have fewer than 32 swap files,
31 tiers should be more than enough.
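To make that concrete, here is a minimal userspace sketch of the
per-cgroup state I have in mind. All names here are made up for
illustration; none of them exist in the kernel today:

#include <stdint.h>
#include <stdbool.h>

#define SWAP_NR_TIERS 31	/* bit 31 could be reserved, e.g. for "default" */

struct swap_tiers_state {
	uint32_t enrolled;	/* bit N set => this cgroup may use tier N */
};

static bool tier_enrolled(const struct swap_tiers_state *s, unsigned int tier)
{
	return tier < SWAP_NR_TIERS && (s->enrolled & (1u << tier));
}

Enrolling or removing a tier for a cgroup is then a single bit
operation, with no per-cgroup list to keep sorted.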
> inclusion/exclusion semantics at the cgroup level. The reason I decided not to
> go with it is because it lacks flexibility — it cannot express arbitrary
> ordering. As noted above, it is impossible to represent arbitrary orderings,
> which is why I chose a per-device priority strategy instead.
As I said, arbitrary order violates the swap entry LRU order. You
still haven't given me a detailed technical reason why you need
arbitrary orders other than "I want a pony".
> > the mainline Linux kernel. Merging into the mainline kernel has a
> > very high bar. How does it compare to alternative approaches in
> > terms of technical merit and complexity trade-offs?
>
> Since you seem most concerned about complexity, I have been thinking more about
> this point.
>
> 1. **Conceptual complexity**
> The idea is simply to add a swap priority list per cgroup. This is
> straightforward to understand. The more complicated part is NUMA priority
> handling — but if that turns out to be too complex, we can drop it entirely
> or adjust its semantics to reduce the complexity.
The swap priority list is a list. The swap tiers are just a set of
fewer than 32 total tiers, which can be expressed in one integer
bitmask.
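To illustrate the difference, compare the two per-cgroup states side
by side. Both definitions below are invented purely to show the
relative complexity, not actual code from either series:

#include <stdint.h>

/* per-device priority proposal: an ordered list per cgroup */
struct swap_prio_entry {
	int swap_type;			/* which swap device */
	int prio;			/* cgroup-local priority */
	struct swap_prio_entry *next;	/* must stay sorted on every update */
};

/* swap.tiers proposal: one integer, no ordering to maintain */
typedef uint32_t swap_tiers_mask;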
> 2. **Implementation complexity**
> Could you clarify from which perspective you see implementation complexity as
> problematic? I would like to know more specifically what part worries you.
The total lines of code of your 4 patch series. I expect the swap
tiers version to be much shorter, because it does not deal with
arbitrary orders.
> The `swap.tier` concept also requires mapping priorities to tiers, creating
> per-cgroup tier objects, and so forth. That means a number of supporting
> structures are needed as well. While I agree it is conceptually well-defined,
> I don’t necessarily find it simpler than the per-device priority model.
You haven't embraced the swap.tiers idea to the full extent. I do see
that it can be simpler if you follow my suggestion. You are imagining
a version that uses the swap file priority data structure to implement
the swap tiers. That is not what I have in mind. The tiers can be just
one integer representing the set of tiers the cgroup enrolls in, plus
the default. If you follow my suggestion and that design, you will
have a simpler series in the end.
> > Why would I trade a cleaner, less complex approach for a more complex
> > approach with a technical deficiency it cannot address (inverting swap
> > entry LRU ordering)?
>
> Could you elaborate on what exactly you mean by “inverting swap entry LRU order”?
> Do you mean that because of per-cgroup priority differences, entries on the
> global swap LRU list could become inconsistent when viewed from different
> cgroups?
Exactly.
> If that is the case, could you explain more concretely what problems
> such inconsistencies would cause? That would help me understand the concern.
The problem is that you pollute your fast tier with very cold swap
entry data. That is to your disadvantage, because you will need to
swap back more from the slower tier.
E.g. you have two pages: swap entry A will get 2 swap faults and swap
entry B will get 20 swap faults in the next 2 hours, so B is hotter
than A. Let's say you have to store one of them in zswap and the other
on hdd. Which one should you store in the faster zswap? Obviously swap
entry B.
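A back-of-the-envelope calculation makes the gap visible. The service
times below are assumptions picked only for illustration, not
measurements:

#include <stdio.h>

int main(void)
{
	double zswap_us = 10.0;		/* assumed zswap fault cost */
	double hdd_us = 10000.0;	/* assumed hdd fault cost */

	/* A gets 2 faults, B gets 20 faults over the same 2 hours */
	double hot_in_zswap = 20 * zswap_us + 2 * hdd_us;
	double cold_in_zswap = 2 * zswap_us + 20 * hdd_us;

	printf("B (hot) in zswap:  %.0f us total fault time\n", hot_in_zswap);
	printf("A (cold) in zswap: %.0f us total fault time\n", cold_in_zswap);
	return 0;
}

With those assumed numbers, the inverted placement costs roughly 10x
more total fault time (200020 us vs 20200 us).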
It will cause more problems when you flush the data to the lower tier.
You want to flush the coldest data first. Please read about the
history of zswap writeback and the LRU problems it encountered. The
most recent series on the mailing list, zswap storing incompressible
pages, is driven precisely by the need to preserve the swap entry LRU
order. You really should consider the effect on swap entry LRU
ordering before you design the per-cgroup swap priority.
> > From the swap file point of view, when it needs to flush some data to
> > the lower tiers, it is very hard, if possible at all, for a swap file
> > to maintain per-cgroup LRU order within a swap file.
>
> Could you explain in more detail why the flush operation is difficult in that
> case? I would like to understand what the concrete difficulty is.
>
> > It is much easier if all the swap entries in a swap file are in the
> > same LRU order tier.
>
> This is related to the same question above — I would appreciate a more
> detailed explanation because it is not yet clear to me. Why is it easy?
Because I don't need to alter the list ordering. When it enumerates
the same list of swap files, it just needs to check whether the
current swap file is excluded by the swap.tiers integer bitmask. Each
swap file can cache which tier it belongs to, for example.
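Roughly like this sketch. The struct and helper are invented to show
the shape of the check; they are not actual kernel code:

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

struct swap_file {
	struct swap_file *next;	/* global list, sorted by global priority */
	unsigned int tier;	/* cached: the tier this device belongs to */
	bool full;
};

static struct swap_file *pick_swap_file(struct swap_file *head,
					uint32_t cgroup_tiers)
{
	for (struct swap_file *si = head; si; si = si->next) {
		if (!(cgroup_tiers & (1u << si->tier)))
			continue;	/* tier not enrolled for this cgroup */
		if (!si->full)
			return si;	/* first allowed non-full device wins */
	}
	return NULL;	/* caller handles the out-of-swap case */
}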
>
> > The swap.tiers idea is not a compromise, it is a straight win. Can you
> > describe what per-cgroup per-swap-file priority can do that swap.tiers
> > cannot?
>
> As I mentioned earlier in this mail: what swap tiers cannot do is arbitrary
> ordering. If ordering is fixed globally by tiers, some workloads that want to
> consume slower swap devices first (and reserve faster devices as a safety
> backend to minimize swap failures) cannot be expressed. This kind of policy
> requires arbitrary ordering flexibility, which is possible with per-device
> priorities but not with fixed tiers.
Let's say you have fast tier A and slow tier B.
Option 1) All swap entries go through the fast tier A first. As time
goes on, the colder swap entries move to the end of tier A's LRU,
because no swap faults happen on those colder entries. If you run out
of space in A, you flush the end of A to B. If a swap fault does
happen within a relatively short period of time, it will be served by
the faster tier A.
That is a win compared to your proposal, where you want to go directly
to B and more swap faults will be served by B than in option 1).
Option 2) Just disable fast tier A in the beginning and only use B. At
some point B is full and you want to enable fast tier A. Then it
should move the LRU head from B into A. That way it still maintains
the LRU order.
Option 1) seems better than 2) because it serves more swap faults from
the faster tier A.
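Here is a toy, self-contained model of option 1), with invented names
and a tiny fixed capacity, only to show that the flush always takes
the coldest end of the fast tier first:

#include <stdio.h>

#define FAST_CAP 4

static int fast[FAST_CAP];	/* index 0 = hottest, FAST_CAP-1 = coldest */
static int nfast;

static void flush_coldest_to_slow(void)
{
	/* only the cold end of the fast tier ever reaches tier B */
	printf("flush entry %d to slow tier B\n", fast[nfast - 1]);
	nfast--;
}

static void swap_out(int entry)
{
	if (nfast == FAST_CAP)
		flush_coldest_to_slow();	/* make room, keep LRU order */
	for (int i = nfast; i > 0; i--)
		fast[i] = fast[i - 1];		/* newest entry becomes MRU */
	fast[0] = entry;
	nfast++;
}

int main(void)
{
	for (int e = 1; e <= 6; e++)
		swap_out(e);	/* the oldest entries, 1 and 2, get flushed */
	return 0;
}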
> And a possible vswap usage: if we must consider vswap (assume we can select it
> like an individual swap device), where should it be mapped in the tier model?
> (see https://lore.kernel.org/linux-mm/CAMgjq7BA_2-5iCvS-vp9ZEoG=1DwHWYuVZOuH8DWH9wzdoC00g@mail.gmail.com/)
The swap tiers do not depend on vswap, so you don't need to worry about that now.
> In my opinion, it cannot be mapped purely by service speed.
> There are indeed situations where tiering by service speed is beneficial,
> but I also believe priority-based ordering can capture the same intention
> while also covering exceptional use cases.
The above two options should be able to cover what you want.
> So, I see the per-device priority approach as more general: it can represent
> tier-based usage, but also more flexible policies that tiers alone cannot cover.
Not worthwhile to break the swap entry LRU order. We can do it in a
way that keeps the LRU order. You will be serving more swap faults
from the fast tier, which is an overall win.
> > It obviously will introduce new complexity. I want to understand the
> > reason to justify the additional complexity before I consider such an
> > approach.
>
> I think that any new concept adds complexity, whether it is “swap.tier” or
> per-device priority. If you could clarify more precisely what kind of
> complexity you are most concerned about, I would be happy to give my detailed
> thoughts in that direction.
I see no real justification to break the swap entry LRU order yet.
Will my option 1) or 2) work for your example?
The per-cgroup swap tiers integer bitmask is simpler than maintaining
a per-cgroup ordered list. It might be the same complexity in your
mind, but I do see swap tiers as the simpler one.
Chris