linux-kernel - Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aL3Dav4RLvtLliYC@yjaykim-PowerEdge-T330>
Date: Mon, 8 Sep 2025 02:39:54 +0900
From: YoungJun Park <youngjun.park@....com>
To: Chris Li <chrisl@...nel.org>
Cc: Michal Koutný <mkoutny@...e.com>,
	akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org,
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
	muchun.song@...ux.dev, shikemeng@...weicloud.com,
	kasong@...cent.com, nphamcs@...il.com, bhe@...hat.com,
	baohua@...nel.org, cgroups@...r.kernel.org, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, gunho.lee@....com,
	iamjoonsoo.kim@....com, taejoon.song@....com,
	Matthew Wilcox <willy@...radead.org>,
	David Hildenbrand <david@...hat.com>,
	Kairui Song <ryncsn@...il.com>, Wei Xu <weixugc@...gle.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority

Hi, Chris Li 

Thank you for your thoughtful and quick feedback.

> > If you remove the
> > swap tier. the range of that tier merges to the neighbour tier.  That
> > way you don't need to worry about the swap file already having an
> > entry in this tier you swap out.
>
> Should the configured mask simply be left as-is,
> even if (a) the same key is later reintroduced with a different order (e.g.,
> first → third), or (b) a merge causes the cgroup to use a lower tier it did not

Let me clarify my concern with a concrete example.
Suppose:
1. SSD → tier "A" (31–40), HDD → "B" (21–30), HDD2 → "C" (10–20), HDD3 → "D" (0–9)
2. A cgroup uses tier "A"
3. SSD is swapped off → tier "A" becomes a hole
4. Tier "D" is removed
5. Tier "A" is reassigned to range (0–9)
6. Then a cgroup configured with "A" cannot actually use "A" (0~9)
7. Later a new tier "E" is added and assigned (31–40)
8. A cgroup now configured with "E" refers to the same numeric range (31–40),
   but the meaning has changed compared to when it used "A".

This feels unintuitive. I would prefer invalidating the mask if the referenced
tier is removed, so stale references don’t silently point to a different tier.

> I think my intention may not have come across clearly. I was not trying
> to propose a new optimization, but to describe a direction that requires
> almost no changes from the current behavior. Looking back, I realize the
> ideas I presented may not have looked like small adjustments, even
> though that was my intent.
>
> I see. We don't need to jump to implementation details yet. I can help
> you resolve the swap allocator internals as well. Let's have the
> user's visible behavior settle down first.

Ack. Let’s settle user-visible semantics first and defer allocator internals.
We can revisit per-CPU cluster cache handling as a lower-priority topic when we
move to patchwork.

> I talked to Wei about swap tiers a bit. Add him to the CC as well. He
> made me realize that we need two level things in the cgroup
> "swap.tiers".
> ...
> For the operation, each tier will need two bits, including the
> default. One bit select this timer off, one bit select this tier on.
> e.g. we have 16 tiers including the default, then all 16 tiers take up 32 bits.

My understanding is:

Per tier (2-bit state)
- `+` → always on (bit 10)
- `-` → always off (bit 01)
- missing → inherit from parent (bit 00)
- `11` is invalid

Default tier
- `+` means inherit parent as the base
- `-` means start from zero (ignore parent)
- missing means (this is the part I want to confirm) nothing?

So in my view “default” is an **inheritance control knob**, whereas in your
explanation “default” is also a **special tier** with its own 2-bit state.
Is that the right reading?

If my understanding is correct, I’m also happy to adopt the interface format
you proposed.

Over the weekend I kept thinking about it, and your proposal looks like a
more flexible interface. It also has clear similarities to how cgroup
controllers are added, so the format seems acceptable to me.

I have one remaining concern about cgroup semantics.
The inheritance and resource model we’re discussing seems to diverge
somewhat from existing cgroup v2 conventions. Since we’ve aligned that
this effectively acts as QoS control, it also makes me wonder whether we
should proactively propose a doc update to the “Resource Distribution
Models” section so the concept is explicitly covered. This may be me
being overcautious, so I’d appreciate your view.

> Wei also raises one very important point. Because zswap is not tight
> to a swap device. We might want a predefined tier bit to describe
> zswap. e.g. the first tier bit is always given to zswap and the tier
> name is always zswap, the priority range can be assigned from
> mm/swap/tiers interface.

Ack. Reserving a predefined tier bit for zswap makes sense.

As a passing thought (not a strong proposal): a few common tiers (e.g., zswap,
ssd, hdd, remote) could be predefined and non-removable, with users inserting
custom ranges between or apart from them. For example, if an SSD tier is
predefined, `swapon` for SSD devices could be limited to that tier—this would
align with grouping by service speed and nudge users toward sensible configs.

> > * **Tier specification**
> >   - Priority >= 0 range is divided into intervals, each identified by a
> >     tier name. The full 0+ range must be covered.
>
> Not necessarily true if we allow removal of the tier and generate
> holes removed range as we discussed above. Unless I understand the
> previous hole idea incorrectly.

Ack. I prefer allowing holes, so we don’t need to enforce covering the full
range simply. 
(I had considered usage making full-range coverage coexist with holes, 
but on reflection that doesn’t seem necessary. complicated)

> >   - Each tier has an order (tier1 is highest priority) and an internal
>
> The order you mean swap device priority order or the tier internal bit order?

I meant the order implied by the priority ranges. In the interface I suggested,
the `-` operator specifies ordered ranges, so a notion of tier order matters.
With your format this may not be needed or not that important.

> >   - Until it is set, there is no default tier.
>
> Does that mean you can't do incremental add or incremental subtract tiers?

> >     (may internally conceptually used? but not exported)
>
> My suggestion now is "swap.tiers" is an operation rather than a
> bitmask. It can include "default", Each tier can select on or off or
> missing so 3 operation states. "default" tier has no name, if
> specified, must be listed as the first in "swap.tiers"

When I said “default tier,” I meant a conceptual tier that covers the full
priority range when nothing is specified. From your reply, your “default”
sounds closer to a *default value* (inheritance control) rather than a
standalone tier. Did I get that right?

> >     Note: a space must follow "+" or "-" before the tier name.
> >   - Edge cases:
> >       * If not all ranges are specified: input is accepted, but cgroups
> >         cannot use incomplete ranges. (TBD)
> >         e.g) echo "hdd:50" > /sys/kernel/mm/swap/tiers. (0~49 not specifeid)
>
> Because removing the tier will generate holes in the priority range
> anyway. 0-49 is not specified in the same as if a tier is there
> previously then gets removed.

As discussed above, we’re allowing holes, so we can accept inputs that don’t
cover the full range.

> >       * Overlap with existing range: removal fails until all swap
> >         devices in that range are swapped off.
>
> Specifically a new tier priority landing in the middle of a tier
> range, the new tier will split the range with the existing one.

If swapoff is complete but removal has not occurred and a new tier comes in,
we can allow splitting. If a tier reference is still held, splitting should not
be allowed. A corner case: a tier spans 50–100 but only priorities 55 and 60
have active swap; inserting a split at 70 (no active refs) — to keep rules
simple, I’d still **not** accept the split while any references exist anywhere
in the original range.

> > * **Cgroup interface**
> >   - New files (under memcg): memory.swap.tier, memory.swap.tier.effective
>
> I don't think we need two interface files. We can live with one just
> "memory.swap.tiers"
> We can list the local one first then effective one on the second line
> or separate with "#"

Ack. One file is simpler; show local then effective. For now, a newline
separator looks clearer than “#”.

> >   - Syntax modeled after cpuset:
> >       echo "ssd-hdd,net" > memory.swap.tier
>
> Need discussion for incremental subtract. That syntax is incremental add.

I think the format you suggest (+, -)
is appropriate from a flexibility perspective.

> > * **Swap allocation**
> >   - Simple, workable implementation (TBD; to be revisited with
> >     measurements).

Ack.

Best regards,
Youngjun Park