Message-ID: <aLqDkpGr4psGFOcF@yjaykim-PowerEdge-T330>
Date: Fri, 5 Sep 2025 15:30:42 +0900
From: YoungJun Park <youngjun.park@....com>
To: Chris Li <chrisl@...nel.org>
Cc: Michal Koutný <mkoutny@...e.com>,
akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
muchun.song@...ux.dev, shikemeng@...weicloud.com,
kasong@...cent.com, nphamcs@...il.com, bhe@...hat.com,
baohua@...nel.org, cgroups@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, gunho.lee@....com,
iamjoonsoo.kim@....com, taejoon.song@....com,
Matthew Wilcox <willy@...radead.org>,
David Hildenbrand <david@...hat.com>,
Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
cgroup-based swap priority
> Yes, that works. I would skip the "add" keyword.
> Also I notice that we can allow " " in place of "," as a separator as well.
Yes, supporting both " " and "," sounds convenient.
> Maybe instead of "remove hdd", just "-hdd" which is similar to how to
> operate on swap.tiers.
Agreed, "+" for add and "-" for remove is simpler.
> Oh, you mean the tier not listed in the above will be deleted.
> I prefer the above option 1) then.
That makes sense. Option 1) looks simplest overall.
> I don't understand what is this "removing" and "in stage"...
> What is it trying to solve?
That came from an idea to pre-add a new tier before removing another.
But I now think returning an error on overlap is simpler, so staging is
not needed.
> What do you mean by "visible"? Previous discussions haven't defined
> what is visible vs invisible.
By “visible” I meant a staged state becoming active. I realize the term
was confusing, and as explained above, it is not needed anyway.
> Trigger event to notify user space? Who consumes the event and what
> can that user space tool do?
I agree, sending user events is unnecessary. It is simpler to let tiers merge or
be recreated and let the allocator handle it.
> If you remove the
> swap tier. the range of that tier merges to the neighbour tier. That
> way you don't need to worry about the swap file already having an
> entry in this tier you swap out.
Should the configured mask simply be left as-is, even if (a) the same
tier name is later reintroduced at a different order (e.g., first → third),
or (b) a merge causes the cgroup to use a lower tier it did not
explicitly select?
My reading is that leaving the mask unchanged is acceptable, so this
concern may be moot. If you consider it a non-issue, I am happy to
follow the simpler direction you suggested.
> If the fast path fails, it will go through the slow path. So the slow
> path is actually a catch all.
I think my intention may not have come across clearly. I was not trying
to propose a new optimization, but to describe a direction that requires
almost no changes from the current behavior. Looking back, I realize the
ideas I presented may not have looked like small adjustments, even
though that was my intent.
As a simple approach I had in mind (a rough sketch follows the list below):
- Fastpath can just skip clusters outside the selected tier.
- Slowpath naturally respects the tier bitmask.
- The open point is how to treat the per-CPU cluster cache.
If we put clusters back into the cache, tiered and non-tiered cgroups
may see low-priority clusters; if we skip insertion, tiered cgroups may
lose the caching benefit.
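Here is that rough sketch. Everything in it is hypothetical: a
memcg->swap_tier_mask field and a cluster_tier_bit() helper do not
exist today, this is only meant to illustrate the fast-path skip:

    /*
     * Illustrative only: swap_tier_mask and cluster_tier_bit() are
     * hypothetical names, not existing kernel interfaces.
     */
    static bool swap_cluster_allowed(struct mem_cgroup *memcg,
                                     struct swap_cluster_info *ci)
    {
            unsigned long allowed = READ_ONCE(memcg->swap_tier_mask);

            /* Empty mask means "no tier selected": keep current behavior. */
            if (!allowed)
                    return true;

            /* Skip clusters backed by a device in a non-selected tier. */
            return allowed & BIT(cluster_tier_bit(ci));
    }

The slow path would apply the same test while walking the
priority-ordered device list, which is why it naturally acts as the
catch-all.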
Chris, do you have another workable approach in mind here, or is this
close to what you were also thinking?
> In my original proposal, if a parent removes ssd then the child will
> automatically get it as well.
I now see you mean the effective mask is built by walking the cgroup
hierarchy, with local settings taking precedence, so the nearest local
setting wins. Conceptually this yields two data structures: a
local-setting mask and a runtime/effective mask. Does the above capture
your intention, or is there anything else I should mention?
A few thoughts aligned with the above:
- There is no separate “default setting” knob to control inheritance.
- If unset locally, the effective value is derived by walking the cgroup
hierarchy from top to bottom.
- Once set locally, the local setting overrides everything inherited.
- There is no special “default tier” when tiers are absent.
- If nothing is set anywhere in the hierarchy, the initial mask is treated as
fully set at configuration time (selecting all tiers; global swap behavior).
However, reading the local file should return an empty value to indicate
“not set”.
One idea is to precompute the effective mask at interface write time, since
writes are rarer than swap I/O. You may have intended runtime recomputation
instead—which approach do you prefer? This implies two masks: a local
configuration mask and a computed effective mask.
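To make the two masks concrete, here is a minimal sketch of the
write-time precomputation, assuming hypothetical tiers_local,
tiers_local_set and tiers_effective fields on struct mem_cgroup:

    /*
     * Illustrative only: the tiers_* fields and SWAP_TIERS_ALL are
     * hypothetical.  The nearest ancestor with a local setting wins;
     * if nothing is set anywhere, fall back to all tiers (current
     * global swap behavior).
     */
    static void memcg_swap_tiers_recompute(struct mem_cgroup *memcg)
    {
            struct mem_cgroup *iter;
            unsigned long mask = SWAP_TIERS_ALL;

            for (iter = memcg; iter; iter = parent_mem_cgroup(iter)) {
                    if (iter->tiers_local_set) {
                            mask = iter->tiers_local;
                            break;
                    }
            }

            WRITE_ONCE(memcg->tiers_effective, mask);
    }

On a write to memory.swap.tier, the same recomputation would also have
to be run for the descendants (e.g. via css_for_each_descendant_pre())
so that children pick up the new nearest local setting.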
Below is a spec summary I drafted from our discussion so far, as a note
and to check alignment.
(Some points in this reply remain unresolved, and there are additional TBD items.)
* **Tier specification**
- The priority range >= 0 is divided into intervals, each identified by
a tier name. The full 0+ range must be covered.
- NUMA autobind and tiering are mutually exclusive.
- Max number of tiers = MAX_SWAPFILES (single swap device can also be
assigned as a tier).
- A tier holds references when swap devices are assigned to its
priority range. Removal is only possible after swapoff clears the
references.
- Cgroups referencing a tier do not hold references. If the tier is
removed, the cgroup’s configured mask is dropped. (TBD)
- Each tier has an order (tier1 is highest priority) and an internal
bit for allocation logic.
- Until one is set, there is no default tier. (It might be used
conceptually inside the implementation, but it is not exported.)
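As a note, the points above roughly map onto a structure like the
following (purely illustrative; the struct and field names are
hypothetical):

    /* Illustrative only: all names below are hypothetical. */
    struct swap_tier {
            char            name[16];    /* e.g. "ssd", "hdd", "net" */
            int             prio_start;  /* covers this priority and above,
                                            up to the next tier's start */
            unsigned int    order;       /* 1 = highest-priority tier */
            unsigned int    bit;         /* bit used in cgroup tier masks */
            refcount_t      ref;         /* held by swap devices in range */
    };

    /* At most MAX_SWAPFILES tiers can exist at once. */
    static struct swap_tier swap_tiers[MAX_SWAPFILES];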
* **/sys/kernel/mm/swap/tiers**
- Read/write interface. Multiple entries allowed, delimiters: space or
comma.
- Format:
    + "tier name":priority  → add (priority and above)
    - "tier name"           → remove
  Note: a space must follow "+" or "-" before the tier name.
- Edge cases:
    * If not all ranges are specified: input is accepted, but cgroups
      cannot use incomplete ranges. (TBD)
      e.g. echo "hdd:50" > /sys/kernel/mm/swap/tiers (0~49 not specified)
    * Overlap with an existing range: removal fails until all swap
      devices in that range are swapped off.
- Output is sorted by tier order, showing each tier's name, bit, and
  priority range. (Explicitly printing the tier order as well may be
  more user-friendly. (TBD))
* **Cgroup interface**
- New files (under memcg): memory.swap.tier, memory.swap.tier.effective
    * Read/write: memory.swap.tier returns the local named set exactly
      as configured (cpuset-like "+/-" tokens; space/comma preserved).
    * Read-only: memory.swap.tier.effective is computed from the cgroup
      hierarchy, with the nearest local setting taking precedence
      (similar to cpuset.effective). (TBD)
    * Example (named-set display, cpuset-like style)
      Suppose tier order:
          ssd (tier1), hdd (tier2), hdd2 (tier3), net (tier4)
      Input:
          echo "ssd-hdd, net" > memory.swap.tier
      Readback:
          cat memory.swap.tier
          ssd-hdd, net    # exactly as configured (named set)
          cat memory.swap.tier.effective
          ssd-hdd, net    # same format; inherited/effective result
- Inheritance: effective mask built by walking from parent to child,
with local settings taking precedence.
- Mask computation: precompute at interface write-time vs runtime
recomputation. (TBD; preference?)
- Syntax modeled after cpuset:
      echo "ssd-hdd,net" > memory.swap.tier
  Here “-” specifies a range and must respect tier order. Items
  separated by “,” do not need to follow order and may overlap; they
  are handled appropriately (similar to cpuset semantics).
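For the range form, here is a minimal sketch of expanding a
"first-last" range (given as tier orders) into a bitmask;
tier_bit_by_order() is a hypothetical lookup from order to mask bit:

    /*
     * Illustrative only: ranges that do not follow tier order are
     * rejected, per the rule above.
     */
    static int tier_range_to_mask(unsigned int first, unsigned int last,
                                  unsigned long *mask)
    {
            unsigned int order;

            if (first > last)       /* e.g. "hdd-ssd" is rejected */
                    return -EINVAL;

            for (order = first; order <= last; order++)
                    *mask |= BIT(tier_bit_by_order(order));

            return 0;
    }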
* **Swap allocation**
- Simple, workable implementation (TBD; to be revisited with
measurements).
I tried to summarize the discussion and my inline responses as clearly as
possible. If anything is unclear or I misinterpreted something, please
tell me and I’ll follow up promptly to clarify. If you have comments, I
will be happy to continue the discussion. Hopefully this time our
alignment will be clearer.
Best regards,
Youngjun Park