Message-ID: <aRU1abZaS+c1rK4R@yjaykim-PowerEdge-T330>
Date: Thu, 13 Nov 2025 10:33:29 +0900
From: YoungJun Park <youngjun.park@....com>
To: Chris Li <chrisl@...nel.org>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org, kasong@...cent.com,
hannes@...xchg.org, mhocko@...nel.org, roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev, muchun.song@...ux.dev,
shikemeng@...weicloud.com, nphamcs@...il.com, bhe@...hat.com,
baohua@...nel.org, gunho.lee@....com, taejoon.song@....com
Subject: Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
On Wed, Nov 12, 2025 at 05:34:05AM -0800, Chris Li wrote:
Hello Chris :)
> Thanks for the patches. I notice that your cover letter does not have
> [0/3] on it. One tool I found useful is b4, which sends out
> patches in series. Just for your consideration, it is not an ask; I
> can review patches not sent out from b4 just fine.
I edited the cover letter subject by hand and made a mistake there.
Thanks for the tip.
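To avoid this next time, I will let the tooling generate the numbering
instead of editing it by hand. A minimal sketch of the flow I plan to
use (standard git/b4 usage, nothing specific to this series):

  $ git format-patch -3 --cover-letter -v2 -o outgoing/
  $ b4 send

git format-patch --cover-letter emits the [PATCH 0/3] cover
automatically; b4 send assumes b4's prep/send contributor workflow.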
> On Sun, Nov 9, 2025 at 4:50 AM Youngjun Park <youngjun.park@....com> wrote:
> >
> > Hi all,
> >
> > In constrained environments, there is a need to improve workload
> > performance by controlling swap device usage on a per-process or
> > per-cgroup basis. For example, one might want to direct critical
> > processes to faster swap devices (like SSDs) while relegating
> > less critical ones to slower devices (like HDDs or Network Swap).
> >
> > The initial approach was to introduce a per-cgroup swap priority
> > mechanism [1]. However, through review and discussion, several
> > drawbacks were identified:
> >
> > a. There is a lack of concrete use cases for assigning a fine-grained,
> > unique swap priority to each cgroup.
> > b. The implementation complexity was high relative to the desired
> > level of control.
> > c. Differing swap priorities between cgroups could lead to LRU
> > inversion problems.
> >
> > To address these concerns, I propose the "swap tiers" concept,
> > originally suggested by Chris Li [2] and further developed through
> > collaborative discussions. I would like to thank Chris Li and
> > He Baoquan for their invaluable contributions in refining this
> > approach, and Kairui Song, Nhat Pham, and Michal Koutný for their
> > insightful reviews of earlier RFC versions.
> >
> > Concept
> > -------
> > A swap tier is a grouping mechanism that assigns a "named id" to a
> > range of swap priorities. For example, all swap devices with a
> > priority of 100 or higher could be grouped into a tier named "SSD",
> > and all others into a tier named "HDD".
> >
> > Cgroups can then select which named tiers they are permitted to use for
> > swapping via a new cgroup interface. This effectively restricts a
> > cgroup's swap activity to a specific subset of the available swap
> > devices.
> >
> > Proposed Interface
> > ------------------
> > 1. Global Tier Definition: /sys/kernel/mm/swap/tiers
> >
> > This file is used to define the global swap tiers and their associated
> > minimum priority levels.
> >
> > - To add tiers:
> > Format: + 'tier_name':'prio'[[,|' ']'tier_name2':'prio']...
> > Example:
> > # echo "+ SSD:100,HDD:2" > /sys/kernel/mm/swap/tiers
>
> I think a lot of this documentation nature of the cover letter should
> move into a kernel document commit. Maybe
> Documentation/mm/swap_tiers.rst
I will create a Documentation file based on what is mentioned here.
> Another suggestion is use "+SSD:100,+HDD:2,-SD" that kind of flavor
> similar to "cgroup.subtree_control" interface, which allows adding or
> removing cgroups. That way you can add and remove in one line action.
Your suggested format is more familiar. I have no objections and will
change it accordingly.
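To make it concrete, a single write could then add and remove tiers in
one action, just like cgroup.subtree_control (reusing the string from
your suggestion):

  # echo "+SSD:100,+HDD:2,-SD" > /sys/kernel/mm/swap/tiers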
> >
> > There are several rules for defining tiers:
> > - Priority ranges for tiers must not overlap.
>
> We can add that we suggest allocating a higher priority range for
> faster swap devices. That way more swap page faults will likely be
> served by faster swap devices.
It would be good to explicitly state this in the Documentation.
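As a concrete illustration for the Documentation: with the SSD:100,HDD:2
example above, a device's activation priority determines its tier
(device paths below are made up; swapon -p sets the swap priority):

  # swapon -p 150 /dev/nvme0n1p2    <- priority >= 100, "SSD" tier
  # swapon -p 10 /dev/sdb2          <- priority below 100, "HDD" tier

Documenting it this way also shows why faster devices should get the
higher priority range: more swap faults will be served by them first.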
> > - The combination of all defined tiers must cover the entire valid
> > priority range (DEF_SWAP_PRIO to SHRT_MAX) to ensure every swap device
> > can be assigned to a tier.
> > - A tier's prio value is its inclusive lower bound,
> > covering priorities up to the next tier's prio.
> > The highest tier extends to SHRT_MAX, and the lowest tier extends to DEF_SWAP_PRIO.
> > - If the specified tiers do not cover the entire priority range,
> > the priority of the tier with the lowest specified priority value
> > is set to SHRT_MIN.
> > - The total number of tiers is limited.
> >
> > - To remove tiers:
> > Format: - 'tier_name'[[,|' ']'tier_name2']...
> > Example:
> > # echo "- SSD,HDD" > /sys/kernel/mm/swap/tiers
>
> See above, make the '-SSD, -HDD' similar to the "cgroup.subtree_control"
Ack, as I said in the comment above. Thanks again for the suggestion.
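With that flavor, removal also becomes a one-line action, e.g.:

  # echo "-SSD,-HDD" > /sys/kernel/mm/swap/tiers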
> > Note: A tier cannot be removed if it is currently in use by any
> > cgroup or if any active swap device is assigned to it. This acts as
> > a reference count to prevent disruption.
> >
> > - To show current tiers:
> > Reading the file displays the currently configured tiers, their
> > internal index, and the priority range they cover.
> > Example:
> > # echo "+ SSD:100,HDD:2" > /sys/kernel/mm/swap/tiers
> > # cat /sys/kernel/mm/swap/tiers
> > Name   Idx   PrioStart   PrioEnd
> >         0
> > SSD     1    100         32767
> > HDD     2    -1          99
> >
> > - `Name`: The name of the tier. The unnamed entry is a default tier.
> > - `Idx`: The internal index assigned to the tier.
> > - `PrioStart`: The starting priority of the range covered by this tier.
> > - `PrioEnd`: The ending priority of the range covered by this tier.
> >
> > Two special tiers are predefined:
> > - "": Represents the default inheritance behavior in cgroups.
> This belongs to the memory.swap.tiers section.
> "" is not a real tier's name. It is just a wide cast to refer to all tiers.
I will treat it as a purely logical tier that is not exposed to
users, and handle it that way in the code as well.
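For reference, the cgroup side would then look something like this
(memory.swap.tiers is the interface name from this thread; the cgroup
path and the exact read format are illustrative assumptions on my
part):

  # echo "+SSD" > /sys/fs/cgroup/workload/memory.swap.tiers
  # cat /sys/fs/cgroup/workload/memory.swap.tiers
  SSD

An empty setting would keep the default behavior of inheriting the
parent's tier selection.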
> > - "zswap": Reserved for zswap integration.
>
> One thing I realize is that we might need each swap tier to have
> a matching zswap tier. Otherwise, when we refer to zswap, there is no
> way for the cgroup to select which backing swapfile this zswap
> uses for allocating the swap entry.
From the perspective of per-cgroup swap control, if a zswap tier is
defined and a cgroup selects it, that selection only decides whether
the cgroup uses zswap at all. However, since the zswap backend does
not know which tier it is linked to, there can be a mismatch between
the zswap tier (1), the backing storage (2), and possibly another
layer (3). This can lead to a contradiction where the tier the cgroup
selected may or may not correspond to the tier of the actual backing
device.
Is this the correct understanding?
> We can avoid this complexity by providing a dedicated ghost swapfile,
> which only zswap can use to allocate swap entries.
From what I understood when you previously mentioned this concept,
the “ghost swapfile” is not a real swap device. It exists conceptually
so that zswap can operate as if a swap device were present, while in
reality the compressed swap entries are managed by zswap itself.
(Today, zswap needs an actual swap device to back the entries it
compresses.)
Considering both points above, could you please clarify the intended
direction?
Are you suggesting removing the zswap tier entirely, or defining a
specific way to manage it?
I would appreciate a bit more explanation on how you envision the zswap
tier being handled.
Best Regards,
Youngjun Park