Message-ID: <20251115172431.83156-1-sj@kernel.org>
Date: Sat, 15 Nov 2025 09:24:30 -0800
From: SeongJae Park <sj@...nel.org>
To: Chris Li <chrisl@...nel.org>
Cc: SeongJae Park <sj@...nel.org>,
Youngjun Park <youngjun.park@....com>,
akpm@...ux-foundation.org,
linux-mm@...ck.org,
cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org,
kasong@...cent.com,
hannes@...xchg.org,
mhocko@...nel.org,
roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev,
muchun.song@...ux.dev,
shikemeng@...weicloud.com,
nphamcs@...il.com,
bhe@...hat.com,
baohua@...nel.org,
gunho.lee@....com,
taejoon.song@....com
Subject: Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
On Sat, 15 Nov 2025 07:13:49 -0800 Chris Li <chrisl@...nel.org> wrote:
> On Fri, Nov 14, 2025 at 5:22 PM SeongJae Park <sj@...nel.org> wrote:
> >
> > On Sun, 9 Nov 2025 21:49:44 +0900 Youngjun Park <youngjun.park@....com> wrote:
> >
> > > Hi all,
> > >
> > > In constrained environments, there is a need to improve workload
> > > performance by controlling swap device usage on a per-process or
> > > per-cgroup basis. For example, one might want to direct critical
> > > processes to faster swap devices (like SSDs) while relegating
> > > less critical ones to slower devices (like HDDs or Network Swap).
> > >
> > > Initial approach was to introduce a per-cgroup swap priority
> > > mechanism [1]. However, through review and discussion, several
> > > drawbacks were identified:
> > >
> > > a. There is a lack of concrete use cases for assigning a fine-grained,
> > > unique swap priority to each cgroup.
> > > b. The implementation complexity was high relative to the desired
> > > level of control.
> > > c. Differing swap priorities between cgroups could lead to LRU
> > > inversion problems.
> > >
> > > To address these concerns, I propose the "swap tiers" concept,
> > > originally suggested by Chris Li [2] and further developed through
> > > collaborative discussions. I would like to thank Chris Li and
> > > He Baoquan for their invaluable contributions in refining this
> > > approach, and Kairui Song, Nhat Pham, and Michal Koutný for their
> > > insightful reviews of earlier RFC versions.
> >
> > I think the tiers concept is a nice abstraction. I'm also interested in how
> > the in-kernel control mechanism will deal with tiers management, which is not
> > always simple. I'll try to take time to read this series thoroughly. Thank
> > you for sharing this nice work!
>
> Thank you for your interest. Please keep in mind that this patch
> series is RFC. I suspect the current series will go through a lot of
> overhaul before it gets merged in. I predict the end result will
> likely have less than half of the code resemble what it is in the
> series right now.
Sure, I believe this work will greatly evolve :)
>
> > Nevertheless, I'm curious if there are simpler and more flexible ways to achieve
> > the goal (control of swap device to use). For example, extending existing
> Simplicity is one of my primary design principles. The current design
> is close to the simplest within the design constraints.
I agree the concept is very simple. But I was thinking there _could_ be
complexity in its implementation and the required changes to existing code.
I'm especially curious about how the control logic for tiers management would
be implemented in a simple but optimal and flexible way. Hence I was lazily
thinking: what if we just let users do the control?

I'm not saying the control part of the tiers approach will be, or is,
complex or suboptimal. I haven't read this series thoroughly yet.

Even if it is at the moment, as you pointed out, I believe it will evolve into a
simple and optimal one. That's why I am willing to find time to read
this series, learn from it, and contribute back to its evolution if I find
something :)
>
> > proactive pageout features, such as memory.reclaim, MADV_PAGEOUT or
> > DAMOS_PAGEOUT, to let users specify the swap device to use. Doing such
>
> In my mind that is a later phase. No, per VMA swapfile is not simpler
> to use, nor is the API simpler to code. There are many more VMAs than
> memcgs in the system, not even the same order of magnitude. It is a higher burden
> for both user space and kernel to maintain all the per VMA mapping.
> The VMA and mmap path is much more complex to hack. Doing it on the
> memcg level as the first step is the right approach.
>
> > extension for MADV_PAGEOUT may be challenging, but it might be doable for
> > memory.reclaim and DAMOS_PAGEOUT. Have you considered this kind of option?
>
> Yes, as YoungJun points out, that has been considered here, but in a
> later phase. Borrow the link in his email here:
> https://lore.kernel.org/linux-mm/CACePvbW_Q6O2ppMG35gwj7OHCdbjja3qUCF1T7GFsm9VDr2e_g@mail.gmail.com/
Thank you for kindly sharing your opinion and the previous discussion! I
understand you believe sub-cgroup (e.g., VMA-level) control of swap tiers can
be useful, but there is no expected use case, and you are concerned about its
complexity in terms of implementation and interface. That all makes sense to
me.

Nonetheless, I'm not talking about sub-cgroup control. As I also replied [1] to
Youngjun, a memory.reclaim or DAMOS_PAGEOUT based extension would work at the
cgroup level. And from my humble perspective, the extension could be doable, at
least for DAMOS_PAGEOUT.

Hmm, I feel like my mail might read as if I'm suggesting that you use
DAMOS_PAGEOUT. The decision is yours and I will respect it, of course. I'm
saying this, though, because I am unconsciously but definitely biased as the
DAMON maintainer. ;) Again, the decision is yours and I will respect it.
[1] https://lore.kernel.org/20251115165637.82966-1-sj@kernel.org
Thanks,
SJ
[...]