linux-kernel - Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control

Message-ID: <CACePvbVkBPJBaowW0tQrL0mPqSq5kM1hNx91BX_JroM8ruS7sQ@mail.gmail.com>
Date: Mon, 17 Nov 2025 14:17:43 -0800
From: Chris Li <chrisl@...nel.org>
To: SeongJae Park <sj@...nel.org>
Cc: Youngjun Park <youngjun.park@....com>, akpm@...ux-foundation.org, linux-mm@...ck.org, 
	cgroups@...r.kernel.org, linux-kernel@...r.kernel.org, kasong@...cent.com, 
	hannes@...xchg.org, mhocko@...nel.org, roman.gushchin@...ux.dev, 
	shakeel.butt@...ux.dev, muchun.song@...ux.dev, shikemeng@...weicloud.com, 
	nphamcs@...il.com, bhe@...hat.com, baohua@...nel.org, gunho.lee@....com, 
	taejoon.song@....com
Subject: Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control

On Sat, Nov 15, 2025 at 9:24 AM SeongJae Park <sj@...nel.org> wrote:
>
> On Sat, 15 Nov 2025 07:13:49 -0800 Chris Li <chrisl@...nel.org> wrote:
> > Thank you for your interest. Please keep in mind that this patch
> > series is RFC. I suspect the current series will go through a lot of
> > overhaul before it gets merged in. I predict the end result will
> > likely have less than half of the code resemble what it is in the
> > series right now.
>
> Sure, I believe this work will greatly evolve :)

Yes, we can use any eyes that can help to review or spot bugs.

> > > Nevertheless, I'm curious if there are simpler and more flexible ways to achieve
> > > the goal (control of swap device to use).  For example, extending existing
> > Simplicity is one of my primary design principles. The current design
> > is close to the simplest within the design constraints.
>
> I agree the concept is very simple.  But, I was thinking there _could_ be
> complexity for its implementation and required changes to existing code.
> Especially I'm curious about how the control logic for tiers management would
> be implemented in a simple but optimum and flexible way.  Hence I was lazily
> thinking what if we just let users make the control.

The selection of the swap device will happen in the swap allocator. The
good news is that we just rewrote the whole swap allocator, so it is an
easier code base for us to work with than the previous swap allocator.
I haven't thought about how to implement swap file selection on the
previous allocator; I am just glad that I don't need to worry about
it.

Some feedback on the madvise API that selects one specific device.
That might sound simple, because you only need to remember one swap
file. However, the less-than-ideal part is that you are pinned to one
swap file: if that swap file is full, you are stuck. If that swap file
has been swapped off (swapoff), you are stuck.

I believe that allowing selection of a tier class, e.g. a QoS aspect
of the swap latency expectation, is a better fit for what the user
really wants to do. So I see selecting a swapfile vs. a swap tier as a
separate issue from how to select the swap device (madvise vs.
memory.swap.tiers). Your argument is that selecting a tier is more
complex than selecting a swap file directly. I agree from an
implementation point of view. However, tiers offer better flexibility
and free users from swapfile pinning, e.g. round robin across a few
swap files of the same tier is better than pinning to one swap file.
That has been shown by Baoquan's test benchmark.
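
The difference between pinning to one swap file and selecting a tier
can be illustrated with a small user-space model. This is only a
sketch, not kernel code and not the implementation in the RFC series:
the SwapDevice and TierAllocator classes, the "fast"/"slow" tier
labels, and the device names are all hypothetical, chosen for
illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class SwapDevice:
    """Hypothetical stand-in for one swap file or partition."""
    name: str
    tier: str            # illustrative tier label, e.g. "fast" or "slow"
    free_slots: int
    active: bool = True  # False models a device that has been swapoff'ed

    def try_alloc(self) -> bool:
        """Take one slot if the device is active and has space."""
        if self.active and self.free_slots > 0:
            self.free_slots -= 1
            return True
        return False


def alloc_pinned(dev: SwapDevice) -> Optional[str]:
    """Pinned selection: only the one chosen device is ever tried."""
    return dev.name if dev.try_alloc() else None


class TierAllocator:
    """Round robin over every device that belongs to an allowed tier."""

    def __init__(self, devices: Iterable[SwapDevice]) -> None:
        self.devices: List[SwapDevice] = list(devices)
        self._next = 0  # rotating start index for the next allocation

    def alloc(self, allowed_tiers: Set[str]) -> Optional[str]:
        n = len(self.devices)
        for i in range(n):
            idx = (self._next + i) % n
            dev = self.devices[idx]
            # Skip devices outside the allowed tiers, full, or swapped off.
            if dev.tier in allowed_tiers and dev.try_alloc():
                self._next = (idx + 1) % n
                return dev.name
        return None  # no device in the allowed tiers has a free slot
```

In this model, a caller pinned to a full or swapped-off device gets
None immediately, while TierAllocator.alloc({"fast"}) keeps succeeding
as long as any device in the "fast" tier still has a free slot.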

Another piece of feedback is that user space isn't the primary
initiator of swap out via madvise(MADV_PAGEOUT). A lot of swap happens
because cgroup memory usage hits the memory cgroup limit, which
triggers swap out from the memory cgroup that hit the limit. That is
an existing use case, and we need to select which swap file to use
there anyway. If we extend madvise for per-swapfile selection, the
same question must still be answered for native swap out (by the
kernel, not madvise). I can see that user space wants to set a POLICY
on a VMA: if it ever gets swapped out, what speed of swap file it goes
to. That is a follow-up after we have swapfile selection at the memory
cgroup level.

> I'm not saying tiers approach's control part implementation will, or is,
> complex or suboptimum.  I didn't read this series thoroughly yet.
>
> Even if it is at the moment, as you pointed out, I believe it will evolve to a
> simple and optimum one.  That's why I am willing to try to get time for reading
> this series and learn from it, and contribute back to the evolution if I find
> something :)
>
> >
> > > proactive pageout features, such as memory.reclaim, MADV_PAGEOUT or
> > > DAMOS_PAGEOUT, to let users specify the swap device to use.  Doing such
> >
> > In my mind that is a later phase. No, per VMA swapfile is not simpler
> > to use, nor is the API simpler to code. There are many more VMAs than
> > memcgs in the system, not even the same order of magnitude. It is a higher burden
> > for both user space and kernel to maintain all the per VMA mapping.
> > The VMA and mmap path is much more complex to hack. Doing it on the
> > memcg level as the first step is the right approach.
> >
> > > extension for MADV_PAGEOUT may be challenging, but it might be doable for
> > > memory.reclaim and DAMOS_PAGEOUT.  Have you considered this kind of options?
> >
> > Yes, as YoungJun points out, that has been considered here, but in a
> > later phase. Borrow the link in his email here:
> > https://lore.kernel.org/linux-mm/CACePvbW_Q6O2ppMG35gwj7OHCdbjja3qUCF1T7GFsm9VDr2e_g@mail.gmail.com/
>
> Thank you for kindly sharing your opinion and previous discussion!  I
> understand you believe sub-cgroup (e.g., vma level) control of swap tiers can
> be useful, but there is no expected use case, and you are concerned about its
> complexity in terms of implementation and interface.  That all makes sense to
> me.

There is a usage request from Android to protect some VMAs from ever
getting swapped out to slower tiers; otherwise it can cause jankiness.
Still, I consider cgroup swap file selection to be the more common
case.

> Nonetheless, I'm not talking about sub-cgroup control.  As I also replied [1] to
> Youngjun, memory.reclaim and DAMOS_PAGEOUT based extension would work in cgroup
> level.  And from my humble perspective, doing the extension could be doable, at
> least for DAMOS_PAGEOUT.

I would do one thing at a time and start from mem cgroup level swap
file selection, e.g. "memory.swap.tiers". However, if you are
passionate about VMA level swap file selection, please feel free to
submit patches for it.

> Hmm, I feel like my mail might be read like I'm suggesting you to use
> DAMOS_PAGEOUT.  The decision is yours and I will respect it, of course.  I'm
> saying this though, because I am uncautiously but definitely biased as DAMON
> maintainer. ;)  Again, the decision is yours and I will respect it.
>
> [1] https://lore.kernel.org/20251115165637.82966-1-sj@kernel.org

Sorry, I haven't read much about DAMOS_PAGEOUT yet. After reading
the above thread, I still don't feel I have a good sense of
DAMOS_PAGEOUT. Who is the actual user that requested that feature, and
what is the typical usage workflow and life cycle? BTW, given my
current understanding, I still think the per-VMA swap policy should
come after memory.swap.tiers.

Chris