Message-ID: <aEwBPG8mvZ06YQEA@yjaykim-PowerEdge-T330>
Date: Fri, 13 Jun 2025 19:45:16 +0900
From: YoungJun Park <youngjun.park@....com>
To: Kairui Song <ryncsn@...il.com>
Cc: Nhat Pham <nphamcs@...il.com>, linux-mm@...ck.org,
	akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org,
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
	cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
	shikemeng@...weicloud.com, bhe@...hat.com, baohua@...nel.org,
	chrisl@...nel.org, muchun.song@...ux.dev, iamjoonsoo.kim@....com,
	taejoon.song@....com, gunho.lee@....com
Subject: Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority
 mechanism on swap layer

On Fri, Jun 13, 2025 at 03:38:37PM +0800, Kairui Song wrote:
> On Fri, Jun 13, 2025 at 3:36 PM Kairui Song <ryncsn@...il.com> wrote:
> >
> > On Fri, Jun 13, 2025 at 3:11 PM YoungJun Park <youngjun.park@....com> wrote:
> > >
> > > On Thu, Jun 12, 2025 at 01:08:08PM -0700, Nhat Pham wrote:
> > > > On Thu, Jun 12, 2025 at 11:20 AM Kairui Song <ryncsn@...il.com> wrote:
> > > > >
> > > > > On Fri, Jun 13, 2025 at 1:28 AM Nhat Pham <nphamcs@...il.com> wrote:
> > > > > >
> > > > > > On Thu, Jun 12, 2025 at 4:14 AM Kairui Song <ryncsn@...il.com> wrote:
> > > > > > >
> > > > > > > On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@....com> wrote:
> > > > > > > >
> > > > > > > > From: "youngjun.park" <youngjun.park@....com>
> > > > > > > >
> > > > > > >
> > > > > > > Hi, Youngjun,
> > > > > > >
> > > > > > > Thanks for sharing this series.
> > > > > > >
> > > > > > > > This patch implements swap device selection and swap on/off propagation
> > > > > > > > when a cgroup-specific swap priority is set.
> > > > > > > >
> > > > > > > > There is one workaround to this implementation as follows.
> > > > > > > > Current per-cpu swap cluster enforces swap device selection based solely
> > > > > > > > on CPU locality, overriding the swap cgroup's configured priorities.
> > > > > > >
> > > > > > > I've been thinking about this, we can switch to a per-cgroup-per-cpu
> > > > > > > next cluster selector, the problem with current code is that swap
> > > > > >
> > > > > > What about per-cpu-per-order-per-swap-device :-? Number of swap
> > > > > > devices is gonna be smaller than number of cgroups, right?
> > > > >
> > > > > Hi Nhat,
> > > > >
> > > > > The problem is that per-cgroup makes more sense (I was advised to use
> > > > > cgroup-level locality at the very beginning of the implementation of
> > > > > the allocator on the mailing list, but it was hard to do so at that
> > > > > time): for container environments, a cgroup is a container that runs
> > > > > one type of workload, so it has its own locality. Things like systemd
> > > > > also organize different desktop workloads into cgroups. The whole
> > > > > point is about the cgroup.
> > > >
> > > > Yeah, I know what cgroup represents. Which is why I mentioned in the
> > > > next paragraph that we are still making decisions per-cgroup - we
> > > > just organize the per-cpu cache based on swap devices. This way, two
> > > > cgroups with similar/same priority list can share the clusters, for
> > > > each swapfile, in each CPU. There will be a lot less duplication and
> > > > overhead. And two cgroups with different priority lists won't
> > > > interfere with each other, since they'll target different swapfiles.
> > > >
> > > > Unless we want to nudge the swapfiles/clusters to be self-partitioned
> > > > among the cgroups? :) IOW, each cluster contains pages mostly from a
> > > > single cgroup (with some stragglers mixed in). I suppose that will be
> > > > very useful for swap on rotational drives where read contiguity is
> > > > imperative, but not sure about other backends :-?
> > > > Anyway, no strong opinions to be completely honest :) Was just
> > > > throwing out some ideas. Per-cgroup-per-cpu-per-order sounds good to
> > > > me too, if it's easy to do.
> > >
> > > Good point!
> > > I agree with your points about self-partitioned clusters and duplicated
> > > priority lists. One concern is the cost of synchronization,
> > > specifically the cost incurred when accessing the prioritized swap device.
> > > From a simple performance perspective, a per-cgroup-per-CPU implementation
> > > seems favorable - in line with the current swap allocation fastpath.
> > >
> > > It seems most reasonable to carefully compare the pros and cons of the
> > > two approaches.
> > >
> > > To summarize,
> > >
> > > Option 1. per-cgroup-per-cpu
> > > Pros: upstream fit, performance.
> > > Cons: duplicated priority state (some memory consumption cost),
> > > self-partitioned clusters
> > >
> > > Option 2. per-cpu-per-order(per-device)
> > > Pros: avoids Option 1's cons
> > > Cons: lacks Option 1's pros
> > >
> > > It's not easy to draw a definitive conclusion right away, and I should
> > > also evaluate other pros and cons that may arise during the actual
> > > implementation, so I'd like to take some time to review things in more
> > > detail and share my thoughts and conclusions in the next patch series.
> > >
> > > What do you think, Nhat and Kairui?
> >
> > Ah, I think what might fit best here is: each cgroup has a pcp
> > device list, and each device has a pcp cluster list:
> >
> > folio -> mem_cgroup -> swap_priority (maybe a more generic name is
> > better?) -> swap_device_pcp (recording only the *si per order)
> > swap_device_info -> swap_cluster_pcp (cluster offset per order)
> 
> Sorry, the truncation made this hard to read, let me try again:
> 
> folio ->
>   mem_cgroup ->
>     swap_priority (maybe a more generic name is better?) ->
>       swap_device_pcp (recording only the *si per order)
> 
> And:
> swap_device_info ->
>   swap_cluster_pcp (cluster offset per order)
> 
> And if mem_cgroup -> swap_priority is NULL,
> fallback to a global swap_device_pcp.

Thank you for the quick and kind feedback. This is a really good idea :)
For my workaround proposal, I just need to add the swap_device_pcp part
along with some refactoring.
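
Just to make sure I'm reading the proposed layout correctly, here is a very
rough sketch of how I currently picture the new structures and the lookup
path. The struct, field, and function names and the mem_cgroup hook below are
only placeholders for discussion, not the actual implementation:

#include <linux/memcontrol.h>
#include <linux/percpu.h>
#include <linux/plist.h>
#include <linux/swap.h>

/* Placeholder: per-CPU cache of the chosen device, one slot per order. */
struct swap_device_pcp {
	struct swap_info_struct *si[SWAP_NR_ORDERS];
};

/* Placeholder: hangs off mem_cgroup when a per-cgroup priority is set. */
struct swap_cgroup_priority {
	struct plist_head plist;		/* per-cgroup device priority list */
	struct swap_device_pcp __percpu *pcp;	/* per-cgroup, per-CPU device choice */
};

/* Placeholder: per-device, per-CPU next-cluster cache, one slot per order. */
struct swap_cluster_pcp {
	unsigned int offset[SWAP_NR_ORDERS];
};

/* Global fallback used when a cgroup has no priority configured. */
static DEFINE_PER_CPU(struct swap_device_pcp, global_swap_device_pcp);

/*
 * Device selection on the allocation fastpath: use the cgroup's per-CPU
 * choice when memcg->swap_priority (the field this series would add) is
 * set, otherwise fall back to the global per-CPU one.  The caller is
 * assumed to keep the CPU stable (e.g. under the allocator's local lock).
 */
static struct swap_info_struct *swap_pcp_device(struct mem_cgroup *memcg,
						int order)
{
	struct swap_cgroup_priority *prio;

	prio = memcg ? READ_ONCE(memcg->swap_priority) : NULL;
	if (!prio)
		return this_cpu_ptr(&global_swap_device_pcp)->si[order];

	return this_cpu_ptr(prio->pcp)->si[order];
}

Does this roughly match what you had in mind?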

As for the naming of swap_cgroup_priority...
I adopted the term "swap_cgroup_priority" based on the
functionality I'm aiming to implement.
Here are some alternatives that immediately come to mind
(as I said, just off the top of my head):
* swap_tier, swap_order, swap_selection, swap_cgroup_tier, swap_cgroup_order,
swap_cgroup_selection...

I'll try to come up with a more suitable conceptual name as I continue working
on the patch.
In the meantime, I'd appreciate any suggestions or feedback you may have.

Thanks again for your feedback and suggestions.
