Message-ID: <aKC+EU3I/qm6TcjG@yjaykim-PowerEdge-T330>
Date: Sun, 17 Aug 2025 02:21:21 +0900
From: YoungJun Park <youngjun.park@....com>
To: Chris Li <chrisl@...nel.org>
Cc: Michal Koutný <mkoutny@...e.com>,
	akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org,
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
	muchun.song@...ux.dev, shikemeng@...weicloud.com,
	kasong@...cent.com, nphamcs@...il.com, bhe@...hat.com,
	baohua@...nel.org, cgroups@...r.kernel.org, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, gunho.lee@....com,
	iamjoonsoo.kim@....com, taejoon.song@....com,
	Matthew Wilcox <willy@...radead.org>,
	David Hildenbrand <david@...hat.com>,
	Kairui Song <ryncsn@...il.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority

On Fri, Aug 15, 2025 at 08:10:09AM -0700, Chris Li wrote:
> Hi Michal and YoungJun,

First of all, thank you for sharing your thoughts. I really appreciate the
detailed feedback. I have many points I would like to think through and
discuss as well. For now, let me give some quick feedback, and I will follow
up with more detailed responses after I have had more time to reflect.

> I am sorry for the late reply. I have briefly read through the patches
> series the overall impression:
> 1)  Priority is not the best way to select which swap file to use per cgroup.
> The priority is assigned to one device, it is a per swap file local
> change. The effect you want to see is actually a global one, how this
> swap device compares to other devices. You actually want  a list at
> the end result. Adjusting per swap file priority is backwards. A lot
> of unnecessary usage complexity and code complexity come from that.
> 2)  This series is too complicated for what it does.

You mentioned that the series is too complicated for what it does. I
understand your concern. I have spent quite a lot of time thinking about this
topic, and I chose the priority approach because it reuses an existing
concept and, in doing so, gives more flexibility and extensibility.

Where you see unnecessary functionality, I tend to view it as providing more
degrees of freedom and flexibility. In my view, the swap tier concept can be
expressed as a subset of the per-cgroup priority model.

> I have a similar idea, "swap.tiers," first mentioned earlier here:
> https://lore.kernel.org/linux-mm/CAF8kJuNFtejEtjQHg5UBGduvFNn3AaGn4ffyoOrEnXfHpx6Ubg@mail.gmail.com/
> 
> I will outline the line in more detail in the last part of my reply.
> 
> BTW, YoungJun and Michal, do you have the per cgroup swap file control
> proposal for this year's LPC? If you want to, I am happy to work with
> you on the swap tiers topic as a secondary. I probably don't have the
> time to do it as a primary.

I have not submitted an LPC proposal. If it turns out to be necessary,
I agree it could be a good idea, and I truly appreciate your offer to
work together on it. From my understanding, though, the community has
so far received this patchset positively, so I hope the discussion can
continue on the list and the series can eventually be accepted.
 
> OK. I want to abandon the weight-adjustment approach. Here I outline
> the swap tiers idea as follows. I can probably start a new thread for
> that later.
> 
> 1) No per cgroup swap priority adjustment. The swap file priority is
> global to the system.
> Per cgroup swap file ordering adjustment is bad from the LRU point of
> view. We should make the swap file ordering matching to the swap
> device service performance. Fast swap tier zram, zswap store hotter
> data, slower tier hard drive store colder data.  SSD in between. It is
> important to maintain the fast slow tier match to the hot cold LRU
> ordering.

Regarding your first point about swap tiers: I would like to study this part
a bit more carefully. If you could share some additional explanation, that
would be very helpful for me.
 
> 2) There is a simple mapping of global swap tier names into priority range
> The name itself is customizable.
> e.g. 100+ is the "compress_ram" tier. 50-99 is the "SSD" tier, 0-55 is
> the "hdd" tier.
> The detailed mechanization and API is TBD.
> The end result is a simple tier name lookup will get the priority range.
> By default all swap tiers are available for global usage without
> cgroup. That matches the current global swap on behavior.
> 
> 3) Each cgroup will have "swap.tiers" (name TBD) to opt in/out of the tier.
> It is a list of tiers including the default tier who shall not be named.
> 
> Here are a few examples:
> e.g. consider the following cgroup hierarchy a/b/c/d, a as the first
> level cgroup.
> a/swap.tiers: "- +compress_ram"
> it means who shall not be named is set to opt out,  optin in
> compress_ram only, no ssd, no hard.
> Who shall not be named, if specified, has to be the first one listed
> in the "swap.tiers".
> 
> a/b/swap.tiers: "+ssd"
> For b cgroup, who shall not be named is not specified, the tier is
> appended to the parent "a/swap.tiers". The effective "a/b/swap.tiers"
> become "- +compress_ram +ssd"
> a/b can use both zswap and ssd.
> 
> Every time the who shall not be named is changed, it can drop the
> parent swap.tiers chain, starting from scratch.
> 
> a/b/c/swap.tiers: "-"
> 
> For c, it turns off all swap. The effective "a/b/c/swap.tiers" become
> "- +compress_ram +ssd -" which simplify as "-", because the second "-"
> overwrites all previous optin/optout results.
> In other words, if the current cgroup does not specify the who shall
> not be named, it will walk the parent chain until it does. The global
> "/" for non cgroup is on.
> 
> a/b/c/d/swap.tiers: "- +hdd"
> For d, only hdd swap, nothing else.
> 
> More example:
>  "- +ssd +hdd -ssd" will simplify to: "- +hdd", which means hdd only.
>  "+ -hdd": No hdd for you! Use everything else.
> 
> Let me know what you think about the above "swap.tiers"(name TBD) proposal.

Thank you very much for the detailed description of the "swap.tiers" idea.
As I understand it, the main idea is to separate swap devices by speed,
assign a suitable priority range for each, and then make it easy for users to
include or exclude tiers. I believe I have understood the concept clearly.
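
To check my reading of the inheritance rules, here is a rough sketch of how
I understand the effective "swap.tiers" value would be resolved. It is only
an illustration in Python, not a proposed implementation: the function name,
the fixed ALL_TIERS tuple, and the token handling are my own assumptions
based on the examples quoted above.

```python
# Illustrative only: tier names and the token grammar follow the examples
# quoted above; nothing here is an existing kernel interface.
ALL_TIERS = ("compress_ram", "ssd", "hdd")


def effective_tiers(chain):
    """Resolve the enabled tiers for a cgroup.

    chain: the swap.tiers strings from the top-level cgroup down to the
    target cgroup, e.g. ["- +compress_ram", "+ssd"] for a/b.
    """
    tokens = []
    for setting in chain:
        parts = setting.split()
        # A leading bare "+" or "-" is the unnamed default tier; setting it
        # drops whatever was inherited from the parent chain.
        if parts and parts[0] in ("+", "-"):
            tokens = []
        tokens.extend(parts)

    enabled = set(ALL_TIERS)  # global "/" default: every tier is on
    for tok in tokens:
        if tok == "+":
            enabled = set(ALL_TIERS)
        elif tok == "-":
            enabled = set()
        elif tok.startswith("+"):
            enabled.add(tok[1:])
        elif tok.startswith("-"):
            enabled.discard(tok[1:])
        else:
            raise ValueError(f"unexpected token: {tok!r}")
    return enabled
```

With the a/b/c/d hierarchy from your examples, this gives compress_ram only
for a, compress_ram plus ssd for a/b, nothing for a/b/c, and hdd only for
a/b/c/d. Please correct me if this does not match what you have in mind.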

I agree that operating with tiers is important. At the same time, as I
mentioned earlier, I believe that managing priorities in a way that reflects
tiers can also achieve the intended effect.

I have also been thinking about a possible compromise. If the interface is
intended to make tiers visible to users in the way you describe, then mapping
priority ranges to tiers (as you propose) makes sense. Users would still have
the flexibility to define ordering, while internally we could maintain the
priority list model I suggested. I wonder what you think about such a hybrid
approach. 
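
As a very rough illustration of what I mean by the hybrid approach, the
sketch below keeps a single priority-ordered device list and uses the tier
names only as a filter on it. Everything in it is hypothetical: the device
names, the priority values, the range boundaries (I picked non-overlapping
ranges purely for the example), and the function names.

```python
# Hypothetical tier -> inclusive priority range mapping; the boundaries are
# assumptions chosen for this example, not part of any agreed interface.
TIER_RANGES = {
    "compress_ram": (100, 32767),
    "ssd": (50, 99),
    "hdd": (0, 49),
}


def tier_of(priority):
    """Return the tier name whose range contains priority, or None."""
    for name, (low, high) in TIER_RANGES.items():
        if low <= priority <= high:
            return name
    return None


def swap_order(devices, allowed_tiers):
    """Return the devices a cgroup may use, highest priority first.

    devices: mapping of device name -> global swap priority.
    allowed_tiers: set of tier names the cgroup has opted in to.
    """
    usable = [
        (priority, name)
        for name, priority in devices.items()
        if tier_of(priority) in allowed_tiers
    ]
    usable.sort(key=lambda item: -item[0])
    return [name for _, name in usable]


# Made-up devices for illustration only.
devices = {"zram0": 150, "nvme0n1p3": 80, "sda2": 10}
print(swap_order(devices, {"compress_ram", "ssd"}))
```

The point is only that the user-visible interface could speak in tiers while
the ordering inside the allowed set is still decided by priority.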

Thank you as always for your valuable insights.

Best regards,
Youngjun Park
