lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aH/baxIgrBI3Z1Hl@yjaykim-PowerEdge-T330>
Date: Wed, 23 Jul 2025 03:41:47 +0900
From: YoungJun Park <youngjun.park@....com>
To: Michal Koutný <mkoutny@...e.com>
Cc: akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org,
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
	muchun.song@...ux.dev, shikemeng@...weicloud.com,
	kasong@...cent.com, nphamcs@...il.com, bhe@...hat.com,
	baohua@...nel.org, chrisl@...nel.org, cgroups@...r.kernel.org,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com,
	iamjoonsoo.kim@....com, taejoon.song@....com
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for
 cgroup-based swap priority

On Tue, Jul 22, 2025 at 11:05:24PM +0900, YoungJun Park wrote:
> On Tue, Jul 22, 2025 at 10:41:20AM +0200, Michal Koutný wrote:
> > On Thu, Jul 17, 2025 at 05:20:03AM +0900, Youngjun Park <youngjun.park@....com> wrote:
> > > +  memory.swap.priority
> > > +    A read-write flat-keyed file which exists on non-root cgroups.
> > > +    This interface allows you to set per-swap-device priorities for the current
> > > +    cgroup and to define how they differ from the global swap system.
> > > +
> > > +    To assign priorities or define specific behaviors for swap devices
> > > +    in the current cgroup, write one or more lines in the following
> > > +    formats:
> > > +
> > > +     - <swap_device_id> <priority>
> > > +     - <swap_device_id> disabled
> > > +     - <swap_device_id> none
> > > +     - default none
> > > +     - default disabled
> > > +
> > > +    Each <swap_device_id> refers to a unique swap device registered
> > > +    in the system. You can check the ID, device path, and current
> > > +    priority of active swap devices through the `/proc/swaps` file.
> > 
> > Do you mean row number as the ID? Or does this depend on some other
> > patches or API?
> 
> You're right to ask for clarification. The `<swap_device_id>` refers
> to a unique identifier added to each swap device entry in `/proc/swaps`.
> I will revise the documentation to make this clearer.
> 
> As a side note, I initially had concerns about breaking the existing ABI.
> However, the additional ID column does not significantly change the
> current output format and is gated behind `CONFIG_SWAP_CGROUP_PRIORITY`,
> so it should be safe and intuitive to expose it through `/proc/swaps
> 
> > > +    This provides a clear mapping between swap devices and the IDs
> > > +    used in this interface.
> > > +
> > > +    The 'default' keyword sets the fallback priority behavior rule for
> > > +    this cgroup. If no specific entry matches a swap device, this default
> > > +    applies.
> > > +
> > > +    * 'default none': This is the default if no configuration
> > > +      is explicitly written. Swap devices follow the system-wide
> > > +      swap priorities.
> > > +
> > > +    * 'default disabled': All swap devices are excluded from this cgroup’s
> > > +      swap priority list and will not be used by this cgroup.
> > 
> > This duplicates memory.swap.max=0. I'm not sure it's thus necessary.
> > At the same time you don't accept 'default <priority>' (that's sane).
> 
> That's a valid observation. While `memory.swap.max=0` controls the overall
> swap usage limit, the `default disabled` entry is intended to disable
> specific swap devices within the scope of this cgroup interface. The
> motivation was to offer more granular control over device selection
> rather than total swap usage.
> 
> > > +
> > > +    The priority semantics are consistent with the global swap system:
> > > +
> > > +      - Higher numerical values indicate higher preference.
> > > +      - See Documentation/admin-guide/mm/swap_numa.rst for details on
> > > +        swap NUMA autobinding and negative priority rules.
> > > +
> > > +    The handling of negative priorities in this cgroup interface
> > > +    has specific behaviors for assignment and restoration:
> > > +
> > > +    * Negative Priority Assignment
> > 
> > Even in Documentation/admin-guide/mm/swap_numa.rst it's part of "Implementation details".
> > I admit I'm daunted by this paragraphs. Is it important for this interface?
> 
> Thank you for pointing this out. My original philosophy was to preserve
> as much of the existing swap functionality as possible, including
> NUMA-aware behaviors.
> 
> However, I agree that the explanation is complex and also not be
> necessary for my proposed usage. After some reflection, I believe the
> implementation (and documentation) will be clearer and simpler without
> supporting negative priorities here. 
> 
> Unless further objections arise, I plan to drop this behavior in the next
> version of the patch, as you suggested. If compelling use cases emerge in
> the future, we can consider reintroducing the support at that time.
> 
> Thanks again for your helpful review!

I'd like to revisit the NUMA-aware swap priority behavior based on
further implementation consideration. After refining the idea 
, I realized there are potential issues
if we fully remove NUMA autobind behavior when cgroup priorities are
set.

For example, suppose the global swap device priorities are configured
as:

  swapA -2
  swapB -3
  swapC -4

If we update the per-cgroup priority of swapA to a positive value, it
feels natural that only swapA should be affected, and swapB/swapC
should remain subject to NUMA autobind as configured globally. In other
words, the presence of one overridden device shouldn't disable autobind
entirely.

Thus, it seems that we may still need to retain some internal structure
for honoring NUMA autobind even when swap cgroup priority is enabled,
at least for the devices not explicitly overridden.

This leaves us with a few design options:

1. Treat negative values as valid priorities. Once any device is
   assigned via `memory.swap.priority`, the NUMA autobind logic is
   entirely disabled.
   - Pros: Simplifies implementation; avoids exposing NUMA autobind via
     cgroup interface.
   - Cons: Overrides autobind for all devices even if only one is set.

2. Continue to treat negative values as NUMA autobind weights, without
   implicit shifting. If a user assigns `-3`, it is stored and used
   exactly as `-3`, and does not affect other devices.
   - Pros: Simple and intuitive; matches current implementation
     semantics.
   - Cons: Autobind semantics still need to be reasoned about when
     using the interface.

3. Disallow setting negative values via `memory.swap.priority`.
   Existing NUMA autobind config is preserved, but no new autobind
   configuration is possible from cgroup interface.
   - Pros: Keeps cgroup interface simple; no autobind manipulation.
   - Cons: Autobind infra remains partially active, increasing code
     complexity.

4. Keep the current design: allow setting negative values to express
   NUMA autobind weights explicitly. Devices without overridden values
   continue to follow NUMA-based dynamic selection.
   - Pros: Preserves current flexibility; gives users control per device.
   - Cons: Slightly more complex semantics; NUMA autobind remains a
     visible part of the interface.

After thinking through these tradeoffs, I'm inclined to think that
preserving the NUMA autobind option might be the better path forward.
What are your thoughts on this?

Thank you again for your helpful feedback.

Best regards,
Youngjun Park

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ