Message-ID: <2cdba052-0c67-40f3-b5fd-dd9dbd08461f@intel.com>
Date: Thu, 26 Jun 2025 17:07:52 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Michal Koutný <mkoutny@...e.com>
CC: Jonathan Corbet <corbet@....net>, Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>, Shakeel Butt <shakeel.butt@...ux.dev>, "Juri
 Lelli" <juri.lelli@...hat.com>, Ben Segall <bsegall@...gle.com>, Libo Chen
	<libo.chen@...cle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
	<vschneid@...hat.com>, Andrew Morton <akpm@...ux-foundation.org>, "Liam R.
 Howlett" <Liam.Howlett@...cle.com>, Lorenzo Stoakes
	<lorenzo.stoakes@...cle.com>, Vlastimil Babka <vbabka@...e.cz>, Phil Auld
	<pauld@...hat.com>, Tejun Heo <tj@...nel.org>, Daniel Jordan
	<daniel.m.jordan@...cle.com>, Jann Horn <jannh@...gle.com>, Pedro Falcato
	<pfalcato@...e.de>, Aubrey Li <aubrey.li@...el.com>, Tim Chen
	<tim.c.chen@...el.com>, "Huang, Ying" <ying.huang@...ux.alibaba.com>,
	<linux-doc@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<linux-mm@...ck.org>, Xunlei Pang <xlpang@...ux.alibaba.com>,
	<huifeng.le@...el.com>
Subject: Re: [PATCH v2] sched/numa: Introduce per cgroup numa balance control

Hi Michal,

Thanks for taking a look.

On 6/25/2025 8:19 PM, Michal Koutný wrote:
> On Wed, Jun 25, 2025 at 06:23:37PM +0800, Chen Yu <yu.c.chen@...el.com> wrote:
>> [Problem Statement]
>> Currently, NUMA balancing is configured system-wide.
>> However, in some production environments, different
>> cgroups may have varying requirements for NUMA balancing.
>> Some cgroups are CPU-intensive, while others are
>> memory-intensive. Some do not benefit from NUMA balancing
>> due to the overhead associated with VMA scanning, while
>> others prefer NUMA balancing as it helps improve memory
>> locality. In this case, system-wide NUMA balancing is
>> usually disabled to avoid causing regressions.
>>
>> [Proposal]
>> Introduce a per-cgroup interface to enable NUMA balancing
>> for specific cgroups.
> 
> The balancing works with task granularity already and this new attribute
> is not much of a resource to control.
> Have you considered a per-task attribute? (sched_setattr(), prctl() or
> similar) That one could be inherited and respective cgroups would be
> seeded with a process with intended values.

OK, the prctl approach should work. However, setting this
attribute via cgroup might be more convenient for userspace,
IMHO. The original requirement stems from cloud environments,
where it is typically unacceptable to require applications to
modify their code to add a prctl() call. Thus, the orchestration
layer must handle this; for example, the initial process of the
container needs adjustment. After consulting with cloud-native
developers, I learned that containerd-shim-runc-v2 serves as the
first process. Therefore, we would need to modify the
containerd-shim-runc-v2 code to set the NUMA balancing attribute
via prctl, so that child processes inherit the setting. With
per-cgroup control, by contrast, the user can simply write to a
single cgroup file.
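
For illustration, a minimal userspace sketch of what that
shim-side change might look like. Note that PR_SET_NUMA_BALANCING
is hypothetical here (a per-process knob along these lines has
been proposed before but is not in mainline), and the constant
value below is made up:

#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SET_NUMA_BALANCING
#define PR_SET_NUMA_BALANCING	1000	/* hypothetical, not a real prctl */
#endif

int main(void)
{
	/* Enable NUMA balancing for this process before spawning the
	 * workload; the setting would be inherited across fork/exec. */
	if (prctl(PR_SET_NUMA_BALANCING, 1, 0, 0, 0))
		perror("prctl");	/* e.g. kernel without the feature */

	/* The shim would fork/exec the container workload here. */
	execl("/usr/bin/workload", "workload", (char *)NULL);
	perror("execl");
	return 1;
}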

> And cpuset could be
> traditionally used to restrict the scope of balancing of such tasks.
> 
> WDYT?
> 

In some scenarios, cgroups serve as micro-service containers.
They are not bound to any cpuset and instead run freely on all
online CPUs. These cgroups can be sensitive to CPU capacity as
well as to NUMA locality (which involves both page migration and
task migration).

>> This interface is associated with the CPU subsystem, which
>> does not support threaded subtrees, and close to CPU bandwidth
>> control.
>   (??) does support
> 

Ah yes, it does support the threaded cgroup type. In that case,
we might need to disable the per-cgroup NUMA balancing control
for threaded cgroups.
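
As a rough kernel-side sketch of such a guard (all names here are
hypothetical and not from the actual patch: the cpu.numa_balancing
write handler, the task_group field, and the cgroup_is_threaded()
call, which today is internal to kernel/cgroup and would need a
proper helper):

static int cpu_numa_balancing_write_u64(struct cgroup_subsys_state *css,
					struct cftype *cft, u64 enable)
{
	if (enable > 1)
		return -EINVAL;

	/* Per-cgroup NUMA balancing works at process granularity,
	 * so refuse threaded subtrees (hypothetical guard). */
	if (cgroup_is_threaded(css->cgroup))
		return -EOPNOTSUPP;

	css_tg(css)->numa_balance = enable;	/* hypothetical field */
	return 0;
}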

>> The system administrator needs to set the NUMA balancing mode to
>> NUMA_BALANCING_CGROUP=4 to enable this feature. When the system is in
>> NUMA_BALANCING_CGROUP mode, NUMA balancing for all cgroups is disabled
>> by default. After the administrator enables this feature for a
>> specific cgroup, NUMA balancing for that cgroup is enabled.
> 
> How much dynamic do you expect such changes to be? In relation to a given
> cgroup's/process's lifecycle.
> 

I think it depends on the design. Starting from Kubernetes v1.33,
there is a feature called "in-place Pod resize," which allows users
to modify CPU and memory requests and limits for containers (via
cgroup interfaces) in a running Pod, often without needing to
restart the container. That said, if an admin wants to adjust
NUMA balancing settings at runtime (after a monitor detects
excessive remote NUMA memory accesses), using prctl would require
iterating through each process in the cgroup and invoking prctl
for each one individually.
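
To make the contrast concrete, a sketch of the two runtime flows
(the cpu.numa_balancing file name and the helper are assumptions,
not actual interfaces; note also that prctl() only acts on the
calling process, so reaching the other tasks would need some
additional pid-targeted mechanism):

# Per-cgroup control: two file writes, no process involvement.
echo 4 > /proc/sys/kernel/numa_balancing           # NUMA_BALANCING_CGROUP mode
echo 1 > /sys/fs/cgroup/mygrp/cpu.numa_balancing   # hypothetical file name

# Per-process control: every task in the cgroup must be reached
# individually, e.g. via some hypothetical pid-targeted helper.
while read -r pid; do
	numa_balance_helper --pid "$pid" --enable 1    # hypothetical
done < /sys/fs/cgroup/mygrp/cgroup.procs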

thanks,
Chenyu

> Thanks,
> Michal
