Message-ID: <CAOBoifjzJ=-siSR=2=3FtKwajSgkXsL40XO2pox0XR4c8vvkzg@mail.gmail.com>
Date: Thu, 8 May 2025 15:39:37 -0700
From: Xi Wang <xii@...gle.com>
To: Waiman Long <llong@...hat.com>
Cc: Frederic Weisbecker <frederic@...nel.org>, Tejun Heo <tj@...nel.org>, linux-kernel@...r.kernel.org,
cgroups@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
David Rientjes <rientjes@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, Johannes Weiner <hannes@...xchg.org>,
Michal Koutný <mkoutny@...e.com>,
Vlastimil Babka <vbabka@...e.cz>, Dan Carpenter <dan.carpenter@...aro.org>, Chen Yu <yu.c.chen@...el.com>,
Kees Cook <kees@...nel.org>, Yu-Chun Lin <eleanor15x@...il.com>,
Thomas Gleixner <tglx@...utronix.de>, Mickaël Salaün <mic@...ikod.net>,
jiangshanlai@...il.com
Subject: Re: [RFC/PATCH] sched: Support moving kthreads into cpuset cgroups
On Thu, May 8, 2025 at 12:35 PM Waiman Long <llong@...hat.com> wrote:
>
> On 5/8/25 1:51 PM, Xi Wang wrote:
> > I think our problem spaces are different. Perhaps your problems are closer to
> > hard real-time systems but our problems are about improving latency of existing
> > systems while maintaining efficiency (max supported cpu util).
> >
> > For hard real-time systems we sometimes throw cores at the problem and run no
> > more than one thread per cpu. But if we want efficiency we have to share cpus
> > and rely on scheduling policies. Disconnecting cpus from the scheduler with
> > isolcpus loses too much of the machine's capacity. CPU scheduling is needed
> > for both kernel and userspace threads.
> >
> > For our use case we need to move kernel threads away from certain vcpu threads,
> > but other vcpu threads can share cpus with kernel threads. The ratio changes
> > from time to time. Permanently putting aside a few cpus results in a reduction
> > in machine capacity.
> >
> > The PF_NO_SETAFFINITY case is already handled by the patch. These threads will
> > run in the root cgroup with the same affinities as before.
> >
> > The original justification for the cpuset feature is here and the reasons are
> > still applicable:
> >
> > "The management of large computer systems, with many processors (CPUs), complex
> > memory cache hierarchies and multiple Memory Nodes having non-uniform access
> > times (NUMA) presents additional challenges for the efficient scheduling and
> > memory placement of processes."
> >
> > "But larger systems, which benefit more from careful processor and memory
> > placement to reduce memory access times and contention.."
> >
> > "These subsets, or “soft partitions” must be able to be dynamically adjusted, as
> > the job mix changes, without impacting other concurrently executing jobs."
> >
> > https://docs.kernel.org/admin-guide/cgroup-v1/cpusets.html
> >
> > -Xi
> >
> If you create a cpuset root partition, we push some kthreads away from the
> CPUs dedicated to the newly created partition, which has its own scheduling
> domain separate from the cgroup root. I do realize that the current way of
> excluding only per-cpu kthreads isn't quite right, so I sent out a new patch
> to extend this to all the PF_NO_SETAFFINITY kthreads.
>
> So instead of putting kthreads into the dedicated cpuset, we still keep them
> in the root cgroup, and we can create a separate cpuset partition to run the
> workload without interference from the background kthreads. Will that
> functionality suit your current need?
>
> Cheers,
> Longman
>
It's likely a major improvement over a fixed partition but maybe still not fully
flexible. I am not familiar with cpuset partitions but I wonder if the following
case can be supported:
Starting from:
  16 cpus
  Root has cpu 0-3, 8-15
  Job A has cpu 4-7 exclusive
Kernel threads cannot run on cpu 4-7, which is good (roughly as sketched below).
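In cgroup v2 terms I imagine the starting point looks roughly like the sketch
below (paths and partition semantics are my assumptions from
Documentation/admin-guide/cgroup-v2.rst, written as a small Python snippet and
not tested):

  import os

  CG = "/sys/fs/cgroup"

  def write(path, value):
      with open(path, "w") as f:
          f.write(value)

  # Enable the cpuset controller for children of the root cgroup.
  write(f"{CG}/cgroup.subtree_control", "+cpuset")

  # Job A: an exclusive partition on cpus 4-7 with its own sched domain.
  os.makedirs(f"{CG}/jobA", exist_ok=True)
  write(f"{CG}/jobA/cpuset.cpus", "4-7")
  write(f"{CG}/jobA/cpuset.cpus.partition", "root")

  # Kthreads stay in the root cgroup; per the discussion above they should
  # then be kept off cpus 4-7 and confined to 0-3,8-15.
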
Now add a best-effort Job B, which runs under SCHED_IDLE and rarely enters kernel
mode. Since we expect B to be easily preempted, we allow it to share cpus with A
and with kernel threads to maximize throughput. Is there a layout that supports
the requirements below (a rough attempt follows the list)?
  Job B threads on cpu 0-15
  Job A threads on cpu 4-7
  No kernel threads on cpu 4-7
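Continuing the same sketch, the part I don't know how to express is something
like this (the pid is just a placeholder, and whether cpuset accepts the overlap
with Job A's exclusive cpus is exactly my question):

  # Job B: best effort, ideally allowed on all 16 cpus including Job A's 4-7.
  os.makedirs(f"{CG}/jobB", exist_ok=True)
  write(f"{CG}/jobB/cpuset.cpus", "0-15")  # may be rejected or clipped to 0-3,8-15

  pid = 12345  # placeholder for a Job B thread
  write(f"{CG}/jobB/cgroup.procs", str(pid))
  os.sched_setscheduler(pid, os.SCHED_IDLE, os.sched_param(0))
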
-Xi