Message-ID: <CAOBoifhXFKu-Y7ZtSBErEZTc+Zp_0-VY6o4A1KM5ii1uzN5iqQ@mail.gmail.com>
Date: Fri, 9 May 2025 09:52:01 -0700
From: Xi Wang <xii@...gle.com>
To: Waiman Long <llong@...hat.com>
Cc: Frederic Weisbecker <frederic@...nel.org>, Tejun Heo <tj@...nel.org>, linux-kernel@...r.kernel.org,
cgroups@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
David Rientjes <rientjes@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, Johannes Weiner <hannes@...xchg.org>,
Michal Koutný <mkoutny@...e.com>,
Vlastimil Babka <vbabka@...e.cz>, Dan Carpenter <dan.carpenter@...aro.org>, Chen Yu <yu.c.chen@...el.com>,
Kees Cook <kees@...nel.org>, Yu-Chun Lin <eleanor15x@...il.com>,
Thomas Gleixner <tglx@...utronix.de>, Mickaël Salaün <mic@...ikod.net>,
jiangshanlai@...il.com
Subject: Re: [RFC/PATCH] sched: Support moving kthreads into cpuset cgroups
On Thu, May 8, 2025 at 5:30 PM Waiman Long <llong@...hat.com> wrote:
>
> On 5/8/25 6:39 PM, Xi Wang wrote:
> > On Thu, May 8, 2025 at 12:35 PM Waiman Long <llong@...hat.com> wrote:
> >> On 5/8/25 1:51 PM, Xi Wang wrote:
> >>> I think our problem spaces are different. Perhaps your problems are closer to
> >>> hard real-time systems, while ours are about improving the latency of existing
> >>> systems while maintaining efficiency (maximum supported cpu utilization).
> >>>
> >>> For hard real-time systems we sometimes throw cores at the problem and run no
> >>> more than one thread per cpu. But if we want efficiency we have to share cpus
> >>> via scheduling policies. Disconnecting cpus from the scheduler with isolcpus
> >>> results in losing too much of the machine's capacity. CPU scheduling is needed
> >>> for both kernel and userspace threads.
> >>>
> >>> For our use case we need to move kernel threads away from certain vcpu threads,
> >>> but other vcpu threads can share cpus with kernel threads. The ratio changes
> >>> from time to time. Permanently putting aside a few cpus results in a reduction
> >>> in machine capacity.
> >>>
> >>> The PF_NO_SETAFFINITY case is already handled by the patch. These threads will
> >>> run in the root cgroup with affinities just like before.
> >>>
> >>> The original justification for the cpuset feature is here, and the reasons are
> >>> still applicable:
> >>>
> >>> "The management of large computer systems, with many processors (CPUs), complex
> >>> memory cache hierarchies and multiple Memory Nodes having non-uniform access
> >>> times (NUMA) presents additional challenges for the efficient scheduling and
> >>> memory placement of processes."
> >>>
> >>> "But larger systems, which benefit more from careful processor and memory
> >>> placement to reduce memory access times and contention.."
> >>>
> >>> "These subsets, or “soft partitions” must be able to be dynamically adjusted, as
> >>> the job mix changes, without impacting other concurrently executing jobs."
> >>>
> >>> https://docs.kernel.org/admin-guide/cgroup-v1/cpusets.html
> >>>
> >>> -Xi
> >>>
> >> If you create a cpuset root partition, we are pushing some kthreads
> >> away from the CPUs dedicated to the newly created partition, which has
> >> its own scheduling domain separate from the cgroup root. I do realize
> >> that the current way of excluding only per-cpu kthreads isn't quite
> >> right, so I sent out a new patch to extend this to all PF_NO_SETAFFINITY
> >> kthreads.
> >>
> >> So instead of putting kthreads into the dedicated cpuset, we keep them
> >> in the root cgroup and create a separate cpuset partition to run the
> >> workload without interference from the background kthreads. Will that
> >> functionality suit your current need?
> >>
> >> Cheers,
> >> Longman
> >>
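For concreteness, a minimal sketch of the partition approach as I
understand it (cgroup v2; the mount path, the "jobA" name and the
$JOB_A_PID variable are made up, and I am assuming a kernel with
cpuset.cpus.partition):

    # make jobA a root partition on cpus 4-7
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
    mkdir /sys/fs/cgroup/jobA
    echo 4-7 > /sys/fs/cgroup/jobA/cpuset.cpus
    echo root > /sys/fs/cgroup/jobA/cpuset.cpus.partition
    # jobA now has its own scheduling domain; unbound kthreads stay in
    # the root cgroup, whose effective cpus no longer include 4-7
    echo "$JOB_A_PID" > /sys/fs/cgroup/jobA/cgroup.procs
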
> > It's likely a major improvement over a fixed partition, but it may still not be
> > fully flexible. I am not familiar with cpuset partitions, but I wonder if the
> > following case can be supported:
> >
> > Starting from
> > 16 cpus
> > Root has cpu 0-3, 8-15
> > Job A has cpu 4-7 exclusive
> > Kernel threads cannot run on cpu 4-7, which is good.
> There will still be some kernel threads with the PF_NO_SETAFFINITY flag set.
>
> >
> > Now add best-effort Job C, which runs under SCHED_IDLE and rarely enters kernel
> > mode. As we expect C to be easily preempted, we allow it to share cpus with A and
> > with kernel threads to maximize throughput. Is there a layout that supports the
> > requirements below?
> >
> > Job C threads on cpu 0-15
>
> A task/thread can only be in one cpuset, so it cannot span all the CPUs.
> However, if there are multiple threads within the process, some of the
> threads can be moved to a different cpuset once the cgroup is made
> threaded. With the proper thread placement, you can have a job with
> threads spanning all the CPUs.
>
> Cheers,
> Longman
>
> > Job A threads on cpu 4-7
> > No kernel threads on cpu 4-7
> >
> > -Xi
> >
>
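If I follow the threaded-cgroup suggestion, the mechanics would be
roughly as below (again only a sketch: the "jobC"/"wide" names and the
$JOB_C_PID/$TID variables are made up):

    # host the process in jobC, then place individual threads
    mkdir /sys/fs/cgroup/jobC
    echo "$JOB_C_PID" > /sys/fs/cgroup/jobC/cgroup.procs
    mkdir /sys/fs/cgroup/jobC/wide
    echo threaded > /sys/fs/cgroup/jobC/wide/cgroup.type  # jobC becomes a threaded domain
    echo +cpuset > /sys/fs/cgroup/jobC/cgroup.subtree_control
    echo 0-15 > /sys/fs/cgroup/jobC/wide/cpuset.cpus
    # cpus held exclusively by a sibling partition (4-7 above) drop
    # out of the effective set here, which is the crux of my question
    echo "$TID" > /sys/fs/cgroup/jobC/wide/cgroup.threads  # move one thread
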
Partitions cannot have overlapping cpus, but regular cpusets can. This is
probably where regular cpusets are still more flexible.
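For example (illustrative paths and cpu numbers; plain non-partition
cpusets, assuming +cpuset is enabled at the root as above):

    # sibling cpusets without partitions may overlap
    mkdir /sys/fs/cgroup/a /sys/fs/cgroup/b
    echo 4-7 > /sys/fs/cgroup/a/cpuset.cpus
    echo 0-15 > /sys/fs/cgroup/b/cpuset.cpus    # overlap with a is fine
    # a partition root needs exclusive cpus, so this is rejected (or
    # the partition turns invalid) while b overlaps:
    echo root > /sys/fs/cgroup/a/cpuset.cpus.partition
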
-Xi