linux-kernel - Re: [RFC/PATCH] sched: Support moving kthreads into cpuset cgroups


Message-ID: <CAOBoifh4BY1f4B3EfDvqWCxNSV8zwmJPNoR3bLOA7YO11uGBCQ@mail.gmail.com>
Date: Tue, 6 May 2025 20:43:57 -0700
From: Xi Wang <xii@...gle.com>
To: Tejun Heo <tj@...nel.org>
Cc: linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, 
	Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, David Rientjes <rientjes@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>, Waiman Long <longman@...hat.com>, 
	Johannes Weiner <hannes@...xchg.org>, Michal Koutný <mkoutny@...e.com>, 
	Lai Jiangshan <jiangshanlai@...il.com1>, Frederic Weisbecker <frederic@...nel.org>, 
	Vlastimil Babka <vbabka@...e.cz>, Dan Carpenter <dan.carpenter@...aro.org>, Chen Yu <yu.c.chen@...el.com>, 
	Kees Cook <kees@...nel.org>, Yu-Chun Lin <eleanor15x@...il.com>, 
	Thomas Gleixner <tglx@...utronix.de>, Mickaël Salaün <mic@...ikod.net>
Subject: Re: [RFC/PATCH] sched: Support moving kthreads into cpuset cgroups

On Tue, May 6, 2025 at 5:17 PM Tejun Heo <tj@...nel.org> wrote:
>
> Hello,
>
> On Tue, May 06, 2025 at 11:35:32AM -0700, Xi Wang wrote:
> > In theory we should be able to manage kernel tasks with cpuset
> > cgroups just like user tasks, which would be a flexible way to limit
> > interference with real-time and other sensitive workloads. This is
> > however not supported today: When setting cpu affinity for kthreads,
> > kernel code uses a simpler control path that leads directly to
> > __set_cpus_allowed_ptr or __kthread_bind_mask. Neither honors cpuset
> > restrictions.
> >
> > This patch adds cpuset support for kernel tasks by merging userspace
> > and kernel cpu affinity control paths and applying the same
> > restrictions to kthreads.
> >
> > The PF_NO_SETAFFINITY flag is still supported for tasks that have to
> > run with certain cpu affinities. Kernel ensures kthreads with this
> > flag have their affinities locked and they stay in the root cpuset:
> >
> > If userspace moves kthreadd out of the root cpuset (see example
> > below), a newly forked kthread will be in a non root cgroup as well.
> > If PF_NO_SETAFFINITY is detected for the kthread, it will move itself
> > into the root cpuset before the threadfn is called. This does depend
> > on the kthread create -> kthread bind -> wake up sequence.
>
> Can you describe the use cases in detail? This is not in line with the
> overall direction. e.g. We're making cpuset work with housekeeping mechanism
> and tell workqueue which CPUs can be used for unbound execution and kthreads
> which are closely tied to userspace activities are spawned into the same
> cgroups as the user thread and subject to usual resource control.
>
> There are a lot of risks in subjecting arbitrary kthreads to all cgroup
> resource controls and just allowing cpuset doesn't seem like a great idea.
> Integration through housekeeping makes a lot more sense to me. Note that
> even for just cpuset thread level control doesn't really work that well. All
> kthreads are forked by kthreadd. If you move the kthreadd into a cgroup, all
> kthreads including kworkers for all workqueues will be spawned there. The
> granularity of control isn't much better than going through housekeeping.

For the use cases, there are two major requirements at the moment:

Dynamic cpu affinity based isolation: The set of CPUs running latency-sensitive
threads (vcpu threads) can change over time. We'd like to configure kernel
thread affinity at run time too. Changing cpu affinity at run time requires
cpumask calculations and thread migrations. cpuset already does both, so
sharing its code would be nice.
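
As a rough illustration of the cpumask calculation mentioned above, here is a
minimal userspace sketch in Python. It is not code from the patch: the helper
names (parse_cpulist, format_cpulist, kthread_cpus) are hypothetical, and the
fallback to all online CPUs when nothing is left over is an assumption made
for the example, not a description of what cpuset does.

```python
def parse_cpulist(text):
    """Parse a cpuset-style list such as "0-3,8" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def format_cpulist(cpus):
    """Format a set of CPU ids as a cpuset-style list such as "0-3,8"."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current run
        else:
            ranges.append([cpu, cpu])    # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def kthread_cpus(online, isolated):
    """CPUs left for kernel threads once the isolated CPUs are removed.

    Assumption for this sketch: if no CPU is left, fall back to all
    online CPUs rather than returning an empty mask.
    """
    allowed = parse_cpulist(online) - parse_cpulist(isolated)
    return format_cpulist(allowed or parse_cpulist(online))
```

When the set of vcpu CPUs changes, kthread_cpus() would be recomputed and the
result written to the kthread cgroup's cpuset.cpus; the thread migrations then
follow from the cpuset update.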

NUMA-based memory daemon affinity: We'd like to restrict kernel memory daemons
while maintaining their NUMA affinity. A cgroup hierarchy can help, e.g. create
kernel, kernel/node0 and kernel/node1 and move each daemon to the cgroup for
its node.
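
The kernel, kernel/node0, kernel/node1 layout above could be set up roughly as
follows. This is only a sketch: it builds the list of directories and cgroup v2
file writes without executing anything, the function name numa_cgroup_plan is
hypothetical, and the per-node CPU lists and daemon PIDs are inputs the caller
has to supply (for instance from /sys/devices/system/node/nodeN/cpulist).

```python
def numa_cgroup_plan(node_cpus, daemons, root="/sys/fs/cgroup", parent="kernel"):
    """Return (mkdirs, writes) for a kernel/nodeN layout; nothing is executed.

    node_cpus: {node_id: cpulist string}, e.g. {0: "0-15", 1: "16-31"}
    daemons:   {pid: node_id} for the kernel memory daemons to place
    """
    base = f"{root}/{parent}"
    mkdirs = [base]
    # cpuset must be enabled top-down before the children can use it.
    writes = [
        (f"{root}/cgroup.subtree_control", "+cpuset"),
        (f"{base}/cgroup.subtree_control", "+cpuset"),
    ]
    for node in sorted(node_cpus):
        leaf = f"{base}/node{node}"
        mkdirs.append(leaf)
        writes.append((f"{leaf}/cpuset.cpus", node_cpus[node]))
        writes.append((f"{leaf}/cpuset.mems", str(node)))
    for pid in sorted(daemons):
        writes.append((f"{base}/node{daemons[pid]}/cgroup.procs", str(pid)))
    return mkdirs, writes
```

Applying the plan means creating the directories in mkdirs first and then
performing the writes in order. Moving the daemons only takes effect for
kthreads if the kernel allows them into cpuset cgroups, which is what this
patch is about.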

Workqueue coverage is optional. kworker threads can keep using their own
separate mechanisms.

Since the goal is isolation, we'd like to restrict as many kthreads as possible,
even the ones that don't directly interact with user applications.

The kthreadd case is handled: a new kthread can be forked inside a non-root
cgroup, but based on its flags it can move itself to the root cgroup before
threadfn is called.
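
A toy model of that decision, as a Python sketch rather than the kernel code in
the patch: start_kthread and the cgroup path strings are hypothetical, and only
the PF_NO_SETAFFINITY check described above is modeled.

```python
PF_NO_SETAFFINITY = 0x04000000  # flag value as defined in include/linux/sched.h

ROOT_CGROUP = "/"


def start_kthread(kthreadd_cgroup, flags):
    """Return the cgroup a new kthread is in when its threadfn starts."""
    cgroup = kthreadd_cgroup        # fork: the child inherits kthreadd's cgroup
    if flags & PF_NO_SETAFFINITY:   # affinity is locked, e.g. after kthread bind
        cgroup = ROOT_CGROUP        # so the kthread moves itself back to root
    return cgroup
```

With kthreadd in a hypothetical /kernel cgroup, an unbound kthread stays in
/kernel while a kthread with PF_NO_SETAFFINITY set ends up in the root cgroup.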

-Xi