Message-ID: <d894e74f-6499-4a41-8744-efbcf279a9d9@redhat.com>
Date: Mon, 8 Sep 2025 15:39:48 -0400
From: Waiman Long <llong@...hat.com>
To: Yi Tao <escape@...ux.alibaba.com>, tj@...nel.org, hannes@...xchg.org,
mkoutny@...e.com
Cc: cgroups@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/1] cgroup: replace global percpu_rwsem with per
 threadgroup rwsem when writing to cgroup.procs
On 9/8/25 6:20 AM, Yi Tao wrote:
> As computer hardware advances, modern systems are typically equipped
> with many CPU cores and large amounts of memory, enabling the deployment
> of numerous applications. On such systems, container creation and
> deletion become frequent operations, making cgroup process migration no
> longer a cold path. This leads to noticeable contention with common
> process operations such as fork, exec, and exit.
>
> To alleviate the contention between cgroup process migration and
> operations like process fork, this patch changes the locking to take the
> write lock on signal_struct->group_rwsem when writing a pid to
> cgroup.procs/threads instead of holding a global write lock.
>
> Cgroup process migration has historically relied on
> signal_struct->group_rwsem to protect thread group integrity. In commit
> <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
> a global percpu_rwsem"), this was changed to a global
> cgroup_threadgroup_rwsem. The advantage of using a global lock was
> simplified handling of process group migrations. This patch retains the
> use of the global lock for protecting process group migration, while
> reducing contention by using a per-threadgroup lock during
> cgroup.procs/threads writes.
>
> The locking behavior is as follows:
>
> write cgroup.procs/threads | process fork,exec,exit | process group migration
> ------------------------------------------------------------------------------
> cgroup_lock()              | down_read(&g_rwsem)    | cgroup_lock()
> down_write(&p_rwsem)       | down_read(&p_rwsem)    | down_write(&g_rwsem)
> critical section           | critical section       | critical section
> up_write(&p_rwsem)         | up_read(&p_rwsem)      | up_write(&g_rwsem)
> cgroup_unlock()            | up_read(&g_rwsem)      | cgroup_unlock()
>
> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
> signal_struct->group_rwsem.
>
> This patch eliminates contention between cgroup migration and fork
> operations for threads that belong to different thread groups, thereby
> reducing the long-tail latency of cgroup migrations and lowering system
> load.
>
> With this patch, under heavy fork and exec interference, the long-tail
> latency of cgroup migration has been reduced from milliseconds to
> microseconds. Under heavy cgroup migration interference, the multi-CPU
> score of the spawn test case in UnixBench increased by 9%.
>
> The static usage pattern of creating a cgroup, enabling controllers,
> and then seeding it with CLONE_INTO_CGROUP doesn't require write
> locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
> patch.
>
> To avoid affecting other users, the per threadgroup rwsem is only used
> when the `favordynmods` flag is enabled.
>
> Signed-off-by: Yi Tao <escape@...ux.alibaba.com>
> ---
> include/linux/cgroup-defs.h | 12 +++-
> include/linux/sched/signal.h | 4 ++
> init/init_task.c | 3 +
> kernel/cgroup/cgroup-internal.h | 4 +-
> kernel/cgroup/cgroup-v1.c | 8 +--
> kernel/cgroup/cgroup.c | 105 ++++++++++++++++++++++----------
> kernel/fork.c | 4 ++
> 7 files changed, 101 insertions(+), 39 deletions(-)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 6b93a64115fe..5033e3bdac9e 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -88,7 +88,8 @@ enum {
> /*
> * Reduce latencies on dynamic cgroup modifications such as task
> * migrations and controller on/offs by disabling percpu operation on
> - * cgroup_threadgroup_rwsem. This makes hot path operations such as
> + * cgroup_threadgroup_rwsem and taking per threadgroup rwsem when
> + * writing to cgroup.procs. This makes hot path operations such as
> * forks and exits into the slow path and more expensive.
> *
> * The static usage pattern of creating a cgroup, enabling controllers,
> @@ -828,16 +829,21 @@ struct cgroup_of_peak {
> struct list_head list;
> };
>
> +extern int take_per_threadgroup_rwsem;
> +
> /**
> * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
> * @tsk: target task
> *
> * Allows cgroup operations to synchronize against threadgroup changes
> - * using a percpu_rw_semaphore.
> + * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when
> + * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
> */
> static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
> {
> percpu_down_read(&cgroup_threadgroup_rwsem);
> + if (take_per_threadgroup_rwsem)
> + down_read(&tsk->signal->cgroup_threadgroup_rwsem);
> }
>
> /**
> @@ -848,6 +854,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
> */
> static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
> {
> + if (take_per_threadgroup_rwsem)
> + up_read(&tsk->signal->cgroup_threadgroup_rwsem);
> percpu_up_read(&cgroup_threadgroup_rwsem);
> }
>
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 1ef1edbaaf79..7d6449982822 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -226,6 +226,10 @@ struct signal_struct {
> struct tty_audit_buf *tty_audit_buf;
> #endif
>
> +#ifdef CONFIG_CGROUPS
> + struct rw_semaphore cgroup_threadgroup_rwsem;
> +#endif
> +
> /*
> * Thread is the potential origin of an oom condition; kill first on
> * oom
> diff --git a/init/init_task.c b/init/init_task.c
> index e557f622bd90..a55e2189206f 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
> },
> .multiprocess = HLIST_HEAD_INIT,
> .rlim = INIT_RLIMITS,
> +#ifdef CONFIG_CGROUPS
> + .cgroup_threadgroup_rwsem = __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
> +#endif
> .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
> .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
> #ifdef CONFIG_POSIX_TIMERS
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index b14e61c64a34..318cc7f22e2c 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
>
> int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
> bool threadgroup);
> -void cgroup_attach_lock(bool lock_threadgroup);
> -void cgroup_attach_unlock(bool lock_threadgroup);
> +void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk);
> +void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk);
> struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
> bool *locked)
> __acquires(&cgroup_threadgroup_rwsem);
> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> index 2a4a387f867a..65e9b454780c 100644
> --- a/kernel/cgroup/cgroup-v1.c
> +++ b/kernel/cgroup/cgroup-v1.c
> @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
> int retval = 0;
>
> cgroup_lock();
> - cgroup_attach_lock(true);
> + cgroup_attach_lock(true, NULL);
> for_each_root(root) {
> struct cgroup *from_cgrp;
>
> @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
> if (retval)
> break;
> }
> - cgroup_attach_unlock(true);
> + cgroup_attach_unlock(true, NULL);
> cgroup_unlock();
>
> return retval;
> @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
>
> cgroup_lock();
>
> - cgroup_attach_lock(true);
> + cgroup_attach_lock(true, NULL);
>
> /* all tasks in @from are being moved, all csets are source */
> spin_lock_irq(&css_set_lock);
> @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
> } while (task && !ret);
> out_err:
> cgroup_migrate_finish(&mgctx);
> - cgroup_attach_unlock(true);
> + cgroup_attach_unlock(true, NULL);
> cgroup_unlock();
> return ret;
> }
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 312c6a8b55bb..8650ec394d0c 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1298,18 +1298,29 @@ struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root)
> return root_cgrp->root;
> }
>
> +int take_per_threadgroup_rwsem;
I would suggest adding the "__read_mostly" attribute to avoid the chance
of false cacheline sharing.
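Something like this, just as a rough sketch:

  /*
   * Written only when favordynmods is flipped, but read in every
   * fork/exec/exit, so keep it out of cachelines that see frequent writes.
   */
  int take_per_threadgroup_rwsem __read_mostly;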
> +
> void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
> {
> bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
>
> - /* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
> + /*
> + * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
> + * favordynmods can flip while task is between
> + * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end,
> + * so down_write global cgroup_threadgroup_rwsem to synchronize them.
> + */
> + percpu_down_write(&cgroup_threadgroup_rwsem);
> if (favor && !favoring) {
> + take_per_threadgroup_rwsem++;
> rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
> root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
> } else if (!favor && favoring) {
> + take_per_threadgroup_rwsem--;
> rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
> root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
> }
> + percpu_up_write(&cgroup_threadgroup_rwsem);
> }
>
Changing take_per_threadgroup_rwsem inside the cgroup_threadgroup_rwsem
critical section ensures that the flag won't change between
cgroup_threadgroup_change_begin() and cgroup_threadgroup_change_end().
However, it may still change between cgroup_attach_lock() and
cgroup_attach_unlock().
Since cgroup_attach_[un]lock() already takes a threadgroup_locked
argument, we can extend it to a flag word that holds 2 flag bits - one
for threadgroup_locked and another for whether the per-threadgroup rwsem
has been taken or not. That way take_per_threadgroup_rwsem will only be
checked in cgroup_attach_lock(). cgroup_attach_lock() can either return
a value to indicate the state, or the argument can be changed into a flag
pointer with the new flag bit added internally.
It should be a separate patch if you decide to do the extension.
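As a rough sketch of what I mean (the flag names and exact semantics are
illustrative, not taken from your patch):

  #define CGRP_ATTACH_LOCK_GLOBAL	BIT(0)	/* global percpu rwsem write-locked */
  #define CGRP_ATTACH_LOCK_PER_TGRP	BIT(1)	/* per-threadgroup rwsem write-locked */

  static unsigned int cgroup_attach_lock(bool lock_threadgroup,
  					 struct task_struct *tsk)
  {
  	unsigned int flags = 0;

  	if (!lock_threadgroup)
  		return 0;

  	if (take_per_threadgroup_rwsem && tsk) {
  		/* favordynmods: only this thread group needs to be frozen */
  		down_write(&tsk->signal->cgroup_threadgroup_rwsem);
  		flags |= CGRP_ATTACH_LOCK_PER_TGRP;
  	} else {
  		percpu_down_write(&cgroup_threadgroup_rwsem);
  		flags |= CGRP_ATTACH_LOCK_GLOBAL;
  	}
  	return flags;
  }

  static void cgroup_attach_unlock(unsigned int flags, struct task_struct *tsk)
  {
  	if (flags & CGRP_ATTACH_LOCK_PER_TGRP)
  		up_write(&tsk->signal->cgroup_threadgroup_rwsem);
  	else if (flags & CGRP_ATTACH_LOCK_GLOBAL)
  		percpu_up_write(&cgroup_threadgroup_rwsem);
  }

The unlock side never looks at take_per_threadgroup_rwsem, so a
favordynmods flip in the middle cannot unbalance the locking.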
Cheers,
Longman