Message-ID: <d894e74f-6499-4a41-8744-efbcf279a9d9@redhat.com>
Date: Mon, 8 Sep 2025 15:39:48 -0400
From: Waiman Long <llong@...hat.com>
To: Yi Tao <escape@...ux.alibaba.com>, tj@...nel.org, hannes@...xchg.org,
mkoutny@...e.com
Cc: cgroups@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/1] cgroup: replace global percpu_rwsem with per
 threadgroup rwsem when writing to cgroup.procs
On 9/8/25 6:20 AM, Yi Tao wrote:
> As computer hardware advances, modern systems are typically equipped
> with many CPU cores and large amounts of memory, enabling the deployment
> of numerous applications. On such systems, container creation and
> deletion become frequent operations, making cgroup process migration no
> longer a cold path. This leads to noticeable contention with common
> process operations such as fork, exec, and exit.
>
> To alleviate the contention between cgroup process migration and
> operations like process fork, this patch changes the locking to take the
> write lock on signal_struct->group_rwsem when writing a pid to
> cgroup.procs/threads instead of holding a global write lock.
>
> Cgroup process migration has historically relied on
> signal_struct->group_rwsem to protect thread group integrity. In commit
> <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
> a global percpu_rwsem"), this was changed to a global
> cgroup_threadgroup_rwsem. The advantage of using a global lock was
> simplified handling of process group migrations. This patch retains the
> use of the global lock for protecting process group migration, while
> reducing contention by using a per-threadgroup lock during
> cgroup.procs/threads writes.
>
> The locking behavior is as follows:
>
> write cgroup.procs/threads | process fork,exec,exit | process group migration
> ------------------------------------------------------------------------------
> cgroup_lock()              | down_read(&g_rwsem)    | cgroup_lock()
> down_write(&p_rwsem)       | down_read(&p_rwsem)    | down_write(&g_rwsem)
> critical section           | critical section       | critical section
> up_write(&p_rwsem)         | up_read(&p_rwsem)      | up_write(&g_rwsem)
> cgroup_unlock()            | up_read(&g_rwsem)      | cgroup_unlock()
>
> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
> signal_struct->group_rwsem.
>
> This patch eliminates contention between cgroup migration and fork
> operations for threads that belong to different thread groups, thereby
> reducing the long-tail latency of cgroup migrations and lowering system
> load.
>
> With this patch, under heavy fork and exec interference, the long-tail
> latency of cgroup migration has been reduced from milliseconds to
> microseconds. Under heavy cgroup migration interference, the multi-CPU
> score of the spawn test case in UnixBench increased by 9%.
>
> The static usage pattern of creating a cgroup, enabling controllers,
> and then seeding it with CLONE_INTO_CGROUP doesn't require write
> locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
> patch.
>
> To avoid affecting other users, the per threadgroup rwsem is only used
> when the `favordynmods` flag is enabled.
>
> Signed-off-by: Yi Tao <escape@...ux.alibaba.com>
> ---
> include/linux/cgroup-defs.h | 12 +++-
> include/linux/sched/signal.h | 4 ++
> init/init_task.c | 3 +
> kernel/cgroup/cgroup-internal.h | 4 +-
> kernel/cgroup/cgroup-v1.c | 8 +--
> kernel/cgroup/cgroup.c | 105 ++++++++++++++++++++++----------
> kernel/fork.c | 4 ++
> 7 files changed, 101 insertions(+), 39 deletions(-)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 6b93a64115fe..5033e3bdac9e 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -88,7 +88,8 @@ enum {
> /*
> * Reduce latencies on dynamic cgroup modifications such as task
> * migrations and controller on/offs by disabling percpu operation on
> - * cgroup_threadgroup_rwsem. This makes hot path operations such as
> + * cgroup_threadgroup_rwsem and taking per threadgroup rwsem when
> + * writing to cgroup.procs. This makes hot path operations such as
> * forks and exits into the slow path and more expensive.
> *
> * The static usage pattern of creating a cgroup, enabling controllers,
> @@ -828,16 +829,21 @@ struct cgroup_of_peak {
> struct list_head list;
> };
>
> +extern int take_per_threadgroup_rwsem;
> +
> /**
> * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
> * @tsk: target task
> *
> * Allows cgroup operations to synchronize against threadgroup changes
> - * using a percpu_rw_semaphore.
> + * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when
> + * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
> */
> static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
> {
> percpu_down_read(&cgroup_threadgroup_rwsem);
> + if (take_per_threadgroup_rwsem)
> + down_read(&tsk->signal->cgroup_threadgroup_rwsem);
> }
>
> /**
> @@ -848,6 +854,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
> */
> static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
> {
> + if (take_per_threadgroup_rwsem)
> + up_read(&tsk->signal->cgroup_threadgroup_rwsem);
> percpu_up_read(&cgroup_threadgroup_rwsem);
> }
>
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 1ef1edbaaf79..7d6449982822 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -226,6 +226,10 @@ struct signal_struct {
> struct tty_audit_buf *tty_audit_buf;
> #endif
>
> +#ifdef CONFIG_CGROUPS
> + struct rw_semaphore cgroup_threadgroup_rwsem;
> +#endif
> +
> /*
> * Thread is the potential origin of an oom condition; kill first on
> * oom
> diff --git a/init/init_task.c b/init/init_task.c
> index e557f622bd90..a55e2189206f 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
> },
> .multiprocess = HLIST_HEAD_INIT,
> .rlim = INIT_RLIMITS,
> +#ifdef CONFIG_CGROUPS
> + .cgroup_threadgroup_rwsem = __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
> +#endif
> .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
> .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
> #ifdef CONFIG_POSIX_TIMERS
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index b14e61c64a34..318cc7f22e2c 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
>
> int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
> bool threadgroup);
> -void cgroup_attach_lock(bool lock_threadgroup);
> -void cgroup_attach_unlock(bool lock_threadgroup);
> +void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk);
> +void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk);
> struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
> bool *locked)
> __acquires(&cgroup_threadgroup_rwsem);
> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> index 2a4a387f867a..65e9b454780c 100644
> --- a/kernel/cgroup/cgroup-v1.c
> +++ b/kernel/cgroup/cgroup-v1.c
> @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
> int retval = 0;
>
> cgroup_lock();
> - cgroup_attach_lock(true);
> + cgroup_attach_lock(true, NULL);
> for_each_root(root) {
> struct cgroup *from_cgrp;
>
> @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
> if (retval)
> break;
> }
> - cgroup_attach_unlock(true);
> + cgroup_attach_unlock(true, NULL);
> cgroup_unlock();
>
> return retval;
> @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
>
> cgroup_lock();
>
> - cgroup_attach_lock(true);
> + cgroup_attach_lock(true, NULL);
>
> /* all tasks in @from are being moved, all csets are source */
> spin_lock_irq(&css_set_lock);
> @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
> } while (task && !ret);
> out_err:
> cgroup_migrate_finish(&mgctx);
> - cgroup_attach_unlock(true);
> + cgroup_attach_unlock(true, NULL);
> cgroup_unlock();
> return ret;
> }
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 312c6a8b55bb..8650ec394d0c 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1298,18 +1298,29 @@ struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root)
> return root_cgrp->root;
> }
>
> +int take_per_threadgroup_rwsem;
I would suggest adding the "__read_mostly" attribute to avoid the chance
of false cacheline sharing.
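Something like this, just as a rough sketch:

  /*
   * Written only when favordynmods is flipped, but read in every
   * fork/exec/exit, so keep it out of cachelines that see frequent writes.
   */
  int take_per_threadgroup_rwsem __read_mostly;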
> +
> void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
> {
> bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
>
> - /* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
> + /*
> + * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
> + * favordynmods can flip while task is between
> + * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end,
> + * so down_write global cgroup_threadgroup_rwsem to synchronize them.
> + */
> + percpu_down_write(&cgroup_threadgroup_rwsem);
> if (favor && !favoring) {
> + take_per_threadgroup_rwsem++;
> rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
> root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
> } else if (!favor && favoring) {
> + take_per_threadgroup_rwsem--;
> rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
> root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
> }
> + percpu_up_write(&cgroup_threadgroup_rwsem);
> }
>
Changing take_per_threadgroup_rwsem inside the cgroup_threadgroup_rwsem
critical section ensures that the flag won't change between
cgroup_threadgroup_change_begin() and cgroup_threadgroup_change_end().
However, it may still change between cgroup_attach_lock() and
cgroup_attach_unlock().
Since cgroup_attach_[un]lock() already takes a threadgroup_locked
argument, we can extend it to a flag word that holds 2 flag bits - one
for threadgroup_locked and another for whether the per-threadgroup rwsem
has been taken or not. That way take_per_threadgroup_rwsem will only be
checked in cgroup_attach_lock(). cgroup_attach_lock() can either return
a value to indicate the state, or the argument can be changed into a flag
pointer with the new flag bit added internally.
It should be a separate patch if you decide to do the extension.
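As a rough sketch of what I mean (the flag names and exact semantics are
illustrative, not taken from your patch):

  #define CGRP_ATTACH_LOCK_GLOBAL	BIT(0)	/* global percpu rwsem write-locked */
  #define CGRP_ATTACH_LOCK_PER_TGRP	BIT(1)	/* per-threadgroup rwsem write-locked */

  static unsigned int cgroup_attach_lock(bool lock_threadgroup,
  					 struct task_struct *tsk)
  {
  	unsigned int flags = 0;

  	if (!lock_threadgroup)
  		return 0;

  	if (take_per_threadgroup_rwsem && tsk) {
  		/* favordynmods: only this thread group needs to be frozen */
  		down_write(&tsk->signal->cgroup_threadgroup_rwsem);
  		flags |= CGRP_ATTACH_LOCK_PER_TGRP;
  	} else {
  		percpu_down_write(&cgroup_threadgroup_rwsem);
  		flags |= CGRP_ATTACH_LOCK_GLOBAL;
  	}
  	return flags;
  }

  static void cgroup_attach_unlock(unsigned int flags, struct task_struct *tsk)
  {
  	if (flags & CGRP_ATTACH_LOCK_PER_TGRP)
  		up_write(&tsk->signal->cgroup_threadgroup_rwsem);
  	else if (flags & CGRP_ATTACH_LOCK_GLOBAL)
  		percpu_up_write(&cgroup_threadgroup_rwsem);
  }

The unlock side never looks at take_per_threadgroup_rwsem, so a
favordynmods flip in the middle cannot unbalance the locking.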
Cheers,
Longman