Message-ID: <aL7-4XQxeFKtFWlq@slm.duckdns.org>
Date: Mon, 8 Sep 2025 06:05:53 -1000
From: Tejun Heo <tj@...nel.org>
To: Yi Tao <escape@...ux.alibaba.com>
Cc: hannes@...xchg.org, mkoutny@...e.com, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/1] cgroup: replace global percpu_rwsem with per
threadgroup rwsem when writing to cgroup.procs
Hello,
On Mon, Sep 08, 2025 at 06:20:27PM +0800, Yi Tao wrote:
...
> The static usage pattern of creating a cgroup, enabling controllers,
> and then seeding it with CLONE_INTO_CGROUP doesn't require write
> locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
> patch.
Please bring this to the top and note that this is the default mode of
operation and that the mechanism being introduced is thus an optional one.
> @@ -88,7 +88,8 @@ enum {
> /*
> * Reduce latencies on dynamic cgroup modifications such as task
> * migrations and controller on/offs by disabling percpu operation on
> - * cgroup_threadgroup_rwsem. This makes hot path operations such as
> + * cgroup_threadgroup_rwsem and taking per threadgroup rwsem when
> + * writing to cgroup.procs. This makes hot path operations such as
> * forks and exits into the slow path and more expensive.
This comment is referenced from all the other places. Please expand on why
the per-threadgroup rwsem is beneficial and for which use cases.
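Something along these lines, for example (a rough sketch of the direction
only, please word it to match the actual rationale and your numbers):

        /*
         * Reduce latencies on dynamic cgroup modifications such as task
         * migrations and controller on/offs by disabling percpu operation on
         * cgroup_threadgroup_rwsem and taking a per threadgroup rwsem when
         * writing to cgroup.procs. A cgroup.procs write then doesn't need to
         * write-lock the global rwsem, so forks and exits of unrelated
         * processes aren't stalled behind it. The trade-off is that forks
         * and exits go through the slow path and become more expensive.
         */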
> * The static usage pattern of creating a cgroup, enabling controllers,
> @@ -828,16 +829,21 @@ struct cgroup_of_peak {
> struct list_head list;
> };
>
> +extern int take_per_threadgroup_rwsem;
I think it needs cgroup in its name. Maybe something like
cgroup_enable_per_threadgroup_rwsem?
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 312c6a8b55bb..8650ec394d0c 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1298,18 +1298,29 @@ struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root)
> return root_cgrp->root;
> }
>
> +int take_per_threadgroup_rwsem;
Please put it where the other global variables are and add a comment noting
what it does and that writes are protected by both cgroup_mutex and the
write-lock of cgroup_threadgroup_rwsem, and thus that reads are protected by
either.
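Something like the following, for example (rough sketch only, using the
cgroup_enable_per_threadgroup_rwsem name and the bool type suggested
elsewhere in this mail):

        /*
         * When set, writes to cgroup.procs take the per threadgroup rwsem
         * instead of write-locking the global cgroup_threadgroup_rwsem.
         * Set when favordynmods is enabled.
         *
         * Writes are protected by both cgroup_mutex and the write-lock of
         * cgroup_threadgroup_rwsem, so reads can be done under either.
         */
        bool cgroup_enable_per_threadgroup_rwsem;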
> void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
> {
> bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
>
> - /* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
> + /*
> + * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
> + * favordynmods can flip while task is between
> + * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end,
> + * so down_write global cgroup_threadgroup_rwsem to synchronize them.
Maybe: take_per_threadgroup_rwsem must not be flipped while threads are
between cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end.
down_write global cgroup_threadgroup_rwsem to exclude them.
But does this actually work? It works for turning it on. I don't think it'd
work for turning it off, right? Maybe make it enable only and trigger a
warning message when people try to turn it off?
> + */
> + percpu_down_write(&cgroup_threadgroup_rwsem);
> if (favor && !favoring) {
> + take_per_threadgroup_rwsem++;
Given that favoring is gating the switch, this can be a bool, right?
> rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
> root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
> } else if (!favor && favoring) {
> + take_per_threadgroup_rwsem--;
And here, you can trigger a warning that per_threadgroup operation can't be
disabled once enabled instead of actually turning it off.
Another alternative would be using a task flag to track whether %current is
holding per_threadgroup_rwsem and then using that to decide whether to
unlock. Maybe that's cleaner but I don't think it really matters here.
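For the enable-only direction, the hunk could look roughly like the
following (untested sketch, just to show the shape):

        percpu_down_write(&cgroup_threadgroup_rwsem);
        if (favor && !favoring) {
                /* one-way switch, see the warning in the other branch */
                cgroup_enable_per_threadgroup_rwsem = true;
                rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
                root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
        } else if (!favor && favoring) {
                pr_warn_once("cgroup: per threadgroup rwsem stays enabled after favordynmods is cleared\n");
                rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
                root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
        }
        percpu_up_write(&cgroup_threadgroup_rwsem);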
..
> + * When favordynmods is enabled, take per threadgroup rwsem to reduce latencies
> + * on dynamic cgroup modifications. see the comment above
> + * CGRP_ROOT_FAVOR_DYNMODS definition.
This is more about scalability, right? Maybe just say overhead?
> @@ -2976,24 +3003,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
> if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
> return ERR_PTR(-EINVAL);
>
> - /*
> - * If we migrate a single thread, we don't care about threadgroup
> - * stability. If the thread is `current`, it won't exit(2) under our
> - * hands or change PID through exec(2). We exclude
> - * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> - * callers by cgroup_mutex.
> - * Therefore, we can skip the global lock.
> - */
> - lockdep_assert_held(&cgroup_mutex);
> - *threadgroup_locked = pid || threadgroup;
> - cgroup_attach_lock(*threadgroup_locked);
> -
> +retry_find_task:
> rcu_read_lock();
> if (pid) {
> tsk = find_task_by_vpid(pid);
> if (!tsk) {
> tsk = ERR_PTR(-ESRCH);
> - goto out_unlock_threadgroup;
> + goto out_unlock_rcu;
> }
> } else {
> tsk = current;
> @@ -3010,15 +3026,42 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
> */
> if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
> tsk = ERR_PTR(-EINVAL);
> - goto out_unlock_threadgroup;
> + goto out_unlock_rcu;
> }
> -
> get_task_struct(tsk);
> - goto out_unlock_rcu;
>
> -out_unlock_threadgroup:
> - cgroup_attach_unlock(*threadgroup_locked);
> - *threadgroup_locked = false;
> + rcu_read_unlock();
> +
> + /*
> + * If we migrate a single thread, we don't care about threadgroup
> + * stability. If the thread is `current`, it won't exit(2) under our
> + * hands or change PID through exec(2). We exclude
> + * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> + * callers by cgroup_mutex.
> + * Therefore, we can skip the global lock.
> + */
> + lockdep_assert_held(&cgroup_mutex);
> + *threadgroup_locked = pid || threadgroup;
> +
> + cgroup_attach_lock(*threadgroup_locked, tsk);
> +
> + if (threadgroup) {
> + if (!thread_group_leader(tsk)) {
> + /*
> + * a race with de_thread from another thread's exec()
> + * may strip us of our leadership, if this happens,
> + * there is no choice but to throw this task away and
> + * try again; this is
> + * "double-double-toil-and-trouble-check locking".
> + */
> + cgroup_attach_unlock(*threadgroup_locked, tsk);
> + put_task_struct(tsk);
> + goto retry_find_task;
This is subtle. Can you please split this out into a separate patch?
Thanks.
--
tejun