Message-ID: <aL7-4XQxeFKtFWlq@slm.duckdns.org>
Date: Mon, 8 Sep 2025 06:05:53 -1000
From: Tejun Heo <tj@...nel.org>
To: Yi Tao <escape@...ux.alibaba.com>
Cc: hannes@...xchg.org, mkoutny@...e.com, cgroups@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/1] cgroup: replace global percpu_rwsem with per
 threadgroup rwsem when writing to cgroup.procs

Hello,

On Mon, Sep 08, 2025 at 06:20:27PM +0800, Yi Tao wrote:
...
> The static usage pattern of creating a cgroup, enabling controllers,
> and then seeding it with CLONE_INTO_CGROUP doesn't require write
> locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
> patch.

Please bring this to the top and note that this is the default mode of
operation, and that the mechanism being introduced is thus an optional one.
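
For reference, that static pattern is roughly the following (illustrative
userspace sketch only; the helper name and path are made up, error handling
is minimal, and it assumes clone3()/CLONE_INTO_CGROUP from kernel >= 5.7
headers):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */

/* illustrative helper: seed a freshly created cgroup at fork time, so
 * cgroup.procs is never written and the global rwsem is never write-locked */
static pid_t fork_into_cgroup(const char *cgrp_path)
{
	int cgrp_fd = open(cgrp_path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	struct clone_args args = {
		.flags	     = CLONE_INTO_CGROUP,
		.exit_signal = SIGCHLD,
		.cgroup	     = (unsigned long long)cgrp_fd,
	};

	if (cgrp_fd < 0)
		return -1;

	/* no glibc wrapper for clone3(), invoke it directly */
	return syscall(__NR_clone3, &args, sizeof(args));
}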

> @@ -88,7 +88,8 @@ enum {
>  	/*
>  	 * Reduce latencies on dynamic cgroup modifications such as task
>  	 * migrations and controller on/offs by disabling percpu operation on
> -	 * cgroup_threadgroup_rwsem. This makes hot path operations such as
> +	 * cgroup_threadgroup_rwsem and taking per threadgroup rwsem when
> +	 * writing to cgroup.procs. This makes hot path operations such as
>  	 * forks and exits into the slow path and more expensive.

This comment is pointed to from all the other places. Please expand on why
the per-threadgroup rwsem is beneficial and for which use cases.
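
e.g. something along these lines (wording is just a sketch of the direction,
adjust as you see fit):

	/*
	 * Reduce latencies on dynamic cgroup modifications such as task
	 * migrations and controller on/offs by disabling percpu operation on
	 * cgroup_threadgroup_rwsem and by taking a per threadgroup rwsem when
	 * writing to cgroup.procs.  A migration only needs to stabilize the
	 * target threadgroup, so frequent cgroup.procs writers no longer
	 * serialize against forks and exits in unrelated threadgroups.  This
	 * makes hot path operations such as forks and exits into the slow
	 * path and more expensive.
	 */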

>  	 * The static usage pattern of creating a cgroup, enabling controllers,
> @@ -828,16 +829,21 @@ struct cgroup_of_peak {
>  	struct list_head	list;
>  };
>  
> +extern int take_per_threadgroup_rwsem;

I think it needs cgroup in its name. Maybe something like
cgroup_enable_per_threadgroup_rwsem?

> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 312c6a8b55bb..8650ec394d0c 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1298,18 +1298,29 @@ struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root)
>  	return root_cgrp->root;
>  }
>  
> +int take_per_threadgroup_rwsem;

Please put it where the other global variables are and also note what it
does, and that writes are protected by cgroup_mutex and the write lock of
cgroup_threadgroup_rwsem, so that reads are protected by either.
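
IOW something like (name per the earlier suggestion, comment wording
illustrative):

/*
 * When favordynmods is enabled, writers to cgroup.procs take a per
 * threadgroup rwsem instead of write-locking the global
 * cgroup_threadgroup_rwsem.
 *
 * Writes are protected by cgroup_mutex and the write lock of
 * cgroup_threadgroup_rwsem, so readers holding either are safe.
 */
bool cgroup_enable_per_threadgroup_rwsem;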

>  void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
>  {
>  	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
>  
> -	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
> +	/*
> +	 * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
> +	 * favordynmods can flip while task is between
> +	 * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end,
> +	 * so down_write global cgroup_threadgroup_rwsem to synchronize them.

Maybe: take_per_threadgroup_rwsem must not be flipped while threads are
between cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end.
down_write global cgroup_threadgroup_rwsem to exclude them.

But does this actually work? It works for turning it on, but I don't think
it'd work for turning it off, right? Maybe make it enable-only and trigger a
warning message when people try to turn it off?

> +	 */
> +	percpu_down_write(&cgroup_threadgroup_rwsem);
>  	if (favor && !favoring) {
> +		take_per_threadgroup_rwsem++;

Given that favoring is gating the switch, this can be a bool, right?

>  		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
>  		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
>  	} else if (!favor && favoring) {
> +		take_per_threadgroup_rwsem--;

And here, you can trigger a warning that per_threadgroup operation can't be
disabled once enabled, instead of actually turning it off.
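
IOW, something like the following (untested; it assumes the matching
percpu_up_write() after the if, which isn't in the quoted hunk, and whether
the rest of the !favor path should still run is up to you):

void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
{
	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;

	percpu_down_write(&cgroup_threadgroup_rwsem);
	if (favor && !favoring) {
		cgroup_enable_per_threadgroup_rwsem = true;
		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
	} else if (!favor && favoring) {
		/*
		 * Per-threadgroup locking can't be safely undone once
		 * threads may be relying on it, so warn and leave it on.
		 */
		pr_warn("cgroup: per-threadgroup rwsem stays enabled after disabling favordynmods\n");
		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
	}
	percpu_up_write(&cgroup_threadgroup_rwsem);
}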

Another alternative would be using a task flag to track whether %current is
holding per_threadgroup_rwsem and then using that to decide whether to
unlock. Maybe that's cleaner but I don't think it really matters here.

...
> + * When favordynmods is enabled, take per threadgroup rwsem to reduce latencies
> + * on dynamic cgroup modifications. see the comment above
> + * CGRP_ROOT_FAVOR_DYNMODS definition.

This is more about scalability, right? Maybe just say overhead?

> @@ -2976,24 +3003,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>  	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
>  		return ERR_PTR(-EINVAL);
>  
> -	/*
> -	 * If we migrate a single thread, we don't care about threadgroup
> -	 * stability. If the thread is `current`, it won't exit(2) under our
> -	 * hands or change PID through exec(2). We exclude
> -	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> -	 * callers by cgroup_mutex.
> -	 * Therefore, we can skip the global lock.
> -	 */
> -	lockdep_assert_held(&cgroup_mutex);
> -	*threadgroup_locked = pid || threadgroup;
> -	cgroup_attach_lock(*threadgroup_locked);
> -
> +retry_find_task:
>  	rcu_read_lock();
>  	if (pid) {
>  		tsk = find_task_by_vpid(pid);
>  		if (!tsk) {
>  			tsk = ERR_PTR(-ESRCH);
> -			goto out_unlock_threadgroup;
> +			goto out_unlock_rcu;
>  		}
>  	} else {
>  		tsk = current;
> @@ -3010,15 +3026,42 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>  	 */
>  	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
>  		tsk = ERR_PTR(-EINVAL);
> -		goto out_unlock_threadgroup;
> +		goto out_unlock_rcu;
>  	}
> -
>  	get_task_struct(tsk);
> -	goto out_unlock_rcu;
>  
> -out_unlock_threadgroup:
> -	cgroup_attach_unlock(*threadgroup_locked);
> -	*threadgroup_locked = false;
> +	rcu_read_unlock();
> +
> +	/*
> +	 * If we migrate a single thread, we don't care about threadgroup
> +	 * stability. If the thread is `current`, it won't exit(2) under our
> +	 * hands or change PID through exec(2). We exclude
> +	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> +	 * callers by cgroup_mutex.
> +	 * Therefore, we can skip the global lock.
> +	 */
> +	lockdep_assert_held(&cgroup_mutex);
> +	*threadgroup_locked = pid || threadgroup;
> +
> +	cgroup_attach_lock(*threadgroup_locked, tsk);
> +
> +	if (threadgroup) {
> +		if (!thread_group_leader(tsk)) {
> +			/*
> +			 * a race with de_thread from another thread's exec()
> +			 * may strip us of our leadership, if this happens,
> +			 * there is no choice but to throw this task away and
> +			 * try again; this is
> +			 * "double-double-toil-and-trouble-check locking".
> +			 */
> +			cgroup_attach_unlock(*threadgroup_locked, tsk);
> +			put_task_struct(tsk);
> +			goto retry_find_task;

This is subtle. Can you please split this out into a separate patch?

Thanks.

-- 
tejun
