linux-kernel - Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <93b146e5-08e4-4c10-bb90-8149e82263f0@linux.alibaba.com>
Date: Fri, 5 Sep 2025 11:44:53 +0800
From: escape <escape@...ux.alibaba.com>
To: Tejun Heo <tj@...nel.org>
Cc: hannes@...xchg.org, mkoutny@...e.com, cgroups@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with
 signal_struct->group_rwsem when writing cgroup.procs/threads

Hello,

在 2025/9/5 10:27, Tejun Heo 写道:
> Hello,
>
> On Fri, Sep 05, 2025 at 10:16:30AM +0800, escape wrote:
>>>> +	if (have_favordynmods)
>>>> +		up_read(&tsk->signal->group_rwsem);
>>>>    	percpu_up_read(&cgroup_threadgroup_rwsem);
>>> Hmm... I wonder whether turning on/off the flag is racy. ie. what prevents
>>> have_favordynmods flipping while a task is between change_begin and end?
>> have_favordynmods is read-only after initialization and will not change
>> during runtime.
> I don't think that's necessarily true. favordynmods can also be specified as
> a mount option and mount can race against forks, execs and exits. Also,
> IIRC, cgroup2 doesn't allow remounts but there's nothing preventing someone
> from unmounting and mounting it again with different options.
There are two ways to set favordynmods. The first is through kernel 
config or
the kernel command line parameter cgroup_favordynmods, which sets the 
value of
the variable have_favordynmods and automatically sets the 
CGRP_ROOT_FAVOR_DYNMODS
flag on cgroup_root->flags at mount time. The second is by passing a 
mount option,
which only sets the CGRP_ROOT_FAVOR_DYNMODS flag on cgroup_root->flags 
during mount.

In this patch, the decision whether to use the per-threadgroup rwsem is 
based on
have_favordynmods, not on cgroup_root->flags. An umount & mount affects 
cgroup_root->flags,
but does not change have_favordynmods.

Indeed, there is a minor flaw here: if favordynmods is enabled via a 
mount option rather
than the kernel command line, the per-threadgroup rwsem will not take 
effect.

To avoid the complexity of runtime changes to favordynmods, perhaps a 
better approach
would be to introduce a separate control variable, configured via a 
kernel command line
parameter such as cgroup_migration_lock=[global|thread_group] to 
explicitly govern
this behavior. What do you think about this approach?
>>>> @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>>>>    	 */
>>>>    	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
>>>>    		tsk = ERR_PTR(-EINVAL);
>>>> -		goto out_unlock_threadgroup;
>>>> +		goto out_unlock_rcu;
>>>>    	}
>>>> -
>>>>    	get_task_struct(tsk);
>>>> -	goto out_unlock_rcu;
>>>> -out_unlock_threadgroup:
>>>> -	cgroup_attach_unlock(*threadgroup_locked);
>>>> -	*threadgroup_locked = false;
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	/*
>>>> +	 * If we migrate a single thread, we don't care about threadgroup
>>>> +	 * stability. If the thread is `current`, it won't exit(2) under our
>>>> +	 * hands or change PID through exec(2). We exclude
>>>> +	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
>>>> +	 * callers by cgroup_mutex.
>>>> +	 * Therefore, we can skip the global lock.
>>>> +	 */
>>>> +	lockdep_assert_held(&cgroup_mutex);
>>>> +	*threadgroup_locked = pid || threadgroup;
>>>> +
>>>> +	cgroup_attach_lock(tsk, *threadgroup_locked);
>>> I'm not sure this relocation is safe. What prevents e.g. @tsk changing its
>>> group leader or signal struct before lock is grabbed?
>> When a non-leader thread in a thread group executes the exec system call,
>> the thread group leader is updated, but the signal_struct remains unchanged,
>> so this part is safe.
> But the leader can change, right? So, we can end up in a situation where
> threadgroup is set but the task is not the leader which I think can lead to
> really subtle incorrect behaviors like write succeeding but nothing
> happening when racing against exec.
>
> Thanks.
>
You are right. If the thread group leader changes, the task passed to
cgroup_attach_task may not be the leader, which could lead to errors when
iterating through all threads in while_each_thread.


Similar to commit b78949ebfb56 ("cgroup: simplify double-check locking in
cgroup_attach_proc"), it should check whether the leader has changed after
acquiring the lock, to determine if a retry is needed. I will fix this in
the next version patch.

Thanks.