linux-kernel - Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group


Message-ID: <aLk_o0GUhC14T8f9@slm.duckdns.org>
Date: Wed, 3 Sep 2025 21:28:35 -1000
From: Tejun Heo <tj@...nel.org>
To: escape <escape@...ux.alibaba.com>
Cc: hannes@...xchg.org, mkoutny@...e.com, cgroups@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] cgroup: replace global percpu_rwsem with
 signal_struct->group_rwsem when writing cgroup.procs/threads

Hello,

On Thu, Sep 04, 2025 at 11:15:26AM +0800, escape wrote:
> On 2025/9/4 00:53, Tejun Heo wrote:
> > Hello,
...
> As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP
> still requires holding the threadgroup_rwsem, so contention with fork
> operations persists.

Sorry for repeatedly fumbling the explanation, but this isn't true. On
cgroup2, if you create a cgroup, enable controllers and then seed it with
CLONE_INTO_CGROUP, threadgroup_rwsem is out of the picture. The only
remaining contention point is cgroup_mutex.
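
As a rough illustration of the "seed it with CLONE_INTO_CGROUP" step, the
following is a minimal userspace sketch, not code from this thread. It
assumes a cgroup2 directory that already exists with its controllers
enabled, and kernel headers new enough to provide the cgroup field of
struct clone_args. The helper names fill_clone_args() and
spawn_into_cgroup() are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/sched.h>   /* struct clone_args, CLONE_INTO_CGROUP */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: prepare clone3() arguments for a fork-like child
 * that is created directly inside the cgroup referred to by cgroup_fd.
 */
static void fill_clone_args(struct clone_args *args, int cgroup_fd)
{
	memset(args, 0, sizeof(*args));
	args->flags = CLONE_INTO_CGROUP;
	args->exit_signal = SIGCHLD;
	args->cgroup = (uint64_t)cgroup_fd;
}

/*
 * Hypothetical helper: returns the child's PID in the parent, 0 in the
 * child, and -1 with errno set on failure.
 */
static pid_t spawn_into_cgroup(const char *cgroup_dir)
{
	struct clone_args args;
	long ret;
	int saved_errno;
	int fd = open(cgroup_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);

	if (fd < 0)
		return -1;

	fill_clone_args(&args, fd);
	ret = syscall(SYS_clone3, &args, sizeof(args));

	/* Both parent and child drop the directory fd; keep errno intact. */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return (pid_t)ret;
}
```

A caller would typically exec the workload when spawn_into_cgroup()
returns 0, so the task never has to be migrated after the fact.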

> CLONE_INTO_CGROUP helps alleviate the contention between cgroup creation
> and deletion, but its usage comes with significant limitations:
> 
> 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2
> adoption is gradually increasing, many applications have not yet been
> adapted to cgroup v2, and phasing out cgroup v1 will be a long and
> gradual process.
> 
> 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the
> time of process fork, effectively restricting cgroup migration to the
> fork stage. This differs significantly from the typical cgroup attach
> workflow. For example, in Kubernetes, systemd is the recommended cgroup
> driver; kubelet communicates with systemd via D-Bus, and systemd
> performs the actual cgroup attachment. In this case, the process being
> attached typically does not have systemd as its parent. Using
> CLONE_INTO_CGROUP in such a scenario is impractical and would require
> coordinated changes to both systemd and kubelet.

A percpu rwsem (threadgroup_rwsem) was used instead of per-threadgroup
locking to avoid adding overhead to hot paths - fork and exit - because
cgroup operations were expected to be a lot colder. That said,
threadgroup_rwsem is *really* expensive for writers, so the trade-off could
be a bit too extreme for some use cases.

However, now that the most common usage pattern doesn't involve
threadgroup_rwsem, I don't feel too enthusiastic about adding hot path
overhead to work around usage patterns that we want to move away from. Note
that dynamic migrations have other, more fundamental problems for stateful
resources and we generally want to move away from them. Sure, a single rwsem
operation in fork/exit isn't a lot of overhead but it isn't nothing either
and this will impact everybody.

Maybe we can make it a mount option so that use cases that still depend on
it can toggle it on? In fact, there's already the favordynmods mount option
which seems like a good fit. Maybe put the extra locking behind that flag?
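
For reference, a minimal sketch of how a user would opt in to the
existing favordynmods option when mounting cgroup2. It only shows how the
flag is passed today; the extra locking proposed above is not implied to
exist behind it. The helper name mount_cgroup2_favordynmods() is
hypothetical, and the call needs sufficient privileges to mount.

```c
#include <sys/mount.h>

/*
 * Hypothetical helper: mount a cgroup2 hierarchy at target with the
 * favordynmods option. Returns 0 on success, -1 with errno set on
 * failure (for example when unprivileged or when target is missing).
 */
static int mount_cgroup2_favordynmods(const char *target)
{
	return mount("none", target, "cgroup2", 0, "favordynmods");
}
```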

Thanks.

-- 
tejun