Message-ID: <8d406a89-3937-4fa2-84a9-624db3c75c76@linux.alibaba.com>
Date: Thu, 4 Sep 2025 16:10:17 +0800
From: escape <escape@...ux.alibaba.com>
To: Tejun Heo <tj@...nel.org>
Cc: hannes@...xchg.org, mkoutny@...e.com, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] cgroup: replace global percpu_rwsem with
signal_struct->group_rwsem when writing cgroup.procs/threads
On 2025/9/4 15:28, Tejun Heo wrote:
> Hello,
>
> On Thu, Sep 04, 2025 at 11:15:26AM +0800, escape wrote:
>> On 2025/9/4 00:53, Tejun Heo wrote:
>>> Hello,
> ...
>> As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP
>> still requires holding the threadgroup_rwsem, so contention with fork
>> operations persists.
> Sorry for repeatedly fumbling my explanations, but this isn't true. On
> cgroup2, if you create a cgroup, enable controllers and then seed it with
> CLONE_INTO_CGROUP, threadgroup_rwsem is out of the picture. The only
> remaining contention point is cgroup_mutex.
>
>> CLONE_INTO_CGROUP helps alleviate the contention between cgroup creation
>> and deletion, but its usage comes with significant limitations:
>>
>> 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2
>> adoption is gradually increasing, many applications have not yet been
>> adapted to cgroup v2, and phasing out cgroup v1 will be a long and
>> gradual process.
>>
>> 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the
>> time of process fork, effectively restricting cgroup migration to the
>> fork stage. This differs significantly from the typical cgroup attach
>> workflow. For example, in Kubernetes, systemd is the recommended cgroup
>> driver; kubelet communicates with systemd via D-Bus, and systemd
>> performs the actual cgroup attachment. In this case, the process being
>> attached typically does not have systemd as its parent. Using
>> CLONE_INTO_CGROUP in such a scenario is impractical and would require
>> coordinated changes to both systemd and kubelet.
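For contrast, the typical attach workflow mentioned above boils down to
writing the target pid into the destination cgroup.procs, which is the
write that currently takes threadgroup_rwsem. A minimal sketch (path and
helper name are illustrative only):

    #include <stdio.h>
    #include <sys/types.h>

    /* Migrate an existing process by writing its pid into the
     * destination cgroup's cgroup.procs file.
     */
    static int attach_pid(const char *cgrp, pid_t pid)
    {
        char path[4096];
        FILE *f;

        snprintf(path, sizeof(path), "%s/cgroup.procs", cgrp);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%d\n", (int)pid);
        return fclose(f) == 0 ? 0 : -1;
    }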
> A percpu rwsem (threadgroup_rwsem) was used instead of per-threadgroup
> locking to avoid adding overhead to hot paths - fork and exit - because
> cgroup operations were expected to be a lot colder. Now, threadgroup rwsem
> is *really* expensive for the writers, so the trade-off could be a bit too
> extreme for some use cases.
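(For reference, the fork/exit hot-path side being protected here is
cgroup_threadgroup_change_begin()/_end(), which are roughly just a
per-cpu read lock/unlock on cgroup_threadgroup_rwsem; sketched from
memory, so treat the exact definition as approximate:

    /* Reader side taken on every fork: nearly free. The cost falls on
     * percpu_down_write(), which must wait out an RCU grace period and
     * synchronize against all CPUs, hence the expensive writers.
     */
    static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
    {
        percpu_down_read(&cgroup_threadgroup_rwsem);
    }
)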
>
> However, now that the most common usage pattern doesn't involve
> threadgroup_rwsem, I don't feel too enthusiastic about adding hot path
> overhead to work around usage patterns that we want to move away from. Note
> that dynamic migrations have other more fundamental problems for stateful
> resources and we generally want to move away from it. Sure, a single rwsem
> operation in fork/exit isn't a lot of overhead but it isn't nothing either
> and this will impact everybody.
>
> Maybe we can make it a mount option so that use cases that still depend on
> it can toggle it on? In fact, there's already favordynmods mount option
> which seems like a good fit. Maybe put the extra locking behind that flag?
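(For reference, favordynmods already exists as a cgroup2 mount option; a
userspace sketch of enabling it at mount time, target path illustrative:

    #include <sys/mount.h>

    /* Mount the cgroup2 hierarchy with favordynmods enabled. */
    static int mount_cgroup2_favordynmods(const char *target)
    {
        return mount("none", target, "cgroup2", 0, "favordynmods");
    }
)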
>
> Thanks.
>
Thank you for your reply.

I agree that mounting cgroup v2, creating a cgroup, enabling controllers,
and then seeding it with CLONE_INTO_CGROUP is an excellent solution.
However, we have encountered significant obstacles applying this approach
on some older systems. We are also working on enabling CLONE_INTO_CGROUP
support in runc, systemd, and other components, but that will take time.
This patch aims to alleviate the issue during the transitional period.
Regarding the impact of the extra rwsem operations on hot paths, I have
run performance tests: when there is no contention on down_write, the
UnixBench spawn test scores are unaffected.

Using the favordynmods flag to control whether the extra rwsem is taken
is an excellent suggestion. I will incorporate that condition in the next
version of the patch.
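A rough sketch of the gating I have in mind, purely illustrative and not
the actual next version (group_rwsem is the per-threadgroup rwsem this
patch adds to signal_struct; the helper name is made up):

    /* Writer side only, illustrative: take the per-threadgroup rwsem
     * when favordynmods is enabled on the default hierarchy, otherwise
     * fall back to the global percpu rwsem as today. The fork-side
     * read lock would be gated the same way.
     */
    static void threadgroup_write_lock(struct task_struct *tsk)
    {
        if (cgrp_dfl_root.flags & CGRP_ROOT_FAVOR_DYNMODS)
            down_write(&tsk->signal->group_rwsem);
        else
            percpu_down_write(&cgroup_threadgroup_rwsem);
    }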
Thanks.