Message-ID: <8d406a89-3937-4fa2-84a9-624db3c75c76@linux.alibaba.com>
Date: Thu, 4 Sep 2025 16:10:17 +0800
From: escape <escape@...ux.alibaba.com>
To: Tejun Heo <tj@...nel.org>
Cc: hannes@...xchg.org, mkoutny@...e.com, cgroups@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH] cgroup: replace global percpu_rwsem with
 signal_struct->group_rwsem when writing cgroup.procs/threads



On 2025/9/4 15:28, Tejun Heo wrote:
> Hello,
>
> On Thu, Sep 04, 2025 at 11:15:26AM +0800, escape wrote:
>> On 2025/9/4 00:53, Tejun Heo wrote:
>>> Hello,
> ...
>> As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP
>> still requires holding the threadgroup_rwsem, so contention with fork
>> operations persists.
> Sorry about my fumbling explanations repeatedly but this isn't true. On
> cgroup2, if you create a cgroup, enable controllers and then seed it with
> CLONE_INTO_CGROUP, threadgroup_rwsem is out of the picture. The only
> remaining contention point is cgroup_mutex.
>
>> CLONE_INTO_CGROUP helps alleviate the contention between cgroup creation
>> and deletion, but its usage comes with significant limitations:
>>
>> 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2
>> adoption is gradually increasing, many applications have not yet been
>> adapted to cgroup v2, and phasing out cgroup v1 will be a long and
>> gradual process.
>>
>> 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the
>> time of process fork, effectively restricting cgroup migration to the
>> fork stage. This differs significantly from the typical cgroup attach
>> workflow. For example, in Kubernetes, systemd is the recommended cgroup
>> driver; kubelet communicates with systemd via D-Bus, and systemd
>> performs the actual cgroup attachment. In this case, the process being
>> attached typically does not have systemd as its parent. Using
>> CLONE_INTO_CGROUP in such a scenario is impractical and would require
>> coordinated changes to both systemd and kubelet.
> A percpu rwsem (threadgroup_rwsem) was used instead of per-threadgroup
> locking to avoid adding overhead to hot paths - fork and exit - because
> cgroup operations were expected to be a lot colder. Now, threadgroup rwsem
> is *really* expensive for the writers, so the trade-off could be a bit too
> extreme for some use cases.
>
> However, now that the most common usage pattern doesn't involve
> threadgroup_rwsem, I don't feel too enthusiastic about adding hot path
> overhead to work around usage patterns that we want to move away from. Note
> that dynamic migrations have other more fundamental problems for stateful
> resources and we generally want to move away from it. Sure, a single rwsem
> operation in fork/exit isn't a lot of overhead but it isn't nothing either
> and this will impact everybody.
>
> Maybe we can make it a mount option so that use cases that still depend on
> it can toggle it on? In fact, there's already favordynmods mount option
> which seems like a good fit. Maybe put the extra locking behind that flag?
>
> Thanks.
>
Thank you for your reply.

I agree that mounting cgroup v2, creating a cgroup, enabling controllers,
and then seeding it with CLONE_INTO_CGROUP is an excellent solution.
However, we have encountered significant obstacles applying this approach
on some older systems. We are working in parallel on enabling
CLONE_INTO_CGROUP support in runc, systemd, and other components, but
this will take some time. This patch aims to alleviate the issue to some
extent during the transitional period.
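For reference, the cgroup2 seeding path discussed above can be sketched
with clone3(2). This is only a minimal illustration (the helper names are
mine, and it assumes a kernel with CLONE_INTO_CGROUP, i.e. >= 5.7, and a
cgroup2 directory to open):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/sched.h>   /* struct clone_args, CLONE_INTO_CGROUP */
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Prepare clone3() arguments so the child starts directly in the
 * cgroup referred to by cgroup_fd, instead of being forked and then
 * migrated via a cgroup.procs write (which takes threadgroup_rwsem). */
static void prep_clone_into_cgroup(struct clone_args *args, int cgroup_fd)
{
	memset(args, 0, sizeof(*args));
	args->flags = CLONE_INTO_CGROUP;
	args->exit_signal = SIGCHLD;
	args->cgroup = cgroup_fd;
}

/* Fork a child straight into the cgroup at cgrp_path, e.g. a cgroup2
 * directory under /sys/fs/cgroup. Returns the child pid in the parent,
 * 0 in the child, -1 on error (same convention as clone3 itself). */
static pid_t seed_child_into(const char *cgrp_path)
{
	struct clone_args args;
	pid_t pid;
	int cgfd = open(cgrp_path, O_RDONLY | O_DIRECTORY);

	if (cgfd < 0)
		return -1;
	prep_clone_into_cgroup(&args, cgfd);
	pid = syscall(SYS_clone3, &args, sizeof(args));
	close(cgfd);	/* fd is not needed after the clone, in either task */
	return pid;
}
```

With controllers enabled on the target cgroup beforehand, the child never
goes through the cgroup.procs attach path, which is why threadgroup_rwsem
drops out of the picture in that workflow.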

Regarding the impact of the extra rwsem operations on hot paths, I have
run performance tests: when there is no contention on down_write, the
UnixBench spawn test scores remain unaffected.

The suggestion to use the favordynmods flag to control whether the extra
rwsem is used is excellent. I will incorporate this condition in the next
version of the patch.
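For context, favordynmods is an existing cgroup2 mount option that can
also be toggled via remount, so from userspace the gating would look
roughly like this (a sketch; the mount point is illustrative):

```shell
# Mount cgroup2 with behavior favoring dynamic migration
mount -t cgroup2 -o favordynmods none /sys/fs/cgroup

# Or toggle it on an existing cgroup2 mount
mount -o remount,favordynmods /sys/fs/cgroup
```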

Thanks.

