linux-kernel - Re: [PATCH v2] cgroup: avoid css_set_lock in cgroup_css_set

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <CAGudoHFaUjm7_Eh6VOOGvfscdekk7v2uNPjfLkZfAkR9aCA1Ew@mail.gmail.com>
Date: Tue, 27 Jan 2026 19:18:44 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Michal Koutný <mkoutny@...e.com>
Cc: tj@...nel.org, hannes@...xchg.org, brauner@...nel.org, 
	linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: [PATCH v2] cgroup: avoid css_set_lock in cgroup_css_set_fork()

On Tue, Jan 27, 2026 at 6:27 PM Michal Koutný <mkoutny@...e.com> wrote:
>
> On Tue, Jan 27, 2026 at 03:30:14PM +0100, Mateusz Guzik <mjguzik@...il.com> wrote:
> > ping? I need cgroups out of the way for further scalability work in fork+ exit
>
> I got stuck on following:
> - possible implementation with (simpler?) rwlock,
> - effect of css_set_lock in cgroup_post_fork().
>
> I want to try some measurements of the latter since I assume that
> limits the effect of the elision in cgroup_css_set_fork(), doesn't it?
> (IIUC, you'd see it again if you reduced the pidmap_lock contention.)
>

Not sure what you mean here.

If what you mean is merely converting css_set_lock into a rwlock  and
read-locking in the spot handled with seq in my patch that's a no-go
-- frequent reader vs writer transitions kill perf in their own right
and you still have 3 contention spots, i.e., this will remain as the
primary bottleneck.

To reiterate,  in the kernel as found in next at the moment top of the
profile is still the pidmap lock.

There is a patch to remove most of the overhead under the lock:
https://lore.kernel.org/linux-fsdevel/20260120-work-pidfs-rhashtable-v2-1-d593c4d0f576@kernel.org/

It may need some tidy ups but for the purpose of this discussion we
can pretend it landed.

With that in place top of the profile is the css set lock.

With *this* patch in place we are back to pidmap, which is then
followed by tasklist.

clone + exit codepaths are globally serialized as follows:
- pidmap -- once per side
- tasklist -- once on clone, twice on exit
- cgroups -- twice on clone, once on exit
- some lock for random harvesting, can probably just go

In principle with enough effort(tm) one can introduce finer-grained
locking for all of these, but I suspect it is not warranted and I'm
not going to do it, especially so for the tasklist lock.

So I very much expect the clone + exit pair will remain globally
serialized, it's the question of the nature of said serialization.

I think a sensible goal is serialization at most once per side (if
possible) and even then sanitized hold time.

The tasklist thing may be too problematic to only take twice, but even
then I can trivially reduce the hold time. If 3 spots have to remain,
it will be the new top. If I work it down to two, who knows.

So ye, css very much can be considered a problem here.

This comment:
> - effect of css_set_lock in cgroup_post_fork().

... I don't get whatsoever.

Stability of cgroup placement aside, to my reading the lock is needed
in part to serialize addition of the task to the cgroup list. No
matter what this will have to be serialized both ways with the same
thing.

Perhaps said stability can be assured in other ways and the list can
be decomposed, but that's some complexity which is not warranted.