Message-ID: <CAGudoHFBN1seqAb3_=Ja+9jXP3EDjfkGfvGT6eqSBhB5_mrBWg@mail.gmail.com>
Date: Tue, 10 Feb 2026 18:33:19 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Michal Koutný <mkoutny@...e.com>
Cc: tj@...nel.org, hannes@...xchg.org, brauner@...nel.org,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: [PATCH v2] cgroup: avoid css_set_lock in cgroup_css_set_fork()
On Tue, Feb 10, 2026 at 5:55 PM Michal Koutný <mkoutny@...e.com> wrote:
>
> On Tue, Feb 10, 2026 at 12:19:27PM +0100, Mateusz Guzik <mjguzik@...il.com> wrote:
> > This is going to depend on the scale you test on. I was testing on
> > south of 32. But I also got a minuscule win from removing css set lock
> > as the problem for me, instead everything shifted to tasklist.
>
> To be on the same page -- that means you have nr_cpus >= 32?
>
South means less: fewer than 32 CPUs.
> > Per my other e-mail tasklist lock retains the terrible 3-times locking
> > and it is doing rather expensive work while holding it. It is
> > plausible it happens to be at the top at that scale, but that's only
> > an argument for fixing it. Even if you don't see the css thing at the
> > top at the moment, it will be there once someone(tm) sorts out the
> > tasklist problem.
>
> I did a quick test (with 6.18.8-1.g886f4c4-default), first `perf top`
> while will-it-scale was running:
I don't know what this hash corresponds to.
>
> 74.23% [kernel] [k] native_queued_spin_lock_slowpath
> 6.91% [kernel] [k] intel_idle_irq
> 0.87% [kernel] [k] update_sd_lb_stats.constprop.0
> 0.68% [kernel] [k] _raw_spin_lock
> 0.63% [kernel] [k] clear_page_erms
> 0.56% [kernel] [k] sched_balance_find_dst_group
> 0.40% [kernel] [k] alloc_vmap_area
>
> and then bpftrace for the waiters:
> $ bpftrace -e 'kprobe:native_queued_spin_lock_slowpath {@[arg0]=count();}
> END {for($kv : @) {printf("%s\t%d\n", ksym($kv.0), (int64)$kv.1);} clear(@); }'\
> >bpftrace.out
> $ sort -k2 -r -n bpftrace.out | head | column -t
> pidmap_lock 10482583
> nft_pcpu_tun_ctx 3693517
> css_set_lock 1511164
> input_pool 976252
> tasklist_lock 798578
> nft_pcpu_tun_ctx 481962
> 0xffff8abc3ffd55b0 95371
> 0xffff8a6d3ffd65b0 93686
> 0xffff8a5e218f0840 29501
> 0xffff8a5e451dca40 29421
>
> or measured by cumulative waiting time:
> $ bpftrace -e 'kprobe:native_queued_spin_lock_slowpath {@[cpu]=arg0; @st[cpu]=nsecs;}
> kretprobe:native_queued_spin_lock_slowpath /@[cpu]/ {$lat=nsecs-@st[cpu]; @lats[@[cpu]]=sum($lat);}
> END {for($kv : @lats) {printf("%s\t%d\n", ksym($kv.0), (int64)$kv.1);} clear(@lats); clear(@st); clear(@) }'\
> >bpftrace2.out
>
> $ sort -k2 -r -n bpftrace2.out | head -n15 | column -t
> pidmap_lock 1931209805
> rcu_state 1823286316
> rcu_state 1581455156
> rcu_state 1328804835
> rcu_state 1299517157
> rcu_state 1134101627
> nft_pcpu_tun_ctx 1027837665
> 0xffff8abc3ffd55b0 861441978
> 0xffff8a6d3ffd65b0 850732998
> css_set_lock 520009479
> input_pool 316598763
> tasklist_lock 127161061
> 0xffff8aac40023200 32380418
> 0xffff8a5e002ab600 30194951
> rcu_state 18334578
>
If the only thing you applied is the patchset over at
https://lore.kernel.org/linux-mm/20251206131955.780557-1-mjguzik@gmail.com/
, then this lines up with my own measurements, where I said the pidmap
lock remains dominant.
That thing gets unclogged with a patch by Christian to move pidmap
handling out, which can be found here:
https://lore.kernel.org/all/20260120-work-pidfs-rhashtable-v2-1-d593c4d0f576@kernel.org/
Afterwards it is css_set_lock at the top of the profile.
> Hm, that's interesting; it suggests why I saw no big change with
> css_set_lock in my setup.
>
Regardless of the above, I noted that sorting out this lock does not
meaningfully improve performance; it merely shifts contention to
tasklist afterwards.
>
> Michal