Message-ID: <CH3PR11MB7894190E3C849233A444E5CCF172A@CH3PR11MB7894.namprd11.prod.outlook.com>
Date: Wed, 18 Jun 2025 14:31:53 +0000
From: "Wlodarczyk, Bertrand" <bertrand.wlodarczyk@...el.com>
To: Michal Koutný <mkoutny@...e.com>
CC: "tj@...nel.org" <tj@...nel.org>, "hannes@...xchg.org"
	<hannes@...xchg.org>, "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, JP Kobryn
	<inwardvessel@...il.com>
Subject: RE: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic

Thank you for your time and for your answer.

> The kernel currently faces scalability issues when multiple userspace 
> programs attempt to read cgroup statistics concurrently.

> Does "currently" mean that you didn't observe this before per-subsys split?

By "currently" I mean current mainline at commit e04c78d86a9699d1 (Linux 6.16-rc2). Our customer observed the issue around winter this year, and it is still present.
I believe it has been present for as long as the lock has existed.

> The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush, 
> which prevents access and updates to the statistics of the css from 
> multiple CPUs in parallel.

> I think this description needs some refresh on top of the current mainline (at least after the commit 748922dcfabdd ("cgroup: use subsystem-specific rstat locks to avoid contention") to be clear which lock (and locking functions) is apparently contentious.

The main culprit is css_cgroup_lock in cgroup_rstat_flush. It locks the css even though the main algorithm operates mostly on per-CPU data.
Only the propagation to the parent needs to be locked, and only if the data isn't atomic.
The benchmark results were gathered after patch 748922dcfabdd, on top of commit e04c78d86a9699d1 (Linux 6.16-rc2).
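
To illustrate the idea of propagating to the parent without a lock when the data is atomic, here is a minimal userspace sketch in C11. It is not the code from the patch: the struct layout and the names node, account, flush_cpu and read_total are hypothetical, and plain C11 atomics stand in for the kernel's per-CPU and atomic primitives.

```c
/* Userspace sketch only; NOT the kernel patch. All names are hypothetical. */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 4

struct node {
	struct node *parent;               /* NULL for the root */
	_Atomic uint64_t pending[NR_CPUS]; /* per-CPU deltas not yet flushed */
	_Atomic uint64_t total;            /* aggregated value, shared by all flushers */
};

/* Update path: add to the slot of the CPU doing the accounting. */
static void account(struct node *n, int cpu, uint64_t delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	atomic_fetch_add_explicit(&n->pending[cpu], delta, memory_order_relaxed);
}

/*
 * Flush path: take the pending delta of one CPU and add it to this node
 * and to every ancestor. The exchange hands each delta to exactly one
 * flusher, and the atomic adds let several flushers update the same
 * parent concurrently without holding a lock.
 */
static void flush_cpu(struct node *n, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	uint64_t delta = atomic_exchange_explicit(&n->pending[cpu], 0,
						  memory_order_relaxed);

	for (struct node *p = n; p; p = p->parent)
		atomic_fetch_add_explicit(&p->total, delta, memory_order_relaxed);
}

/* Read the aggregated value of a node. */
static uint64_t read_total(struct node *n)
{
	return atomic_load_explicit(&n->total, memory_order_relaxed);
}
```

The sketch handles a single counter. A stat made of several fields that must be read as one consistent snapshot would need more than independent atomic adds.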

> Notably, performance for memory and I/O rstats remains unchanged, as 
> these are managed in separate submodules.

> Additionally, this patch addresses a race condition detectable in the 
> current mainline by KCSAN in __cgroup_account_cputime, which occurs 
> when attempting to read a single hierarchy from multiple CPUs.

> Could you please extract this fix and send it separately?

Unfortunately, I don't have it. My primary objective was to resolve the performance bottleneck during rstat access for our customer. I found the race condition by accident when I ran my benchmark (provided in the gist) with KCSAN. I didn't investigate the race on its own.

Thanks,
Bertrand