Message-ID: <CH3PR11MB78947E610456E1EDE1EE8B21F172A@CH3PR11MB7894.namprd11.prod.outlook.com>
Date: Wed, 18 Jun 2025 16:05:45 +0000
From: "Wlodarczyk, Bertrand" <bertrand.wlodarczyk@...el.com>
To: JP Kobryn <inwardvessel@...il.com>
CC: "tj@...nel.org" <tj@...nel.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "mkoutny@...e.com" <mkoutny@...e.com>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
Thank you for your time and review.
> The kernel currently faces scalability issues when multiple userspace
> programs attempt to read cgroup statistics concurrently.
>
> The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
> which prevents access and updates to the statistics of the css from
> multiple CPUs in parallel.
>
> Given that rstat operates on a per-CPU basis and only aggregates
> statistics in the parent cgroup, there is no compelling reason why
> these statistics cannot be atomic.
> Have you considered the "tearing" that will occur when writes and reads are happening in parallel?
> The existing state is more of a snapshot approach. Changing the fields involved to atomic and lockless reading/writing can result in inconsistent values, i.e. fieldA might be more current than fieldB.
Yes, I considered it. In our scenario, "tearing" means that we have newer data than the last "snapshot" and the output doesn't add up in
sum_exec_runtime. The word "snapshot" already implies that these stats don't need to be perfectly precise, because we accept outdated state.
I don't consider the tearing issue to be very serious here.
Additionally, while I'm sure tearing can happen, I didn't observe it during the benchmark runs.
That suggests it's either a rare occurrence or that I didn't trigger the right conditions to expose it.
When multiple CPUs try to access rstat, the slowdown caused by css_base_lock is so massive that the atomic change is preferable.
It's better to get an occasionally torn result than to spin CPUs on the lock and disable IRQs for such a long time.
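
To make the trade-off concrete, here is a small userspace sketch (C11 atomics; illustration only, not the kernel code, and utime/stime are just stand-ins for two base stat fields). Each individual load is atomic, so no single field is torn, but the reader can still see the pair in a state where one field is newer than the other:

/* Userspace illustration only: two individually-atomic fields
 * can still yield a "torn" snapshot across fields. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>

static _Atomic uint64_t utime, stime; /* stand-ins for base stat fields */

static void *writer(void *arg)
{
	(void)arg;
	for (uint64_t i = 1; i <= 1000000; i++) {
		/* invariant the writer maintains: utime == stime */
		atomic_store_explicit(&utime, i, memory_order_relaxed);
		atomic_store_explicit(&stime, i, memory_order_relaxed);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	unsigned long torn = 0;

	pthread_create(&t, NULL, writer, NULL);
	for (int i = 0; i < 1000000; i++) {
		uint64_t a = atomic_load_explicit(&utime, memory_order_relaxed);
		uint64_t b = atomic_load_explicit(&stime, memory_order_relaxed);
		if (a != b)	/* snapshot is torn: fields disagree */
			torn++;
	}
	pthread_join(t, NULL);
	printf("torn snapshots observed: %lu\n", torn);
	return 0;
}

That cross-field skew is the worst case here, and for these counters it only means a momentarily inconsistent sum that the next read corrects.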
> By eliminating the lock, each CPU can traverse its rstat hierarchy
> independently, without blocking. Synchronization is achieved during
> parent propagation through atomic operations.
> Even if the tearing scenario mentioned above is acceptable, removing the lock will break synchronization of flushing non-base stat subsystems.
> -static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
> -{
> -	if (ss)
> -		return &ss->rstat_ss_lock;
> This was needed for non-base stat subsystems like memory and io.
From what I could find in the code, the flush with css and cpu arguments is implemented in only two subsystems: memory and io.
Both memory and io already have their own locks for this.
I tested the benchmark (provided in the gist; its rough shape is sketched below) against memory.stat and io.stat hierarchies.
In both cases KCSAN reports no issues, and performance is unchanged compared to commit
e04c78d86a9699d1 (Linux 6.16-rc2).
For CPU stats, performance is much better after the patch.
I can't find a scenario where the lack of these locks triggers any kind of issue.
Maybe you can help me with that?
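
For reference, the reader side of the benchmark looks roughly like this (the exact code is in the gist; the stat file path is whatever hierarchy you point it at):

/* Rough shape of the benchmark: N threads repeatedly
 * open/read/close a cgroup stat file in parallel. */
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

#define NTHREADS 32
#define ITERS    10000

/* adjust to your cgroup hierarchy; memory.stat/io.stat likewise */
static const char *path = "/sys/fs/cgroup/test/cpu.stat";

static void *reader(void *arg)
{
	char buf[4096];

	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			continue;
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, reader, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return 0;
}

I time runs like this against cpu.stat, memory.stat and io.stat before and after the patch to compare.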
Thanks,
Bertrand