Message-ID: <z6twrrv53dzqfpytydpdgjg23dvbxyod6zi2fyeebzduyf4pzi@jqyatn34wi2a>
Date: Wed, 18 Jun 2025 10:17:09 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: "Wlodarczyk, Bertrand" <bertrand.wlodarczyk@...el.com>
Cc: JP Kobryn <inwardvessel@...il.com>, "tj@...nel.org" <tj@...nel.org>,
"hannes@...xchg.org" <hannes@...xchg.org>, "mkoutny@...e.com" <mkoutny@...e.com>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>, "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
On Wed, Jun 18, 2025 at 04:05:45PM +0000, Wlodarczyk, Bertrand wrote:
> Thank you for your time and review.
>
> > The kernel currently faces scalability issues when multiple userspace
> > programs attempt to read cgroup statistics concurrently.
> >
> > The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
> > which prevents access and updates to the statistics of the css from
> > multiple CPUs in parallel.
> >
> > Given that rstat operates on a per-CPU basis and only aggregates
> > statistics in the parent cgroup, there is no compelling reason why
> > these statistics cannot be atomic.
>
> > Have you considered the "tearing" that will occur when writes and reads are happening in parallel?
> > The existing state is more of a snapshot approach. Changing the fields involved to atomic and lockless reading/writing can result in inconsistent values, i.e. fieldA might be more current than fieldB.
>
> Yes, I considered it. In our scenario, "tearing" means that we have newer data than the last "snapshot" and the output doesn't add up in
> sum_exec_runtime. The "snapshot" approach suggests that these stats don't need to be super precise, because we're already accepting outdated state.
> I don't consider the "tearing" issue to be very bad here.
> Additionally, I'm sure that "tearing" can happen, but I didn't observe it during the benchmark runs.
> That suggests it's either a rare occurrence or I didn't trigger the right conditions to expose the issue.
>
> When multiple CPUs try to access rstat, the slowdown caused by css_base_lock is so massive that the atomic change is preferable.
> It's better to get even a "torn" result than to spin CPUs and disable IRQs for such a long time.
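>
> For illustration, a rough sketch of what the accepted inconsistency
> looks like (hypothetical names, not the actual patch code):
>
> 	/* flusher propagating one CPU's deltas into the parent */
> 	atomic64_add(delta_utime, &parent_bstat->utime);
> 	/* a reader sampling here already sees the new utime ... */
> 	atomic64_add(delta_stime, &parent_bstat->stime);
> 	/* ... but still the old stime, so utime + stime can briefly
> 	 * disagree with sum_exec_runtime */
>
> With the spinlock held across the whole flush, readers see the fields
> move together; with per-field atomics they can briefly be out of step,
> and that's the trade-off I'm accepting.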
>
>
> > By eliminating the lock, each CPU can traverse its rstat hierarchy
> > independently, without blocking. Synchronization is achieved during
> > parent propagation through atomic operations.
>
> > Even if the tearing scenario mentioned above is acceptable, removing the lock will break synchronization of flushing non-base stat subsystems.
>
> > -static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
> > -{
> > -	if (ss)
> > -		return &ss->rstat_ss_lock;
>
> > This was needed for non-base stat subsystems like memory and io.
>
> From what I could find in the code, a flush taking css and cpu arguments is implemented in only two subsystems: memory and io.
> Both memory and io have their own locks for this.
> I tested the benchmark (provided in the gist) with the memory.stat and io.stat hierarchies.
> In both cases KCSAN doesn't report any issues, and performance is unchanged compared to commit
> e04c78d86a9699d1 (Linux 6.16-rc2).
> For cpu stats, performance is much better after the patch.
>
> I can't find a scenario where the lack of these locks triggers any kind of issue.
> Maybe you can help me with that?
Please look at *_local and *_pending fields in
mem_cgroup_css_rstat_flush().
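
Those are plain (non-atomic) counters that get read-modify-written and
propagated to the parent during the flush; today the rstat lock is what
serializes concurrent flushers of the same css. Roughly this pattern
(simplified, not the exact memcontrol.c code):

	/* currently serialized by the rstat lock */
	delta = memcg->vmstats->state_pending[i];
	if (delta)
		memcg->vmstats->state_pending[i] = 0;
	/* ... fold in this CPU's delta ... */
	memcg->vmstats->state[i] += delta;
	if (delta && parent)
		parent->vmstats->state_pending[i] += delta;

Two CPUs flushing the same css concurrently can interleave these plain
read-modify-writes and lose or double-count deltas.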