Message-ID: <CH3PR11MB7894D2570AEA9AB67DBF706DF146A@CH3PR11MB7894.namprd11.prod.outlook.com>
Date: Mon, 30 Jun 2025 14:25:27 +0000
From: "Wlodarczyk, Bertrand" <bertrand.wlodarczyk@...el.com>
To: "tj@...nel.org" <tj@...nel.org>
CC: Shakeel Butt <shakeel.butt@...ux.dev>, "hannes@...xchg.org"
<hannes@...xchg.org>, "mkoutny@...e.com" <mkoutny@...e.com>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"inwardvessel@...il.com" <inwardvessel@...il.com>
Subject: RE: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
> > Also the response to the tearing issue explained by JP is not satisfying.
>
> In other words, the claim is: "it's better to stall other cpus in
> spinlock plus disable IRQ every time in order to serve outdated snapshot instead of providing user to the freshest statistics much, much faster".
> In term of statistics, freshest data served fast to the user is, in my opinion, better behavior.
> This is a false choice, I think. e.g. We can easily use seqlock to remove strict synchronization only from user side, right?
Yes, that's a second possibility for solving the problem.
I chose the atomics approach because, in my opinion, incremental statistics are a somewhat natural use case for them.
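To make the trade-off concrete, here is a rough sketch of the two variants (illustrative only, not the actual cgroup_base_stat code; the struct and function names are made up):

/*
 * Seqcount variant: readers retry if they race with a writer, so they
 * never observe a torn 64-bit value; writers must still be serialized
 * against each other (e.g. per-CPU updates or an update-side lock).
 */
struct stat_example {
	seqcount_t	seq;
	u64		val;
};

static void stat_add(struct stat_example *s, u64 delta)
{
	write_seqcount_begin(&s->seq);
	s->val += delta;
	write_seqcount_end(&s->seq);
}

static u64 stat_read(struct stat_example *s)
{
	unsigned int start;
	u64 v;

	do {
		start = read_seqcount_begin(&s->seq);
		v = s->val;
	} while (read_seqcount_retry(&s->seq, start));

	return v;
}

/* Atomics variant: both sides are single atomic ops, no retry loop. */
static atomic64_t val_example;

static void stat_add_atomic(s64 delta)
{
	atomic64_add(delta, &val_example);
}

static s64 stat_read_atomic(void)
{
	return atomic64_read(&val_example);
}

The atomics variant maps directly onto "add a delta, read the current sum", which is why it felt like the natural fit for incremental counters, but a seqcount would avoid torn reads just as well.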
> I wouldn't be addressing this issue if there were no customers
> affected by rstat latency in multi-container multi-cpu scenarios.
> Out of curiosity, can you explain the case that you observed in more detail?
> What were the customer doing?
Single hierarchy, hundreds of containers on one server, multiple independent owners.
Some of them want to have current stats available in their web GUI.
They are hammering the stats for their cgroups.
The server experiences inefficiencies; perf shows a visible percentage of CPU cycles spent in cgroup_rstat_flush.
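For reference, the access pattern on the customer side boils down to many independent readers doing roughly the following (simplified userspace sketch; the cgroup path is just an example, the real deployment has hundreds of cgroups, each polled by its owner):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];

	/* Every read of cpu.stat ends up in an rstat flush on the kernel side. */
	for (;;) {
		int fd = open("/sys/fs/cgroup/example/cpu.stat", O_RDONLY);

		if (fd < 0)
			return 1;
		if (read(fd, buf, sizeof(buf)) < 0) {
			close(fd);
			return 1;
		}
		close(fd);
		usleep(100 * 1000);	/* each owner polls ~10 times per second */
	}
}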
I prepared a benchmark which can serve as an example of the issue faced by the customer:
https://gist.github.com/bwlodarcz/21bbc24813bced8e6ffc9e5ca3150fcc
qemu vm:
+--------------+---------+---------+
|mean (s)      |8dcb0ed8 | patched |
+--------------+---------+---------+
|cpu, KCSAN on |16.13*   |3.75     |
+--------------+---------+---------+
|cpu, KCSAN off|4.45     |0.81     |
+--------------+---------+---------+
*race condition still present
It's not hammering the lock as much as the previous stressor, so the results are better for the for-6.17 branch.
The customer operates at a much bigger scale than the 4 cgroups in the benchmark.
They have workarounds implemented, so it's not that pressing for them right now.
Anyway, I think it's worth trying to improve the scalability situation,
especially since, as far as I can see, there are no downsides.
There are also reports about similar problems in the memory rstats, but I haven't looked into them yet.