Message-ID: <CH3PR11MB7894B117C6A19C547F28EDA1F15DA@CH3PR11MB7894.namprd11.prod.outlook.com>
Date: Mon, 21 Jul 2025 11:48:45 +0000
From: "Wlodarczyk, Bertrand" <bertrand.wlodarczyk@...el.com>
To: "tj@...nel.org" <tj@...nel.org>
CC: Shakeel Butt <shakeel.butt@...ux.dev>, "hannes@...xchg.org"
	<hannes@...xchg.org>, "mkoutny@...e.com" <mkoutny@...e.com>,
	"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"inwardvessel@...il.com" <inwardvessel@...il.com>
Subject: RE: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic

>>> Yeah, I saw the benchmark but I was more curious what actual use case 
>>> would lead to behaviors like that because you'd have to hammer on 
>>> those stats really hard for this to be a problem. In most use cases 
>>> that I'm aware of, the polling frequencies of these stats are >= 
>>> 1sec. I guess the users in your use case were banging on them way 
>>> harder, at least previously.
>> 
>> From what I know, the https://github.com/google/cadvisor instances 
>> deployed on the client machine hammered these stats. Sharing servers 
>> between independent teams or orgs in big corps is frequent. Every 
>> interested party deployed its own, or similar, instance. We can say 
>> just don't do that and be fine, but it will be happening anyway. It's 
>> better to just make rstats more robust.

> I do think this is a valid use case. I just want to get some sense on
> the numbers involved. Do you happen to know what frequency cAdvisor
> was polling the stats at and how many instances were running? The
> numbers don't have to be accurate. I just want to know the ballpark
> numbers.

I'm quoting a colleague here:
"Regarding the frequency at which cAdvisor polls: when rstat_flush is called every 1 ms for each container (100 total), the contention is high. With an interval larger than 5 ms we see less contention in my experiments."

Experiment | CPU    | Cores | Workload           | Containers | FS   | Flush interval | Flush time for 1000 iterations (min / max / avg) | Avg flush latency per iteration
1          | GNR-AP | 128   | Proxy Load v1_lite | 100        | EXT4 | 1 ms           | 1751.91 / 2232.26 / 1919.72 ms                   | 919.72 / 1000 = 0.9197 ms
2          | GNR-AP | 128   | Proxy Load v1_lite | 100        | EXT4 | 1.5 ms         | 1987.79 / 2014.94 / 2001.14 ms                   | 501.14 / 1000 = 0.5011 ms
3          | GNR-AP | 128   | Proxy Load v1_lite | 100        | EXT4 | 1.7 ms         | 2025.42 / 2044.56 / 2036.16 ms                   | 336.16 / 1000 = 0.3362 ms
4          | GNR-AP | 128   | Proxy Load v1_lite | 100        | EXT4 | 2 ms           | 2113.47 / 2120.04 / 2116.33 ms                   | 116.33 / 1000 = 0.1163 ms
5          | GNR-AP | 128   | Proxy Load v1_lite | 100        | EXT4 | 5 ms           | 5160.85 / 5170.68 / 5165.59 ms                   | 165.59 / 1000 = 0.1656 ms
6          | GNR-AP | 128   | Proxy Load v1_lite | 100        | EXT4 | 10 ms          | 10164.12 / 10174.46 / 10168.97 ms                | 168.97 / 1000 = 0.169 ms
7          | GNR-AP | 128   | Proxy Load v1_lite | 100        | EXT4 | 100 ms         | 100165.41 / 100182.40 / 100174.89 ms             | 174.89 / 1000 = 0.1749 ms
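
The last column appears to be derived as (average total time - 1000 * flush interval) / 1000 iterations, i.e. the per-iteration time spent flushing on top of the sleep interval. For example, experiment 1 gives (1919.72 ms - 1000 * 1 ms) / 1000 = 0.92 ms per flush, and experiment 5 gives (5165.59 ms - 5000 ms) / 1000 = 0.166 ms per flush.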
 
It seems that the client had cAdvisor set to a 1 ms polling interval. I have no information on whether that was intentional (it seems too frequent).
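
For context, here is a minimal userspace sketch of the kind of polling loop such a setup effectively runs. It is not the benchmark that produced the table above; the cgroup paths (/sys/fs/cgroup/cont%d/cpu.stat), the 100-container count, the 1000-iteration count, and the default 1 ms interval are assumptions for illustration. Reading cpu.stat triggers an rstat flush, so many such readers running concurrently contend on the rstat lock.

/*
 * Hypothetical reproducer sketch, not the benchmark used above: read
 * cpu.stat for a set of cgroups at a fixed interval (each read flushes
 * rstat) and time 1000 iterations.  Paths and the cgroup naming scheme
 * are assumptions for illustration only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NR_CGROUPS	100
#define NR_ITERS	1000

int main(int argc, char **argv)
{
	long interval_us = argc > 1 ? atol(argv[1]) : 1000;	/* default 1 ms */
	int fds[NR_CGROUPS];
	char path[256], buf[4096];
	struct timespec start, end;

	for (int i = 0; i < NR_CGROUPS; i++) {
		snprintf(path, sizeof(path), "/sys/fs/cgroup/cont%d/cpu.stat", i);
		fds[i] = open(path, O_RDONLY);
		if (fds[i] < 0) {
			perror(path);
			return 1;
		}
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int it = 0; it < NR_ITERS; it++) {
		for (int i = 0; i < NR_CGROUPS; i++) {
			/* Re-read from offset 0; each read triggers a flush. */
			lseek(fds[i], 0, SEEK_SET);
			if (read(fds[i], buf, sizeof(buf)) < 0)
				perror("read");
		}
		usleep(interval_us);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	double total_ms = (end.tv_sec - start.tv_sec) * 1e3 +
			  (end.tv_nsec - start.tv_nsec) / 1e6;
	printf("total %.2f ms, avg %.4f ms/iteration beyond the sleep interval\n",
	       total_ms, total_ms / NR_ITERS - interval_us / 1e3);
	return 0;
}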

Thanks,
Bertrand
