linux-kernel - Re: [PATCH v2] cgroup/rstat: change cgroup_base

Message-ID: <aGKxvQdAZ-vSd48D@slm.duckdns.org>
Date: Mon, 30 Jun 2025 05:48:13 -1000
From: "tj@...nel.org" <tj@...nel.org>
To: "Wlodarczyk, Bertrand" <bertrand.wlodarczyk@...el.com>
Cc: Shakeel Butt <shakeel.butt@...ux.dev>,
	"hannes@...xchg.org" <hannes@...xchg.org>,
	"mkoutny@...e.com" <mkoutny@...e.com>,
	"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"inwardvessel@...il.com" <inwardvessel@...il.com>
Subject: Re: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic

Hello,

On Mon, Jun 30, 2025 at 02:25:27PM +0000, Wlodarczyk, Bertrand wrote:
> >  > Also the response to the tearing issue explained by JP is not satisfying.
> > 
> > In other words, the claim is: "it's better to stall other CPUs in a
> > spinlock and disable IRQs every time in order to serve an outdated snapshot instead of providing the user with the freshest statistics much, much faster".
> > In terms of statistics, the freshest data served fast to the user is, in my opinion, the better behavior.
> 
> > This is a false choice, I think. e.g. We can easily use seqlock to remove strict synchronization only from user side, right?
> 
> Yes, that's a second way to solve the problem.
> I chose the atomics approach because, in my opinion, incremental statistics are a somewhat natural use case for them.

They're good for individual counters but I'm not sure they're a natural fit
for a group of stats. A series of atomic ops can be significantly more
expensive than locked updates and it also comes with problems like split
updates as discussed in this thread. I think most of the resistance is from
the use of atomics. Can you please try a different approach?
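
For illustration only, the seqlock-style alternative mentioned above can be
sketched in userspace C11 roughly as follows. This is not kernel code and not
the rstat implementation: struct stat_seq, struct base_stat, the field names
and both helpers are hypothetical, and writers are assumed to be serialized
by some existing lock. In the kernel the same pattern would normally be built
on seqcount_t from <linux/seqlock.h> rather than open-coded. The point is
that the writer pays for one sequence bump around a group of plain updates,
while readers take no lock and simply retry if they raced with a writer, so
they never see a split update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical group of stats; names are illustrative only. */
struct base_stat {
	_Atomic uint64_t utime;
	_Atomic uint64_t stime;
	_Atomic uint64_t ntime;
};

struct stat_seq {
	atomic_uint seq;		/* odd while an update is in progress */
	struct base_stat bstat;
};

/* Writer side. Assumption: callers serialize writers among themselves. */
static void stat_update(struct stat_seq *s, uint64_t du, uint64_t ds,
			uint64_t dn)
{
	unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

	/* Mark the update as in progress (sequence becomes odd). */
	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	/* Relaxed ops: ordering is provided by the sequence counter. */
	atomic_fetch_add_explicit(&s->bstat.utime, du, memory_order_relaxed);
	atomic_fetch_add_explicit(&s->bstat.stime, ds, memory_order_relaxed);
	atomic_fetch_add_explicit(&s->bstat.ntime, dn, memory_order_relaxed);

	/* Publish the update (sequence becomes even again). */
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader side: lockless, retries until it gets a consistent snapshot. */
static void stat_read(struct stat_seq *s, uint64_t out[3])
{
	unsigned int begin, end;

	do {
		begin = atomic_load_explicit(&s->seq, memory_order_acquire);
		out[0] = atomic_load_explicit(&s->bstat.utime,
					      memory_order_relaxed);
		out[1] = atomic_load_explicit(&s->bstat.stime,
					      memory_order_relaxed);
		out[2] = atomic_load_explicit(&s->bstat.ntime,
					      memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		end = atomic_load_explicit(&s->seq, memory_order_relaxed);
	} while ((begin & 1) || begin != end);
}
```

With this shape the update path stays a locked, non-atomic-heavy sequence and
only the user-facing read path loses strict synchronization.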

> > I wouldn't be addressing this issue if there were no customers 
> > affected by rstat latency in multi-container multi-cpu scenarios.
> 
> > Out of curiosity, can you explain the case that you observed in more detail?
> > What was the customer doing?
> 
> Single hierarchy, hundreds of containers on one server, multiple independent owners.
> Some of them want to have current stats available in their web GUI.
> They are hammering the stats for their cgroups.
> The server experiences inefficiencies; perf shows a visible percentage of CPU cycles spent in cgroup_rstat_flush.
> 
> I prepared a benchmark that exemplifies the issue faced by the customer:
> https://gist.github.com/bwlodarcz/21bbc24813bced8e6ffc9e5ca3150fcc
> 
> qemu vm:
>                +---------+---------+
>      mean (s)  |8dcb0ed8 | patched |
> +--------------+---------+---------+
> |cpu, KCSAN on |16.13*   |3.75     |
> +--------------+---------+---------+
> |cpu, KCSAN off|4.45     |0.81     |
> +--------------+---------+---------+
> *race condition still present
> 
> It doesn't hammer the lock as much as the previous stressor, so the results are better for the for-6.17 branch.
> The customer runs at a much bigger scale than the 4 cgroups in the benchmark.
> There are workarounds implemented so it's not that hot now (for them).
> Anyway, I think it's worth trying to improve the scalability situation,
> especially since, as far as I can see, there are no downsides.
>  
> There are also reports of similar problems in memory rstats but I haven't looked at them yet.

Yeah, I saw the benchmark but I was more curious what actual use case would
lead to behaviors like that because you'd have to hammer on those stats
really hard for this to be a problem. In most use cases that I'm aware of,
the polling frequencies of these stats are >= 1sec. I guess the users in
your use case were banging on them way harder, at least previously.

I don't think switching to atomics is a good idea, but improving the read
scalability would definitely be nice.

Thanks.

-- 
tejun