Message-ID: <5f055416-9c49-42c0-9ba0-e45f6aaeac04@gmail.com>
Date: Tue, 17 Jun 2025 14:05:37 -0700
From: JP Kobryn <inwardvessel@...il.com>
To: Bertrand Wlodarczyk <bertrand.wlodarczyk@...el.com>, tj@...nel.org,
 hannes@...xchg.org, mkoutny@...e.com, cgroups@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic

On 6/17/25 3:26 AM, Bertrand Wlodarczyk wrote:
> The kernel currently faces scalability issues when multiple userspace
> programs attempt to read cgroup statistics concurrently.
> 
> The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
> which prevents access and updates to the statistics
> of the css from multiple CPUs in parallel.
> 
> Given that rstat operates on a per-CPU basis and only aggregates
> statistics in the parent cgroup, there is no compelling reason
> why these statistics cannot be atomic.

Have you considered the "tearing" that can occur when writes and reads
happen in parallel?
The existing lock effectively gives readers a consistent snapshot.
Changing the fields involved to atomics with lockless reads/writes can
yield inconsistent values, i.e. fieldA might be more current than fieldB.

> By eliminating the lock, each CPU can traverse its rstat hierarchy
> independently, without blocking. Synchronization is achieved during
> parent propagation through atomic operations.

Even if the tearing scenario mentioned above is acceptable, removing
the lock will break synchronization of flushing non-base stat
subsystems.

> 
> This change significantly enhances performance in scenarios
> where multiple CPUs access CPU rstat within a single cgroup hierarchy,
> yielding a performance improvement of around 50 times compared
> to the mainline version.
> Notably, performance for memory and I/O rstats remains unchanged,
> as these are managed in separate submodules.
> 
> Additionally, this patch addresses a race condition detectable
> in the current mainline by KCSAN in __cgroup_account_cputime,
> which occurs when attempting to read a single hierarchy
> from multiple CPUs.
> 
> Signed-off-by: Bertrand Wlodarczyk <bertrand.wlodarczyk@...el.com>
> ---
> Benchmark code: https://gist.github.com/bwlodarcz/c955b36b5667f0167dffcff23953d1da
> 
> Tested on Intel(R) Xeon(R) Platinum 8468V, 2s 48c 2tpc, 377GiB RAM, Fedora 41:
> +--------+-------+
> |Mainline|Patched|
> +--------+-------+
> |369.95s |6.52s  |
> +--------+-------+
> 
[..]
> @@ -820,7 +813,6 @@ struct cgroup_subsys {
>   	 */
>   	unsigned int depends_on;
>   
> -	spinlock_t rstat_ss_lock;

This lock is not used with base stats. The base stats are not a formal
subsystem.

[..]
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index cbeaa499a96a..36af2b883440 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -9,7 +9,6 @@
>   
>   #include <trace/events/cgroup.h>
>   
> -static DEFINE_SPINLOCK(rstat_base_lock);
>   static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
>   
>   static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
> @@ -37,14 +36,6 @@ static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
>   	return per_cpu_ptr(cgrp->rstat_base_cpu, cpu);
>   }
>   
> -static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
> -{
> -	if (ss)
> -		return &ss->rstat_ss_lock;

This was needed for non-base stat subsystems like memory and io.

> -
> -	return &rstat_base_lock;
> -}
