Message-ID: <CH3PR11MB7894DDEE6C630D5A3A4D23A1F145A@CH3PR11MB7894.namprd11.prod.outlook.com>
Date: Fri, 27 Jun 2025 13:15:31 +0000
From: "Wlodarczyk, Bertrand" <bertrand.wlodarczyk@...el.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
CC: "tj@...nel.org" <tj@...nel.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "mkoutny@...e.com" <mkoutny@...e.com>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"inwardvessel@...il.com" <inwardvessel@...il.com>
Subject: RE: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
> The kernel faces scalability issues when multiple userspace programs
> attempt to read cgroup statistics concurrently.
>
> The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
> which prevents access and updates to the statistics of the css from
> multiple CPUs in parallel.
>
> Given that rstat operates on a per-CPU basis and only aggregates
> statistics in the parent cgroup, there is no compelling reason why
> these statistics cannot be atomic.
> By eliminating the lock during CPU statistics access, each CPU can
> traverse its rstat hierarchy independently, without blocking.
> Synchronization is achieved during parent propagation through atomic
> operations.
>
> This change significantly enhances performance on commit
> 8dcb0ed834a3ec03 ("memcg: cgroup: call css_rstat_updated irrespective
> of in_nmi()") in scenarios where multiple CPUs access CPU rstat within
> a single cgroup hierarchy, yielding a performance improvement of around 40 times.
> Notably, performance for memory and I/O rstats remains unchanged, as
> the lock remains in place for these usages.
>
> Additionally, this patch addresses a race condition detectable in the
> current mainline by KCSAN in __cgroup_account_cputime, which occurs
> when attempting to read a single hierarchy from multiple CPUs.
>
> Signed-off-by: Bertrand Wlodarczyk <bertrand.wlodarczyk@...el.com>
> This patch breaks the memory controller, as explained in the comments on the previous version.
Ahem... no? I addressed that issue: v2 has the lock back, surrounding the calls into the dependent subsystems.
The behavior is the same as before the patch.
In the long term, in my opinion, the dependent subsystems should also move to atomics so that the locks can be
eliminated completely, as sketched below.
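To make that long-term idea concrete, here is a minimal sketch of the direction (struct and function names are
made up for illustration, this is not the v2 diff): per-CPU deltas are folded into the parent with atomic64 ops,
so updaters and flushers no longer need to serialize on a spinlock for the base stats.

#include <linux/atomic.h>
#include <linux/types.h>

/* Illustrative only -- not the actual cgroup_base_stat layout. */
struct base_stat_atomic {
	atomic64_t utime;
	atomic64_t stime;
	atomic64_t sum_exec_runtime;
};

/*
 * Fold a per-CPU delta into the parent without taking any lock.
 * Each CPU can do this concurrently; readers see each field tear-free.
 */
static void base_stat_propagate(struct base_stat_atomic *parent,
				const struct base_stat_atomic *delta)
{
	atomic64_add(atomic64_read(&delta->utime), &parent->utime);
	atomic64_add(atomic64_read(&delta->stime), &parent->stime);
	atomic64_add(atomic64_read(&delta->sum_exec_runtime),
		     &parent->sum_exec_runtime);
}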
> Also the response to the tearing issue explained by JP is not satisfying.
In other words, the claim is: "it's better to stall the other CPUs on a spinlock and disable IRQs every time, in
order to serve an outdated snapshot, than to give the user the freshest statistics much, much faster."
When it comes to statistics, the freshest data served quickly to the user is, in my opinion, the better behavior.
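To make the trade-off concrete, a rough sketch of the two read paths (types and names invented for the example):
the locked read returns a snapshot whose fields all belong to the same instant, while the lock-free read returns
fields that are individually tear-free but may mix two adjacent update intervals.

#include <linux/atomic.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct cpu_snapshot {
	u64 utime;
	u64 stime;
};

struct cpu_stat_locked {
	spinlock_t lock;	/* initialized elsewhere in this sketch */
	struct cpu_snapshot stat;
};

struct cpu_stat_atomic {
	atomic64_t utime;
	atomic64_t stime;
};

/*
 * Locked read: all fields are from the same instant, but readers
 * serialize on the lock and IRQs are disabled for the duration.
 */
static struct cpu_snapshot read_locked(struct cpu_stat_locked *s)
{
	struct cpu_snapshot snap;
	unsigned long flags;

	spin_lock_irqsave(&s->lock, flags);
	snap = s->stat;
	spin_unlock_irqrestore(&s->lock, flags);
	return snap;
}

/*
 * Lock-free read: no stalls and no IRQ toggling, each field is
 * tear-free on its own, but utime and stime may come from slightly
 * different update intervals.
 */
static struct cpu_snapshot read_atomic(struct cpu_stat_atomic *s)
{
	struct cpu_snapshot snap;

	snap.utime = atomic64_read(&s->utime);
	snap.stime = atomic64_read(&s->stime);
	return snap;
}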
I wouldn't be addressing this issue if there were no customers affected by rstat latency in multi-container,
multi-CPU scenarios.
> Please run scripts/faddr2line on css_rstat_flush+0x1b0/0xed0 and
> css_rstat_updated+0x8f/0x1a0 to see which field is causing the race.
There is more than one race in the current for-next-6.17. In the faddr2line output below, the first location is
the write and the second is the read.
The benchmark provided in the gist exposes the issue; it is roughly of the shape sketched below.
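For readers without the gist, the reproducer is essentially of this shape (the path, thread count and iteration
count are placeholders; the actual gist may differ): many threads reading cpu.stat of the same cgroup in parallel.

/* Build with: gcc -O2 -pthread reproducer.c -o benchmark */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS	64
#define ITERATIONS	100000

/* Example path; point it at any populated cgroup v2 directory. */
static const char *stat_path = "/sys/fs/cgroup/test/cpu.stat";

static void *reader(void *arg)
{
	char buf[4096];
	int i;

	(void)arg;
	for (i = 0; i < ITERATIONS; i++) {
		int fd = open(stat_path, O_RDONLY);

		if (fd < 0) {
			perror("open");
			exit(1);
		}
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, reader, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}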
[ 30.547317] BUG: KCSAN: data-race in css_rstat_flush / css_rstat_updated
[ 30.549011]
[ 30.549483] write to 0xffd1ffffff686a30 of 8 bytes by task 1014 on cpu 82:
[ 30.551124] css_rstat_flush+0x1b0/0xed0
[ 30.552260] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 30.553582] cpu_stat_show+0x14/0x1a0
[ 30.555477] cgroup_seqfile_show+0xb0/0x150
[ 30.557060] kernfs_seq_show+0x93/0xb0
[ 30.558241] seq_read_iter+0x190/0x7d0
[ 30.559278] kernfs_fop_read_iter+0x23b/0x290
[ 30.560416] vfs_read+0x46b/0x5a0
[ 30.561336] ksys_read+0xa5/0x130
[ 30.562190] __x64_sys_read+0x3c/0x50
[ 30.563179] x64_sys_call+0x19e1/0x1c10
[ 30.564215] do_syscall_64+0xa2/0x200
[ 30.565214] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 30.566456]
[ 30.566892] read to 0xffd1ffffff686a30 of 8 bytes by interrupt on cpu 74:
[ 30.568472] css_rstat_updated+0x8f/0x1a0
[ 30.569499] __cgroup_account_cputime+0x5d/0x90
[ 30.570640] update_curr+0x1bd/0x260
[ 30.571559] task_tick_fair+0x3b/0x130
[ 30.572545] sched_tick+0xa1/0x220
[ 30.573510] update_process_times+0x97/0xd0
[ 30.574576] tick_nohz_handler+0xfc/0x220
[ 30.575650] __hrtimer_run_queues+0x2a3/0x4b0
[ 30.576703] hrtimer_interrupt+0x1c6/0x3a0
[ 30.577761] __sysvec_apic_timer_interrupt+0x62/0x180
[ 30.578982] sysvec_apic_timer_interrupt+0x6b/0x80
[ 30.580161] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 30.581397] _raw_spin_unlock_irq+0x18/0x30
[ 30.582505] css_rstat_flush+0x5cd/0xed0
[ 30.583611] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 30.584934] cpu_stat_show+0x14/0x1a0
[ 30.585814] cgroup_seqfile_show+0xb0/0x150
[ 30.586915] kernfs_seq_show+0x93/0xb0
[ 30.587876] seq_read_iter+0x190/0x7d0
[ 30.588797] kernfs_fop_read_iter+0x23b/0x290
[ 30.589904] vfs_read+0x46b/0x5a0
[ 30.590723] ksys_read+0xa5/0x130
[ 30.591659] __x64_sys_read+0x3c/0x50
[ 30.592612] x64_sys_call+0x19e1/0x1c10
[ 30.593593] do_syscall_64+0xa2/0x200
[ 30.594523] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 30.595756]
[ 30.596305] value changed: 0x0000000000000000 -> 0xffd1ffffff686a30
[ 30.597787]
[ 30.598286] Reported by Kernel Concurrency Sanitizer on:
[ 30.599583] CPU: 74 UID: 0 PID: 1006 Comm: benchmark Not tainted 6.15.0-g633e6bad3124 #12 PREEMPT(voluntary)
[ 30.601968] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
./scripts/faddr2line vmlinux css_rstat_flush+0x1b0/0xed0 css_rstat_updated+0x8f/0x1a0
css_rstat_flush+0x1b0/0xed0:
init_llist_node at include/linux/llist.h:86
(inlined by) llist_del_first_init at include/linux/llist.h:308
(inlined by) css_process_update_tree at kernel/cgroup/rstat.c:148
(inlined by) css_rstat_updated_list at kernel/cgroup/rstat.c:258
(inlined by) css_rstat_flush at kernel/cgroup/rstat.c:389
css_rstat_updated+0x8f/0x1a0:
css_rstat_updated at kernel/cgroup/rstat.c:90 (discriminator 1)
---
[ 140.063127] BUG: KCSAN: data-race in __cgroup_account_cputime / css_rstat_flush
[ 140.064809]
[ 140.065290] write to 0xffd1ffffff711f50 of 8 bytes by interrupt on cpu 76:
[ 140.067221] __cgroup_account_cputime+0x4a/0x90
[ 140.068346] update_curr+0x1bd/0x260
[ 140.069278] task_tick_fair+0x3b/0x130
[ 140.070226] sched_tick+0xa1/0x220
[ 140.071080] update_process_times+0x97/0xd0
[ 140.072091] tick_nohz_handler+0xfc/0x220
[ 140.073048] __hrtimer_run_queues+0x2a3/0x4b0
[ 140.074105] hrtimer_interrupt+0x1c6/0x3a0
[ 140.075081] __sysvec_apic_timer_interrupt+0x62/0x180
[ 140.076262] sysvec_apic_timer_interrupt+0x6b/0x80
[ 140.077423] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 140.078625] _raw_spin_unlock_irq+0x18/0x30
[ 140.079579] css_rstat_flush+0x5cd/0xed0
[ 140.080501] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 140.081638] cpu_stat_show+0x14/0x1a0
[ 140.082534] cgroup_seqfile_show+0xb0/0x150
[ 140.083534] kernfs_seq_show+0x93/0xb0
[ 140.084457] seq_read_iter+0x190/0x7d0
[ 140.085373] kernfs_fop_read_iter+0x23b/0x290
[ 140.086416] vfs_read+0x46b/0x5a0
[ 140.087263] ksys_read+0xa5/0x130
[ 140.088088] __x64_sys_read+0x3c/0x50
[ 140.088921] x64_sys_call+0x19e1/0x1c10
[ 140.089814] do_syscall_64+0xa2/0x200
[ 140.090698] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 140.091932]
[ 140.092357] read to 0xffd1ffffff711f50 of 8 bytes by task 1172 on cpu 16:
[ 140.093877] css_rstat_flush+0x717/0xed0
[ 140.094791] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 140.095989] cpu_stat_show+0x14/0x1a0
[ 140.096866] cgroup_seqfile_show+0xb0/0x150
[ 140.097817] kernfs_seq_show+0x93/0xb0
[ 140.098694] seq_read_iter+0x190/0x7d0
[ 140.099625] kernfs_fop_read_iter+0x23b/0x290
[ 140.100674] vfs_read+0x46b/0x5a0
[ 140.101529] ksys_read+0xa5/0x130
[ 140.102382] __x64_sys_read+0x3c/0x50
[ 140.103290] x64_sys_call+0x19e1/0x1c10
[ 140.104252] do_syscall_64+0xa2/0x200
[ 140.105157] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 140.106343]
[ 140.106750] value changed: 0x000000032a8e1130 -> 0x000000032ab08ca6
[ 140.108251]
[ 140.108670] Reported by Kernel Concurrency Sanitizer on:
[ 140.109910] CPU: 16 UID: 0 PID: 1172 Comm: benchmark Not tainted 6.15.0-g633e6bad3124 #12 PREEMPT(voluntary)
[ 140.112075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
./scripts/faddr2line vmlinux __cgroup_account_cputime+0x4a/0x90 css_rstat_flush+0x717/0xed0
__cgroup_account_cputime+0x4a/0x90:
__cgroup_account_cputime at kernel/cgroup/rstat.c:595
css_rstat_flush+0x717/0xed0:
cgroup_base_stat_flush at kernel/cgroup/rstat.c:546
(inlined by) css_rstat_flush at kernel/cgroup/rstat.c:392
---
[ 156.387539] BUG: KCSAN: data-race in __cgroup_account_cputime_field / css_rstat_flush
[ 156.389371]
[ 156.389784] write to 0xffd1fffffe7d1f40 of 8 bytes by interrupt on cpu 15:
[ 156.391394] __cgroup_account_cputime_field+0x9d/0xe0
[ 156.392539] account_system_index_time+0x84/0x90
[ 156.393585] update_process_times+0x25/0xd0
[ 156.394544] tick_nohz_handler+0xfc/0x220
[ 156.395517] __hrtimer_run_queues+0x2a3/0x4b0
[ 156.396544] hrtimer_interrupt+0x1c6/0x3a0
[ 156.397515] __sysvec_apic_timer_interrupt+0x62/0x180
[ 156.398660] sysvec_apic_timer_interrupt+0x6b/0x80
[ 156.399769] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 156.400937] _raw_spin_unlock_irq+0x18/0x30
[ 156.401902] css_rstat_flush+0x5cd/0xed0
[ 156.402774] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 156.403940] cpu_stat_show+0x14/0x1a0
[ 156.404763] cgroup_seqfile_show+0xb0/0x150
[ 156.405724] kernfs_seq_show+0x93/0xb0
[ 156.406643] seq_read_iter+0x190/0x7d0
[ 156.407522] kernfs_fop_read_iter+0x23b/0x290
[ 156.408549] vfs_read+0x46b/0x5a0
[ 156.409386] ksys_read+0xa5/0x130
[ 156.410176] __x64_sys_read+0x3c/0x50
[ 156.410973] x64_sys_call+0x19e1/0x1c10
[ 156.411862] do_syscall_64+0xa2/0x200
[ 156.412673] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 156.413814]
[ 156.414249] read to 0xffd1fffffe7d1f40 of 8 bytes by task 1140 on cpu 85:
[ 156.415718] css_rstat_flush+0x6fe/0xed0
[ 156.416669] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 156.417855] cpu_stat_show+0x14/0x1a0
[ 156.418684] cgroup_seqfile_show+0xb0/0x150
[ 156.419637] kernfs_seq_show+0x93/0xb0
[ 156.420519] seq_read_iter+0x190/0x7d0
[ 156.421395] kernfs_fop_read_iter+0x23b/0x290
[ 156.422413] vfs_read+0x46b/0x5a0
[ 156.423228] ksys_read+0xa5/0x130
[ 156.423974] __x64_sys_read+0x3c/0x50
[ 156.424773] x64_sys_call+0x19e1/0x1c10
[ 156.425704] do_syscall_64+0xa2/0x200
[ 156.426600] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 156.427723]
[ 156.428217] value changed: 0x00000004be5ffd29 -> 0x00000004be6f3f69
[ 156.429575]
[ 156.430024] Reported by Kernel Concurrency Sanitizer on:
[ 156.431227] CPU: 85 UID: 0 PID: 1140 Comm: benchmark Not tainted 6.15.0-g633e6bad3124 #12 PREEMPT(voluntary)
[ 156.433406] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
./scripts/faddr2line vmlinux __cgroup_account_cputime_field+0x9d/0xe0 css_rstat_flush+0x6fe/0xed0
__cgroup_account_cputime_field+0x9d/0xe0:
__cgroup_account_cputime_field at kernel/cgroup/rstat.c:617
css_rstat_flush+0x6fe/0xed0:
cgroup_base_stat_flush at kernel/cgroup/rstat.c:546
(inlined by) css_rstat_flush at kernel/cgroup/rstat.c:392