Message-ID: <57E7A552-AFFA-4860-8149-747CBA5C0541@amazon.de>
Date: Tue, 27 May 2025 09:08:55 +0000
From: "Krcka, Tomas" <krckatom@...zon.de>
To: "longman@...hat.com" <longman@...hat.com>
CC: "dave@...olabs.net" <dave@...olabs.net>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 5/5] locking/rwsem: Remove reader optimistic spinning

Hi Waiman,

I recently discovered that commit 617f3ef95177 ("locking/rwsem: Remove reader
optimistic spinning") results in up to a 50% performance drop in real-life scenarios
where multiple processes perform read and write operations on sysfs concurrently.
Reverting the patch restores the lost performance, so my suggestion would be to
revert it from mainline as well.

I initially noticed the degradation with a workload in which multiple processes access
the same sysfs nodes; the time needed to complete the test increased by up to 50%.
After investigation, I traced the root cause back to two related changes:
first, kernfs switching from a mutex to an rwsem (commit 7ba0273b2f34 ("kernfs: switch kernfs to use an rwsem")),
and ultimately the removal of reader optimistic spinning.

The lock contention tracing shows a clear pattern: a process accessing a kernfs_dentry_node
and taking the read semaphore is now forced into the slow path, because the lock is already
held by another process operating on the same node that needs the write semaphore.
(See the ftrace output below for the exact operations.)

This contrasts with the previous behavior, where optimistic spinning prevented such
situations.

I have confirmed this behavior on multiple kernel versions (6.14.4, 6.8.12, 6.6.80), as well as
by backporting the mentioned commits to older versions (specifically v5.10).
While the real-world impact was observed on AArch64, I've reproduced the core issue
with our test case on both AArch64 (192 vCPUs) and x86 Ice Lake (128 vCPUs) systems.

While I identified this through sysfs (kernfs) operations, I believe this regression 
could affect other subsystems using reader-writer semaphores with similar access 
patterns. 

ftrace with the commit showing this pattern:

""""
userspace_bench-6796    [000] .....  2328.023515: contention_begin: 00000000ca66c48e (flags=READ)
^^^ -- this reader starts waiting for the lock; all subsequent threads will now queue behind it

userspace_bench-6796    [000] d....  2328.023518: sched_switch: prev_comm=userspace_bench prev_pid=6796 prev_prio=120 prev_state=D ==> next_comm=userspace_bench next_pid=6798 next_prio=120
userspace_bench-6806    [009] d....  2328.023524: sched_switch: prev_comm=userspace_bench prev_pid=6806 prev_prio=120 prev_state=R+ ==> next_comm=migration/9 next_pid=70 next_prio=0
userspace_bench-6804    [004] d....  2328.023532: contention_begin: 00000000ca66c48e (flags=WRITE)
userspace_bench-6805    [005] d....  2328.023532: contention_begin: 00000000ca66c48e (flags=WRITE)
userspace_bench-6804    [004] d....  2328.023533: sched_switch: prev_comm=userspace_bench prev_pid=6804 prev_prio=120 prev_state=D ==> next_comm=swapper/4 next_pid=0 next_prio=120
userspace_bench-6797    [001] .....  2328.023534: contention_begin: 00000000ca66c48e (flags=READ)

... [cut] .... 

userspace_bench-6807    [007] .....  2328.023661: contention_begin: 00000000ca66c48e (flags=READ)
userspace_bench-6807    [007] d....  2328.023666: sched_switch: prev_comm=userspace_bench prev_pid=6807 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
userspace_bench-6813    [013] .....  2328.023669: contention_begin: 00000000ca66c48e (flags=READ)
userspace_bench-6815    [015] .....  2328.023673: contention_begin: 00000000ca66c48e (flags=READ)
userspace_bench-6813    [013] d....  2328.023674: sched_switch: prev_comm=userspace_bench prev_pid=6813 prev_prio=120 prev_state=D ==> next_comm=swapper/13 next_pid=0 next_prio=120
userspace_bench-6815    [015] d....  2328.023675: sched_switch: prev_comm=userspace_bench prev_pid=6815 prev_prio=120 prev_state=D ==> next_comm=swapper/15 next_pid=0 next_prio=120
userspace_bench-6803    [003] .....  2328.026170: contention_begin: 00000000ca66c48e (flags=READ)
userspace_bench-6803    [003] d....  2328.026171: sched_switch: prev_comm=userspace_bench prev_pid=6803 prev_prio=120 prev_state=D ==> next_comm=swapper/3 next_pid=0 next_prio=120
userspace_bench-6798    [000] d....  2328.027162: sched_switch: prev_comm=userspace_bench prev_pid=6798 prev_prio=120 prev_state=R ==> next_comm=userspace_bench next_pid=6800 next_prio=120
userspace_bench-6799    [001] d....  2328.027162: sched_switch: prev_comm=userspace_bench prev_pid=6799 prev_prio=120 prev_state=R ==> next_comm=userspace_bench next_pid=6801 next_prio=120
userspace_bench-6800    [000] .....  2328.027165: contention_begin: 00000000ca66c48e (flags=READ)
userspace_bench-6801    [001] .....  2328.027166: contention_begin: 00000000ca66c48e (flags=READ)
userspace_bench-6800    [000] d....  2328.027167: sched_switch: prev_comm=userspace_bench prev_pid=6800 prev_prio=120 prev_state=D ==> next_comm=userspace_bench next_pid=6796 next_prio=120
userspace_bench-6801    [001] d....  2328.027167: sched_switch: prev_comm=userspace_bench prev_pid=6801 prev_prio=120 prev_state=D ==> next_comm=userspace_bench next_pid=6799 next_prio=120
userspace_bench-6796    [000] .....  2328.027168: contention_end: 00000000ca66c48e (ret=0)
^^^^ -- here the reader finally acquired the lock, ~3.7ms after it started waiting
""""
Without the commit, none of the above waiting occurs; the processes stay in optimistic spinning instead.


In our situation, the writer takes the semaphore for write from this path:
"""
userspace_bench-6800     [031] .....     0.251700: contention_begin: 000000007a4d517c (ret=0)
userspace_bench-6800     [031] .....     0.251700: <stack trace>
=> __traceiter_contention_fastpath
=> rwsem_down_write_slowpath
=> down_write
=> kernfs_activate
=> kernfs_add_one
=> __kernfs_create_file
=> sysfs_add_file_mode_ns
=> internal_create_group
=> internal_create_groups.part.6
=> sysfs_create_groups
"""

and the reader blocks on the same semaphore from this path:
"""
userspace_bench-6801     [095] .....     0.251700: contention_begin: 000000007a4d517c (flags=READ)
userspace_bench-6801     [095] .....     0.251700: <stack trace>
=> __traceiter_contention_begin
=> rwsem_down_read_slowpath
=> down_read
=> kernfs_dop_revalidate
=> lookup_fast
=> walk_component
=> link_path_walk.part.74
=> path_openat
=> do_filp_open
=> do_sys_openat2
=> do_sys_open
"""

----

To help investigate this issue, I've created a minimal reproduction case:
1. Test repository: https://github.com/tomaskrcka/sysfs_bench
2. The test consists of:
  - A kernel module that creates a sysfs interface and handles file operations
  - A userspace application that spawns writers (matching the CPU core count) and readers

Using the test case on kernel 6.14.4, I collected the following measurements 
(100 samples each):

On AArch64 c8g (192 vCPUs):
- Without revert:
 * Avg: 3.50s (min: 3.39s, max: 3.63s, p99: 3.59s)
- With revert:
 * Avg: 2.70s (min: 2.65s, max: 3.10s, p99: 2.83s)
 * ~23% improvement

On x86 Ice Lake m6i (128 vCPUs):
- Without revert:
 * Avg: 6.71s (min: 6.61s, max: 7.60s, p99: 6.82s)
- With revert:
 * Avg: 6.28s (min: 5.89s, max: 7.52s, p99: 6.65s)
 * ~6% improvement

Could you take a look at this and let me know your thoughts?

I’m happy to help with further investigation.

Thanks,
Tomas



Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
