Message-ID: <1e3c711f-8c96-4c39-bbe2-7742940d1d31@meta.com>
Date: Wed, 7 May 2025 19:13:18 -0400
From: Chris Mason <clm@...a.com>
To: Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org
Subject: scheduler performance regression since v6.11

Hi everyone,

I've spent some time trying to track down a regression in a networking
benchmark, where it looks like we're spending roughly 10% more time in
new idle balancing than 6.9 did.

I'm not sure if I've reproduced that exact regression, but with some
changes to schbench, I was able to bisect a regression of some kind down
to commits in v6.11.

The actual result of the bisect was:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
781773e3b68031bd001c0c18aa72e8470c225ebd
e1459a50ba31831efdfc35278023d959e4ba775b
f12e148892ede8d9ee82bcd3e469e6d01fc077ac
152e11f6df293e816a6a37c69757033cdc72667d
2e0199df252a536a03f4cb0810324dff523d1e79
54a58a78779169f9c92a51facf6de7ce94962328
We cannot bisect more!

And this roughly matches the commits that introduce DELAY_DEQUEUE and
DELAY_ZERO.  Sadly, echoing NO_DELAY_DEQUEUE/NO_DELAY_ZERO into the
scheduler features doesn't fix the regression in 6.15.  I had to skip
~5 commits because they didn't compile, but Peter might have already
worked that part of the bisect out by the time this email hits the
lists.

I've landed a few modifications to schbench
(https://github.com/masoncl/schbench), but the basic workload is:

 - 4 message threads
  - each message thread pinned to its own single CPU
 - each waking up 128 worker threads
  - each worker sharing all the remaining CPUs

Neither the message threads nor the worker threads do anything other
than wake up and go back to sleep, so the whole RPS count from schbench
is basically a measure of how fast 4 threads can wake up everyone else
on the system.
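For anyone who wants to reason about the wakeup pattern without building
schbench, the shape of the workload can be sketched in plain Python.  This
is a hypothetical stand-in, not schbench itself: the thread counts are
scaled down, and it ignores CPU pinning entirely.

```python
import queue
import threading

def run(n_msg=4, n_workers=8, wakeups_per_worker=50):
    """Toy model of the schbench loop: message threads post wakeups,
    workers wake, count one 'request', and go straight back to sleep."""
    total = 0
    lock = threading.Lock()
    messengers = []
    for _ in range(n_msg):
        q = queue.Queue()

        def worker(q=q):
            nonlocal total
            while True:
                token = q.get()      # sleep until a message thread wakes us
                if token is None:    # shutdown sentinel
                    return
                with lock:
                    total += 1       # "request" done; back to sleep

        workers = [threading.Thread(target=worker) for _ in range(n_workers)]
        for w in workers:
            w.start()

        def messenger(q=q, workers=workers):
            for _ in range(n_workers * wakeups_per_worker):
                q.put(1)             # wake one worker
            for _ in range(n_workers):
                q.put(None)          # then tell everyone to exit
            for w in workers:
                w.join()

        t = threading.Thread(target=messenger)
        t.start()
        messengers.append(t)
    for t in messengers:
        t.join()
    return total
```

All the "work" is in the wakeup itself, which is why the RPS numbers below
are so sensitive to the wakeup path.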

The answer is that v6.9 can do it roughly 8% faster than 6.11+.  I've
tried a few different CPUs; some have bigger or smaller deltas, but the
regression is consistently there.

The exact command line I'm using:

schbench -L -m 4 -M auto -t 128 -n 0 -r 60

-L turns off the locking complexity I use to simulate our web workload
-m 4 is 4 message threads
-t 128 is 128 workers per message thread
-n 0 turns off all the think time
-r 60 is runtime in seconds
-M auto is the new CPU pinning described above

I haven't actually tried to fix this yet, but I do have some profiling
that might help, or is at least interesting.  The most important thing
in the data down below is probably that 6.9 calls available_idle_cpu()
16M times in 10 seconds, while 6.15 calls it 56M times.

The next step for me is to stall for time while hoping someone else
fixes this, but I'll try to understand why we're available_idle_cpuing
so hard in later kernels.

6.15.0-rc5 (git sha 0d8d44db295c, Linus as of May 6)

schbench RPS percentiles (requests) runtime 60 (s) (61 total samples)
	  20.0th: 1767424    (13 samples)
	* 50.0th: 1816576    (18 samples)
	  90.0th: 1841152    (27 samples)
	  min=1733207, max=2027049
average rps: 1813674.47

v6.9

schbench RPS percentiles (requests) runtime 60 (s) (61 total samples)
	  20.0th: 1955840    (14 samples)
	* 50.0th: 1968128    (17 samples)
	  90.0th: 1984512    (26 samples)
	  min=1908492, max=2122446
average rps: 1972418.82
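The 8% figure falls straight out of the two average rps numbers as
printed above:

```python
rps_v6_9  = 1972418.82   # v6.9 average rps, from the run above
rps_v6_15 = 1813674.47   # 6.15.0-rc5 average rps

regression = (rps_v6_9 - rps_v6_15) / rps_v6_9
print(f"v6.15-rc5 is {regression:.1%} slower than v6.9")
# → v6.15-rc5 is 8.0% slower than v6.9
```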

6.9 perf from CPU #2

    15.50%  schbench  [kernel.kallsyms]  [k] available_idle_cpu
     7.45%  schbench  [kernel.kallsyms]  [k] llist_add_batch
     6.92%  schbench  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     5.90%  schbench  [kernel.kallsyms]  [k] futex_wake
     4.62%  schbench  schbench           [.] fpost
     4.57%  schbench  [kernel.kallsyms]  [k] futex_wake_mark
     4.46%  schbench  [kernel.kallsyms]  [k] select_task_rq_fair
     4.26%  schbench  [kernel.kallsyms]  [k] _raw_spin_lock
     4.19%  schbench  schbench           [.] xlist_wake_all
     3.95%  schbench  [kernel.kallsyms]  [k] _find_next_bit
     3.55%  schbench  [kernel.kallsyms]  [k] __futex_unqueue

I don't know why perf record -g and perf report -g aren't giving me
call graphs; it must be something funky with the fleet version of perf
and my hand-built kernels.  But here's the top call stack for
available_idle_cpu(), along with line numbers from blazesym.

12313 samples (6.66%) Comms: schbench
available_idle_cpu @ linux/kernel/sched/core.c:7437:7
idle_cpu @ linux/kernel/sched/core.c:7415:5 [inlined]
select_idle_core @ linux/kernel/sched/fair.c:7301:6
select_task_rq_fair @ linux/kernel/sched/fair.c:8219:13
select_idle_sibling @ linux/kernel/sched/fair.c:7618:6 [inlined]
select_idle_cpu @ linux/kernel/sched/fair.c:7420:24 [inlined]
try_to_wake_up @ linux/kernel/sched/core.c:4363:9
select_task_rq @ linux/kernel/sched/core.c:3637:9 [inlined]
wake_up_q @ linux/kernel/sched/core.c:1030:3
put_task_struct @ linux/include/linux/sched/task.h:127:7 [inlined]
futex_wake @ linux/kernel/futex/waitwake.c:200:9
do_futex @ linux/kernel/futex/syscalls.c:131:1
__x64_sys_futex @ linux/kernel/futex/syscalls.c:160:1
__se_sys_futex @ linux/kernel/futex/syscalls.c:160:1 [inlined]
__do_sys_futex @ linux/kernel/futex/syscalls.c:179:9 [inlined]
do_syscall_64 @ linux/arch/x86/entry/common.c:83:7
do_syscall_x64 @ linux/arch/x86/entry/common.c:52:12 [inlined]
entry_SYSCALL_64_after_hwframe

Call frequency count of available_idle_cpu() (10 seconds of sampling,
system wide)

@counts: 16273429

v6.15-rc5 (sha 0d8d44db295c)

perf from CPU #2

    17.53%  schbench  [kernel.kallsyms] [k] available_idle_cpu
     7.11%  schbench  [kernel.kallsyms] [k] llist_add_batch
     5.75%  schbench  [kernel.kallsyms] [k] futex_wake
     4.98%  schbench  [kernel.kallsyms] [k] try_to_wake_up
     4.89%  schbench  [kernel.kallsyms] [k] futex_wake_mark
     4.51%  schbench  schbench          [.] fpost
     4.48%  schbench  [kernel.kallsyms] [k] select_task_rq_fair
     4.28%  schbench  [kernel.kallsyms] [k] _find_next_bit
     4.21%  schbench  [kernel.kallsyms] [k] _raw_spin_lock
     3.98%  schbench  schbench          [.] xlist_wake_all
     3.20%  schbench  [kernel.kallsyms] [k] migrate_task_rq_fair
     2.95%  schbench  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
     2.90%  schbench  [kernel.kallsyms] [k] select_idle_core.constprop.0
     2.83%  schbench  [kernel.kallsyms] [k] __futex_unqueue
     2.78%  schbench  [kernel.kallsyms] [k] set_task_cpu
     2.70%  schbench  [kernel.kallsyms] [k] clear_bhb_loop
     2.14%  schbench  [kernel.kallsyms] [k] remove_entity_load_avg
     2.10%  schbench  [kernel.kallsyms] [k] ttwu_queue_wakelist
     2.09%  schbench  [kernel.kallsyms] [k] call_function_single_prep_ipi
     1.34%  schbench  libc.so.6         [.] syscall

17128 samples (9.06%) Comms: schbench
available_idle_cpu @ linux/kernel/sched/syscalls.c:228:7
idle_cpu @ linux/kernel/sched/syscalls.c:206:5 [inlined]
select_idle_core @ linux/kernel/sched/fair.c:7621:6
select_task_rq_fair @ linux/kernel/sched/fair.c:8637:13
select_idle_sibling @ linux/kernel/sched/fair.c:7938:6 [inlined]
select_idle_cpu @ linux/kernel/sched/fair.c:7740:24 [inlined]
try_to_wake_up @ linux/kernel/sched/core.c:4313:9
select_task_rq @ linux/kernel/sched/core.c:3583:9 [inlined]
wake_up_q @ linux/kernel/sched/core.c:1081:3
put_task_struct @ linux/./include/linux/sched/task.h:134:7 [inlined]
futex_wake @ linux/kernel/futex/waitwake.c:200:9
do_futex @ linux/kernel/futex/syscalls.c:131:1
__x64_sys_futex @ linux/kernel/futex/syscalls.c:160:1
__se_sys_futex @ linux/kernel/futex/syscalls.c:160:1 [inlined]
__do_sys_futex @ linux/kernel/futex/syscalls.c:179:9 [inlined]
do_syscall_64 @ linux/arch/x86/entry/syscall_64.c:94:7
do_syscall_x64 @ linux/arch/x86/entry/syscall_64.c:63:12 [inlined]
entry_SYSCALL_64_after_hwframe

Call frequency count of available_idle_cpu(), system-wide, 10 seconds:

@counts: 55867130
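Dividing out the two @counts figures (same numbers as above, nothing
new assumed):

```python
calls_v6_9  = 16273429   # available_idle_cpu() calls in 10s, v6.9
calls_v6_15 = 55867130   # same measurement on v6.15-rc5

ratio = calls_v6_15 / calls_v6_9
print(f"{ratio:.1f}x more calls, "
      f"~{calls_v6_15 / 10 / 1e6:.1f}M calls/sec on v6.15")
# → 3.4x more calls, ~5.6M calls/sec on v6.15
```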

-chris


