Message-ID: <37ed5ccf-7419-4311-ab53-b47fc93febae@meta.com>
Date: Tue, 24 Jun 2025 17:59:21 -0400
From: Chris Mason <clm@...a.com>
To: Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org,
Ingo Molnar <mingo@...nel.org>
Subject: [ANNOUNCE] rsched -- process scheduler statistics
Hi everyone,
I've been wanting to write a tool that gives a broad overview of process
scheduler metrics, and I used our recent regression hunt as an excuse
to work something up.
https://github.com/masoncl/rsched
rsched is meant to sit next to the running workload, kind of like vmstat or
top. The goal is just to be able to quickly decide whether I should blame the
scheduler, and maybe to get some ideas about which part to blame. It doesn't
provide full details the way systing or wprof do; both are much better
for in-depth analysis.
rsched is written in Rust and uses BPF to watch a few scheduler tracepoints,
as well as some performance counters. I'm obviously still learning Rust, so
don't be shy with the pull requests if you see chances to improve the code.
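To give a rough idea of the mechanics: the BPF side timestamps sched_wakeup,
and the scheduling delay is how long the task then sits on the runqueue
before sched_switch actually runs it. Here's a minimal sketch of that
bookkeeping (names like DelayTracker are made up for illustration; this is
not the actual rsched code, which does the per-event work in BPF):

use std::collections::HashMap;

// Timestamp tasks at sched_wakeup, compute the runqueue delay when
// sched_switch runs them, and keep per-pid samples for percentiles.
#[derive(Default)]
struct DelayTracker {
    wake_ts_ns: HashMap<u32, u64>,      // pid -> last wakeup timestamp
    samples_us: HashMap<u32, Vec<u64>>, // pid -> observed delays
}

impl DelayTracker {
    fn on_wakeup(&mut self, pid: u32, ts_ns: u64) {
        self.wake_ts_ns.insert(pid, ts_ns);
    }

    fn on_switch_in(&mut self, pid: u32, ts_ns: u64) {
        if let Some(wake) = self.wake_ts_ns.remove(&pid) {
            let delay_us = ts_ns.saturating_sub(wake) / 1_000;
            self.samples_us.entry(pid).or_default().push(delay_us);
        }
    }

    // Nearest-rank percentile (p in 0..=100) over one pid's samples.
    fn percentile(&self, pid: u32, p: f64) -> Option<u64> {
        let mut v = self.samples_us.get(&pid)?.clone();
        if v.is_empty() {
            return None;
        }
        v.sort_unstable();
        let rank = ((p / 100.0) * v.len() as f64).ceil() as usize;
        Some(v[rank.saturating_sub(1).min(v.len() - 1)])
    }
}

fn main() {
    let mut t = DelayTracker::default();
    t.on_wakeup(35886, 1_000_000);
    t.on_switch_in(35886, 1_005_000); // 5us on the runqueue
    println!("p99 = {:?}us", t.percentile(35886, 99.0));
}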
Any suggestions on how to make the output more correct, compact, and/or usable
are always welcome. To give you an idea of the details, I'll give the
full output from a fast and a slow schbench run. The README.md on GitHub has
the command line options, etc.
Command line:
rsched -m -i 10 --schedstat -c schbench -C
This filters for just schbench procs and turns on the CPU perf counters.
v6.16-rc2 (the slow run)
Starting rsched - Runqueue scheduling delay tracker
Options active:
- Command pattern: schbench
- Not collapsing by command name
- Runtime limit: 12 seconds
Press Ctrl-C to stop...
Sched_waking tracepoints disabled for better performance
CPU performance counters enabled
=== Per-Process Scheduling Delays (microseconds) ===
Very Low (<10μs) (1028 entries):
PID COMMAND p50 p90 p99 COUNT
35886 schbench-worker 5 9 9 35133
35372 schbench-worker 5 9 9 35816
35183 schbench-worker 5 9 9 35491
35835 schbench-worker 5 9 9 35486
35943 schbench-worker 5 9 9 41423
35784 schbench-worker 5 9 9 36232
35572 schbench-worker 5 9 9 35537
35552 schbench-worker 5 9 9 41604
35308 schbench-worker 5 9 9 36427
35401 schbench-worker 5 9 9 32346
... and 1018 more
Low (10-100μs) (1 entries):
PID COMMAND p50 p90 p99 COUNT
35025 schbench-msg 9 47 110 171
=== Per-Process Time Slice Statistics (microseconds) ===
Time slice statistics grouped by preemption rate:
PID COMMAND INVOL_COUNT VOLUNTARY(p50/p90) PREEMPTED(p50/p90) PREEMPT%
35027 schbench-msg 101 - 95889/123483 100.0 %
35028 schbench-msg 100 - 96579/124173 100.0 %
35026 schbench-msg 99 - 96579/124173 100.0 %
35025 schbench-msg 170 - 50365/112503 100.0 %
35942 schbench-worker 4 5/9 15/17 0.0 %
35937 schbench-worker 3 5/9 10/15 0.0 %
35661 schbench-worker 3 5/9 13/16 0.0 %
35251 schbench-worker 3 5/9 10/20 0.0 %
35345 schbench-worker 3 5/9 13/16 0.0 %
35656 schbench-worker 3 5/9 13/16 0.0 %
... and 1019 more
=== Per-Process Sleep Duration Statistics (microseconds) ===
(Time spent sleeping between sched_switch and sched_wakeup)
Medium (100μs-1ms) (1024 entries):
PID COMMAND p50 p90 p99 COUNT
35173 schbench-worker 599 939 1015 26744
35910 schbench-worker 566 932 1014 29204
35148 schbench-worker 562 931 1014 28601
35782 schbench-worker 556 930 1014 29246
35985 schbench-worker 549 929 1014 28182
35877 schbench-worker 553 929 1014 29455
35524 schbench-worker 548 928 1014 29650
36010 schbench-worker 523 923 1013 30265
35343 schbench-worker 508 920 1013 31055
35118 schbench-worker 509 920 1013 30997
... and 1014 more
Very High (>10ms) (1 entries):
PID COMMAND p50 p90 p99 COUNT
35024 schbench 757304 990321 990321 9
=== Per-Process Runqueue Depth Statistics (nr_running at wakeup) ===
Processes by runqueue depth at wakeup:
PID COMMAND p50 p90 p99 COUNT
35886 schbench-worker 1 1 1 70704
35372 schbench-worker 1 1 1 72022
35183 schbench-worker 1 1 1 71410
35835 schbench-worker 1 1 1 71392
35943 schbench-worker 1 1 1 83351
35784 schbench-worker 1 1 1 72810
35572 schbench-worker 1 1 1 71386
35552 schbench-worker 1 1 1 83733
35308 schbench-worker 1 1 1 73241
35401 schbench-worker 1 1 1 65033
... and 1019 more
=== System-wide Schedstat Metrics (deltas) ===
alb_count 0 | lb_hot_gained_idle 0 | lb_nobusyg_idle 636568
alb_failed 0 | lb_hot_gained_newly_idle 0 | lb_nobusyg_newly_idle 3922226
alb_pushed 0 | lb_hot_gained_not_idle 0 | lb_nobusyg_not_idle 650
lb_balance_idle 676344 | lb_imbalance_load_idle 0 | lb_nobusyq_idle 171
lb_balance_newly_idle 3944496 | lb_imbalance_load_newly_idle 0 | lb_nobusyq_newly_idle 14806
lb_balance_not_idle 698 | lb_imbalance_load_not_idle 0 | lb_nobusyq_not_idle 0
lb_count_idle 678855 | lb_imbalance_misfit_idle 0 | sbe_balanced 0
lb_count_newly_idle 4845653 | lb_imbalance_misfit_newly_idle 0 | sbe_cnt 0
lb_count_not_idle 698 | lb_imbalance_misfit_not_idle 0 | sbe_pushed 0
lb_failed_idle 2399 | lb_imbalance_task_idle 2890 | sbf_balanced 0
lb_failed_newly_idle 896235 | lb_imbalance_task_newly_idle 2113470 | sbf_cnt 0
lb_failed_not_idle 0 | lb_imbalance_task_not_idle 0 | sbf_pushed 0
lb_gained_idle 112 | lb_imbalance_util_idle 0 | ttwu_move_affine 10159
lb_gained_newly_idle 4926 | lb_imbalance_util_newly_idle 0 | ttwu_move_balance 0
lb_gained_not_idle 0 | lb_imbalance_util_not_idle 0 | ttwu_wake_remote 37738806
=== CPU Field Totals (deltas) ===
yld_count 5685 | sched_count 0 | sched_goidle 75529957
ttwu_count 37721164 | ttwu_local 37785024 | rq_cpu_time 46219
rq_run_delay usec 287 | rq_pcount 673321838 |
=== Per-CPU Scheduling Delays (microseconds) ===
Very Low (<10μs): CPUs 1-251
Aggregate: p50=5 p90=9 p99=9 count=37575995
Low (10-100μs): CPUs 0
Aggregate: p50=6 p90=30 p99=95 count=350
=== Per-CPU Idle Duration (microseconds) ===
Very Short (<100μs): CPUs 4-15,130-141
Aggregate: p50=16 p90=50 p99=102 count=9636083
Short (100μs-1ms): CPUs 16-129,142-251
Aggregate: p50=50 p90=175 p99=396 count=28157410
=== CPU Performance Metrics ===
Global: User 5301.3M cycles/sec (IPC: 0.23), Kernel 23514.0M cycles/sec (IPC: 1.20)
CPU Performance by Usage Group (cycles are per timeslice):
Medium (100M-1G cycles p99) (4 entries):
PID COMMAND USER CYC(p50/p99) KERN CYC(p50/p99) U-IPC K-IPC
35026 schbench-msg 24.8M/33.4M 191.3M/266.9M 0.10 0.51
35027 schbench-msg 24.7M/33.4M 188.8M/265.3M 0.08 0.44
35028 schbench-msg 24.7M/33.4M 188.8M/265.3M 0.08 0.43
35025 schbench-msg 17.5M/33.2M 91.5M/263.2M 0.09 0.45
Very Low (<10M cycles p99) (1025 entries):
PID COMMAND USER CYC(p50/p99) KERN CYC(p50/p99) U-IPC K-IPC
35563 schbench-worker 1.1K/2.0K 6.4K/82.5K 0.43 1.23
35024 schbench 49.2K/65.5K 12.3K/16.4K 2.21 3.85
36040 schbench-worker 965/2.0K 6.2K/76.5K 0.45 1.31
35069 schbench-worker 1.4K/3.2K 6.3K/65.5K 0.33 1.32
35515 schbench-worker 1.3K/2.8K 6.0K/62.3K 0.39 1.42
35783 schbench-worker 1.6K/4.0K 6.4K/60.2K 0.23 1.33
36035 schbench-worker 1.8K/4.0K 6.4K/60.0K 0.22 1.40
35483 schbench-worker 1.8K/4.0K 6.4K/59.9K 0.22 1.41
35435 schbench-worker 2.3K/4.1K 6.4K/59.7K 0.22 1.38
35353 schbench-worker 1.6K/3.4K 6.4K/60.3K 0.27 1.39
... and 1015 more
-----------------------------------------------------------------------------------
v6.16-rc2 + a fix (the fast run)
The big change you can see here is that the p99 of kernel cycles is
dramatically lower. The schedstats for newly idle balance are also
much lower.
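To put rough numbers on the comparison, pulling from the two outputs:
  lb_count_newly_idle: 4845653 -> 41461 (~117x fewer newly idle balances)
  worker KERN CYC p99: ~60K-83K -> ~14K-15K cycles per timeslice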
Starting rsched - Runqueue scheduling delay tracker
Options active:
- Command pattern: schbench
- Not collapsing by command name
- Runtime limit: 12 seconds
Press Ctrl-C to stop...
Sched_waking tracepoints disabled for better performance
CPU performance counters enabled
=== Per-Process Scheduling Delays (microseconds) ===
Very Low (<10μs) (1028 entries):
PID COMMAND p50 p90 p99 COUNT
3968481 schbench-worker 5 9 9 45999
3968652 schbench-worker 5 9 9 45966
3969197 schbench-worker 5 9 9 43535
3968362 schbench-worker 5 9 9 39690
3968832 schbench-worker 5 9 9 47972
3968660 schbench-worker 5 9 9 48020
3968393 schbench-worker 5 9 9 48051
3968468 schbench-worker 5 9 9 48038
3968967 schbench-worker 5 9 9 40810
3968635 schbench-worker 5 9 9 45979
... and 1018 more
Low (10-100μs) (1 entries):
PID COMMAND p50 p90 p99 COUNT
3968289 schbench-msg 5 18 100 120
=== Per-Process Time Slice Statistics (microseconds) ===
Time slice statistics grouped by preemption rate:
PID COMMAND INVOL_COUNT VOLUNTARY(p50/p90) PREEMPTED(p50/p90) PREEMPT%
3968292 schbench-msg 100 - 96938/124245 100.0 %
3968290 schbench-msg 100 - 96212/124100 100.0 %
3968291 schbench-msg 99 - 96938/124245 100.0 %
3968289 schbench-msg 119 - 86388/122135 100.0 %
3968426 schbench-worker 5 5/9 4/8 0.0 %
3968902 schbench-worker 4 5/9 6/10 0.0 %
3969157 schbench-worker 3 5/9 20/25 0.0 %
3969252 schbench-worker 3 5/9 10/15 0.0 %
3968776 schbench-worker 3 5/9 10/15 0.0 %
3968710 schbench-worker 3 5/9 10/15 0.0 %
... and 1019 more
=== Per-Process Sleep Duration Statistics (microseconds) ===
(Time spent sleeping between sched_switch and sched_wakeup)
Medium (100μs-1ms) (1024 entries):
PID COMMAND p50 p90 p99 COUNT
3968983 schbench-worker 438 487 714 31072
3968845 schbench-worker 421 486 708 32322
3968858 schbench-worker 421 482 633 34969
3969133 schbench-worker 414 480 600 36085
3968351 schbench-worker 314 479 577 36622
3968372 schbench-worker 414 475 511 37911
3968853 schbench-worker 398 475 511 38285
3968778 schbench-worker 329 471 510 39345
3969079 schbench-worker 313 470 507 38772
3969157 schbench-worker 387 469 508 39111
... and 1014 more
Very High (>10ms) (1 entries):
PID COMMAND p50 p90 p99 COUNT
3968288 schbench 757304 990321 990321 9
=== Per-Process Runqueue Depth Statistics (nr_running at wakeup) ===
Processes by runqueue depth at wakeup:
PID COMMAND p50 p90 p99 COUNT
3968288 schbench 1 2 3 20
3968481 schbench-worker 1 1 2 92460
3968652 schbench-worker 1 1 1 92512
3969197 schbench-worker 1 1 2 87580
3968362 schbench-worker 1 1 1 79681
3968832 schbench-worker 1 1 1 96353
3968660 schbench-worker 1 1 2 96443
3968393 schbench-worker 1 1 1 96642
3968468 schbench-worker 1 1 2 96595
3968967 schbench-worker 1 1 1 81998
... and 1019 more
=== System-wide Schedstat Metrics (deltas) ===
alb_count 2 | lb_hot_gained_idle 3 | lb_nobusyg_idle 594665
alb_failed 0 | lb_hot_gained_newly_idle 5 | lb_nobusyg_newly_idle 27247
alb_pushed 2 | lb_hot_gained_not_idle 0 | lb_nobusyg_not_idle 573
lb_balance_idle 618752 | lb_imbalance_load_idle 0 | lb_nobusyq_idle 338
lb_balance_newly_idle 27389 | lb_imbalance_load_newly_idle 0 | lb_nobusyq_newly_idle 62
lb_balance_not_idle 611 | lb_imbalance_load_not_idle 0 | lb_nobusyq_not_idle 0
lb_count_idle 620785 | lb_imbalance_misfit_idle 0 | sbe_balanced 0
lb_count_newly_idle 41461 | lb_imbalance_misfit_newly_idle 0 | sbe_cnt 0
lb_count_not_idle 611 | lb_imbalance_misfit_not_idle 0 | sbe_pushed 0
lb_failed_idle 1611 | lb_imbalance_task_idle 2710 | sbf_balanced 0
lb_failed_newly_idle 12835 | lb_imbalance_task_newly_idle 79009 | sbf_cnt 0
lb_failed_not_idle 0 | lb_imbalance_task_not_idle 0 | sbf_pushed 0
lb_gained_idle 429 | lb_imbalance_util_idle 0 | ttwu_move_affine 19124
lb_gained_newly_idle 1270 | lb_imbalance_util_newly_idle 0 | ttwu_move_balance 0
lb_gained_not_idle 0 | lb_imbalance_util_not_idle 0 | ttwu_wake_remote 46817555
=== CPU Field Totals (deltas) ===
yld_count 3565 | sched_count 0 | sched_goidle 93280935
ttwu_count 46387216 | ttwu_local 46865150 | rq_cpu_time 47597
rq_run_delay usec 80 | rq_pcount 2681345616 |
=== Per-CPU Scheduling Delays (microseconds) ===
Very Low (<10μs): CPUs 0-251
Aggregate: p50=5 p90=9 p99=9 count=46639878
=== Per-CPU Idle Duration (microseconds) ===
Very Short (<100μs): CPUs 4-15,126-141
Aggregate: p50=6 p90=16 p99=39 count=26621635
Short (100μs-1ms): CPUs 16-125,142-251
Aggregate: p50=73 p90=263 p99=431 count=19862521
=== CPU Performance Metrics ===
Global: User 5636.6M cycles/sec (IPC: 0.27), Kernel 23515.6M cycles/sec (IPC: 1.40)
CPU Performance by Usage Group (cycles are per timeslice):
Medium (100M-1G cycles p99) (4 entries):
PID COMMAND USER CYC(p50/p99) KERN CYC(p50/p99) U-IPC K-IPC
3968291 schbench-msg 24.9M/33.4M 190.4M/266.9M 0.13 0.73
3968290 schbench-msg 24.8M/33.4M 189.5M/266.9M 0.12 0.70
3968292 schbench-msg 24.7M/33.2M 188.8M/265.3M 0.13 0.74
3968289 schbench-msg 22.7M/33.2M 161.1M/264.9M 0.12 0.66
Very Low (<10M cycles p99) (1025 entries):
PID COMMAND USER CYC(p50/p99) KERN CYC(p50/p99) U-IPC K-IPC
3968288 schbench 39.3K/65.5K 13.7K/32.8K 4.13 4.15
3968914 schbench-worker 1.6K/3.9K 6.2K/14.9K 0.23 1.47
3968997 schbench-worker 1.7K/4.0K 6.2K/14.6K 0.22 1.45
3969198 schbench-worker 1.7K/4.0K 6.2K/14.6K 0.22 1.44
3968393 schbench-worker 2.0K/4.1K 6.2K/14.5K 0.24 1.49
3968372 schbench-worker 1.6K/3.4K 6.3K/15.2K 0.27 1.46
3968805 schbench-worker 1.5K/3.9K 6.2K/14.5K 0.25 1.50
3969261 schbench-worker 1.6K/4.0K 6.1K/14.3K 0.30 1.55
3968881 schbench-worker 1.8K/4.0K 6.2K/14.2K 0.26 1.51
3968511 schbench-worker 1.5K/4.0K 6.1K/14.1K 0.31 1.54
... and 1015 more