Message-ID: <CAE4VaGBZzpkfkBXbiuED8Pv-UnjQ5xSk+t=dAdwSjv=u7-b8pw@mail.gmail.com>
Date: Sat, 16 Aug 2025 18:38:54 +0200
From: Jirka Hladky <jhladky@...hat.com>
To: linux-kernel <linux-kernel@...r.kernel.org>, Thomas Gleixner <tglx@...utronix.de>,
john.stultz@...aro.org, anna-maria@...utronix.de
Cc: Philip Auld <pauld@...hat.com>, Prarit Bhargava <prarit@...hat.com>,
Luis Goncalves <lgoncalv@...hat.com>, Miroslav Lichvar <mlichvar@...hat.com>, Luke Yang <luyang@...hat.com>,
Jan Jurca <jjurca@...hat.com>, Joe Mario <jmario@...hat.com>
Subject: [REGRESSION] 76% performance loss in timer workloads caused by
513793bc6ab3 "posix-timers: Make signal delivery consistent"
Hello,
I'm reporting a performance regression in kernel 6.13 that causes a
76% performance loss in timer-heavy workloads. Through kernel
bisection, we have identified the root cause as commit
513793bc6ab331b947111e8efaf8fcef33fb83e5.
Summary:
Regression: 76% performance drop in applications using nanosleep()/POSIX timers
* 4.3x increase in timer overruns and voluntary context switches
* Dramatic drop in timer completion rate (76% -> 20%)
* Over 99% of timers fail to expire when timer migration is disabled in 6.13
Root Cause: commit 513793bc6ab3 "posix-timers: Make signal delivery consistent"
Impact: timer signal delivery mechanism is broken
Reproducer: stress-ng --timer workload on any system:
/usr/bin/time -v ./stress-ng --timer 1 -t 23 --verbose --metrics-brief \
  --yaml /dev/stdout 2>&1 | tee $(uname -r)_timer.log
grep -Poh 'bogo-ops-per-second-real-time: \K[0-9.]+' $(uname -r)_timer.log
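For reference, the core of the workload is just a periodic POSIX timer
delivering a signal to the process. Below is a minimal sketch of that
pattern; it is not the stress-ng source, and the 100 us period, SIGRTMIN,
and 10 s runtime are arbitrary illustration values. timer_getoverrun() in
the handler counts expirations coalesced into a single queued signal,
which is what the "timer overruns" figures below report.

/*
 * Minimal sketch of the timer pattern being stressed -- NOT the actual
 * stress-ng source. A periodic POSIX timer delivers SIGRTMIN to the
 * process; the handler counts expirations and overruns. The 100 us
 * period and 10 s runtime are arbitrary values for illustration.
 * Build: gcc -O2 -o timer-sketch timer-sketch.c  (add -lrt on older glibc)
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static timer_t timerid;
static volatile sig_atomic_t expirations;
static volatile long overruns;

static void handler(int sig)
{
    (void)sig;
    expirations++;
    /* timer_getoverrun() reports expirations that were coalesced into
     * one queued signal -- the "timer overruns" figure in the logs. */
    int o = timer_getoverrun(timerid);
    if (o > 0)
        overruns += o;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGRTMIN, &sa, NULL);

    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGRTMIN;
    if (timer_create(CLOCK_MONOTONIC, &sev, &timerid) != 0) {
        perror("timer_create");
        return 1;
    }

    /* Arm a periodic timer: first expiry and period both 100 us. */
    struct itimerspec its;
    memset(&its, 0, sizeof(its));
    its.it_value.tv_nsec = 100000;
    its.it_interval.tv_nsec = 100000;
    timer_settime(timerid, 0, &its, NULL);

    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        pause();                /* each timer signal ends the pause() */
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while (now.tv_sec - start.tv_sec < 10);

    printf("expirations=%ld overruns=%ld\n", (long)expirations, overruns);
    return 0;
}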
6.12 kernel:
User time (seconds): 9.71
Percent of CPU this job got: 73%
stress-ng: metrc: [39351] stressor  bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
stress-ng: metrc: [39351]                     (secs)     (secs)    (secs)    (real time)  (usr+sys time)
stress-ng: metrc: [39351] timer     11253022  23.01      9.71      7.01      489125.18    673113.26
timer: 3655093 timer overruns (instance 0)
Voluntary context switches: 720747
6.13 kernel:
User time (seconds): 4.02
Percent of CPU this job got: 28%
stress-ng: metrc: [5416] stressor  bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
stress-ng: metrc: [5416]                     (secs)     (secs)    (secs)    (real time)  (usr+sys time)
stress-ng: metrc: [5416] timer     3103864   23.00      4.02      2.08      134950.34    509002.47
timer: 15578896 timer overruns (instance 0)
Voluntary context switches: 3100815
CPU utilization dropped from 73% to 28%, while timer overruns
(3655093 -> 15578896) and voluntary context switches
(720747 -> 3100815) both increased roughly 4.3x.
It's interesting to examine hrtimer events with perf record:
perf sched record -e timer:hrtimer_start -e timer:hrtimer_expire_entry \
  -e timer:hrtimer_expire_exit --output="hrtimer-$(uname -r).perf" \
  ./stress-ng --timer 1 -t 23 --metrics-brief --yaml /dev/stdout
perf sched script -i "hrtimer-$(uname -r).perf" > "hrtimer-$(uname -r).txt"
grep -c hrtimer_start hrtimer*txt
6.12: 10898132
6.13: 17105314
grep -c hrtimer_expire_entry hrtimer-6.12.0-33.el10.x86_64.txt \
  hrtimer-6.13.0-0.rc2.22.eln144.x86_64.txt
6.12: 8358469
6.13: 3476757
The number of timers started increased significantly in 6.13, but most
of them never expire: the completion rate (hrtimer_expire_entry count
divided by hrtimer_start count) dropped from 76% to 20%.
The next test was to disable timer migration on the 6.13 kernel:
echo 0 > /proc/sys/kernel/timer_migration
6.13, /proc/sys/kernel/timer_migration set to zero
User time (seconds): 10.42
Percent of CPU this job got: 59%
stress-ng: metrc: [5927] stressor  bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
stress-ng: metrc: [5927]                     (secs)     (secs)    (secs)    (real time)  (usr+sys time)
stress-ng: metrc: [5927] timer     7004133   23.00      10.41     3.11      304526.98    518257.73
timer: 7102554 timer overruns (instance 0)
Voluntary context switches: 7009365
Results improve, but there is still a roughly 38% performance drop
compared to 6.12 (304526 versus 489125 bogo ops/s).
I have also tried CPU pinning, but it had almost no effect:
6.13, /proc/sys/kernel/timer_migration set to zero, process pinned to one CPU:
$ taskset -c 10 /usr/bin/time -v ./stress-ng --timer 1 -t 23 --verbose \
  --metrics-brief 2>&1 | tee $(uname -r)_timer_timer_migration_off_pinned.log
User time (seconds): 10.34
Percent of CPU this job got: 61%
stress-ng: metrc: [6230] stressor  bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
stress-ng: metrc: [6230]                     (secs)     (secs)    (secs)    (real time)  (usr+sys time)
stress-ng: metrc: [6230] timer     7129797   23.00      10.33     3.53      309991.17    514479.47
timer: 7152958 timer overruns (instance 0)
Voluntary context switches: 7128460
Using perf record to trace hrtimer events reveals the following:
Kernel        hrtimer_start  hrtimer_expire_entry  Completion Rate
6.12             10,898,132             8,358,469            76.7%
6.13             17,105,314             3,476,757            20.3%
6.13+mig=0       17,067,784                30,841            0.18%
Over 99% of timers fail to expire properly in 6.13 with timer
migration disabled, indicating broken timer signal delivery.
We have collected results on a dual-socket Intel Emerald Rapids system
with 256 CPUs, but we observed the same problem on other systems as
well. Intel and AMD x86_64, aarch64, and ppc64le are all affected. The
regression is more pronounced on systems with higher CPU counts.
I have additional performance traces, perf data, and test
configurations available if needed for debugging. I'm happy to test
patches or provide more detailed analysis.
We have also tested kernel 6.16, and it behaves the same as kernel 6.13.
Thank you!
Jirka