Message-ID: <CAE4VaGBZzpkfkBXbiuED8Pv-UnjQ5xSk+t=dAdwSjv=u7-b8pw@mail.gmail.com>
Date: Sat, 16 Aug 2025 18:38:54 +0200
From: Jirka Hladky <jhladky@...hat.com>
To: linux-kernel <linux-kernel@...r.kernel.org>, Thomas Gleixner <tglx@...utronix.de>,
john.stultz@...aro.org, anna-maria@...utronix.de
Cc: Philip Auld <pauld@...hat.com>, Prarit Bhargava <prarit@...hat.com>,
Luis Goncalves <lgoncalv@...hat.com>, Miroslav Lichvar <mlichvar@...hat.com>, Luke Yang <luyang@...hat.com>,
Jan Jurca <jjurca@...hat.com>, Joe Mario <jmario@...hat.com>
Subject: [REGRESSION] 76% performance loss in timer workloads caused by
513793bc6ab3 "posix-timers: Make signal delivery consistent"
Hello,
I'm reporting a performance regression in kernel 6.13 that causes a
76% performance loss in timer-heavy workloads. Through kernel
bisection, we have identified the root cause as commit
513793bc6ab331b947111e8efaf8fcef33fb83e5.
Summary:
Regression: 76% performance drop in applications using nanosleep()/POSIX timers
* 4.3x increase in timer overruns and voluntary context switches
* Dramatic drop in timer completion rate (76% -> 20%)
* Over 99% of timers fail to expire when timer migration is disabled in 6.13
Root Cause: commit 513793bc6ab3 "posix-timers: Make signal delivery consistent"
Impact: timer signal delivery mechanism is broken
Reproducer: stress-ng --timer workload on any system:
/usr/bin/time -v ./stress-ng --timer 1 -t 23 --verbose --metrics-brief \
  --yaml /dev/stdout 2>&1 | tee $(uname -r)_timer.log
grep -Poh 'bogo-ops-per-second-real-time: \K[0-9.]+' $(uname -r)_timer.log
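For reference, the core of the workload is just a periodic POSIX timer
delivering a signal to the process. Below is a minimal sketch of that
pattern; it is not the stress-ng source, and the 100 us period, SIGRTMIN,
and 10 s runtime are arbitrary illustration values. timer_getoverrun() in
the handler counts expirations coalesced into a single queued signal,
which is what the "timer overruns" figures below report.

/*
 * Minimal sketch of the timer pattern being stressed -- NOT the actual
 * stress-ng source. A periodic POSIX timer delivers SIGRTMIN to the
 * process; the handler counts expirations and overruns. The 100 us
 * period and 10 s runtime are arbitrary values for illustration.
 * Build: gcc -O2 -o timer-sketch timer-sketch.c  (add -lrt on older glibc)
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static timer_t timerid;
static volatile sig_atomic_t expirations;
static volatile long overruns;

static void handler(int sig)
{
    (void)sig;
    expirations++;
    /* timer_getoverrun() reports expirations that were coalesced into
     * one queued signal -- the "timer overruns" figure in the logs. */
    int o = timer_getoverrun(timerid);
    if (o > 0)
        overruns += o;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGRTMIN, &sa, NULL);

    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGRTMIN;
    if (timer_create(CLOCK_MONOTONIC, &sev, &timerid) != 0) {
        perror("timer_create");
        return 1;
    }

    /* Arm a periodic timer: first expiry and period both 100 us. */
    struct itimerspec its;
    memset(&its, 0, sizeof(its));
    its.it_value.tv_nsec = 100000;
    its.it_interval.tv_nsec = 100000;
    timer_settime(timerid, 0, &its, NULL);

    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        pause();                /* each timer signal ends the pause() */
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while (now.tv_sec - start.tv_sec < 10);

    printf("expirations=%ld overruns=%ld\n", (long)expirations, overruns);
    return 0;
}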
6.12 kernel:
User time (seconds): 9.71
Percent of CPU this job got: 73%
stress-ng: metrc: [39351] stressor  bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
stress-ng: metrc: [39351]                     (secs)     (secs)    (secs)    (real time)  (usr+sys time)
stress-ng: metrc: [39351] timer     11253022  23.01      9.71      7.01      489125.18    673113.26
timer: 3655093 timer overruns (instance 0)
Voluntary context switches: 720747
6.13 kernel:
User time (seconds): 4.02
Percent of CPU this job got: 28%
stress-ng: metrc: [5416] stressor  bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
stress-ng: metrc: [5416]                     (secs)     (secs)    (secs)    (real time)  (usr+sys time)
stress-ng: metrc: [5416] timer     3103864   23.00      4.02      2.08      134950.34    509002.47
timer: 15578896 timer overruns (instance 0)
Voluntary context switches: 3100815
CPU utilization dropped from 73% to 28%, while timer overruns
(3655093 -> 15578896) and voluntary context switches
(720747 -> 3100815) both increased roughly 4.3x.
It's interesting to examine hrtimer events with perf record:
perf sched record -e timer:hrtimer_start -e timer:hrtimer_expire_entry \
  -e timer:hrtimer_expire_exit --output="hrtimer-$(uname -r).perf" \
  ./stress-ng --timer 1 -t 23 --metrics-brief --yaml /dev/stdout
perf sched script -i "hrtimer-$(uname -r).perf" > "hrtimer-$(uname -r).txt"
grep -c hrtimer_start hrtimer*txt
6.12: 10898132
6.13: 17105314
grep -c hrtimer_expire_entry hrtimer-6.12.0-33.el10.x86_64.txt \
  hrtimer-6.13.0-0.rc2.22.eln144.x86_64.txt
6.12: 8358469
6.13: 3476757
The number of timers started increased significantly in 6.13, but most
of them never expire: the completion rate (hrtimer_expire_entry count
divided by hrtimer_start count) dropped from 76% to 20%.
The next test was to disable timer migration on the 6.13 kernel:
echo 0 > /proc/sys/kernel/timer_migration
6.13, /proc/sys/kernel/timer_migration set to zero
User time (seconds): 10.42
Percent of CPU this job got: 59%
stress-ng: metrc: [5927] stressor  bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
stress-ng: metrc: [5927]                     (secs)     (secs)    (secs)    (real time)  (usr+sys time)
stress-ng: metrc: [5927] timer     7004133   23.00      10.41     3.11      304526.98    518257.73
timer: 7102554 timer overruns (instance 0)
Voluntary context switches: 7009365
Results improve, but there is still a roughly 38% performance drop
compared to 6.12 (304526 versus 489125 bogo ops/s).
I have also tried CPU pinning, but it had almost no effect:
6.13, /proc/sys/kernel/timer_migration set to zero, process pinned to one CPU:
$ taskset -c 10 /usr/bin/time -v ./stress-ng --timer 1 -t 23 --verbose \
  --metrics-brief 2>&1 | tee $(uname -r)_timer_timer_migration_off_pinned.log
User time (seconds): 10.34
Percent of CPU this job got: 61%
stress-ng: metrc: [6230] stressor  bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
stress-ng: metrc: [6230]                     (secs)     (secs)    (secs)    (real time)  (usr+sys time)
stress-ng: metrc: [6230] timer     7129797   23.00      10.33     3.53      309991.17    514479.47
timer: 7152958 timer overruns (instance 0)
Voluntary context switches: 7128460
Using perf record to trace hrtimer events reveals the following:
Kernel        hrtimer_start  hrtimer_expire_entry  Completion Rate
6.12             10,898,132             8,358,469            76.7%
6.13             17,105,314             3,476,757            20.3%
6.13+mig=0       17,067,784                30,841            0.18%
Over 99% of timers fail to expire properly in 6.13 with timer
migration disabled, indicating broken timer signal delivery.
We have collected results on a dual-socket Intel Emerald Rapids system
with 256 CPUs, but we observed the same problem on other systems as
well. Intel and AMD x86_64, aarch64, and ppc64le are all affected. The
regression is more pronounced on systems with higher CPU counts.
I have additional performance traces, perf data, and test
configurations available if needed for debugging. I'm happy to test
patches or provide more detailed analysis.
We have also tested kernel 6.16, and it behaves the same as kernel 6.13.
Thank you!
Jirka