Message-ID: <87sehh2gw8.ffs@tglx>
Date: Sun, 24 Aug 2025 11:44:23 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Jirka Hladky <jhladky@...hat.com>, linux-kernel
<linux-kernel@...r.kernel.org>, john.stultz@...aro.org,
anna-maria@...utronix.de
Cc: Philip Auld <pauld@...hat.com>, Prarit Bhargava <prarit@...hat.com>,
Luis Goncalves <lgoncalv@...hat.com>, Miroslav Lichvar
<mlichvar@...hat.com>, Luke Yang <luyang@...hat.com>, Jan Jurca
<jjurca@...hat.com>, Joe Mario <jmario@...hat.com>
Subject: Re: [REGRESSION] 76% performance loss in timer workloads caused by
513793bc6ab3 "posix-timers: Make signal delivery consistent"
On Sat, Aug 16 2025 at 18:38, Jirka Hladky wrote:
> I'm reporting a performance regression in kernel 6.13 that causes a
> 76% performance loss in timer-heavy workloads.
Are you talking about real world workloads or about the stress-ng bogosity?
> Through kernel bisection, we have identified the root cause as commit
> 513793bc6ab331b947111e8efaf8fcef33fb83e5.
>
> Summary
>
> Regression: 76% performance drop in applications using nanosleep()/POSIX timers
> * 4.3x increase in timer overruns and voluntary context switches
> * Dramatic drop in timer completion rate (76% -> 20%)
> * Over 99% of timers fail to expire when timer migration is disabled in 6.13
> Root Cause: commit 513793bc6ab3 "posix-timers: Make signal delivery consistent"
> Impact: Timer signal delivery mechanism broken
> Reproducer: stress-ng --timer workload on any system.
That does:
  arm_timer()
  {
        timer.it_value.tv_sec = ...;
        timer.it_value.tv_nsec = ...;
        timer.it_interval.tv_sec = timer.it_value.tv_sec;
        timer.it_interval.tv_nsec = timer.it_value.tv_nsec;
        timer_settime(....&timer);
  }
and in the signal handler it does:
        ...
        timer_getoverrun();
        arm_timer();
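For readers who want to try the pattern, here is a minimal compilable sketch of it. This is not the stress-ng source. The clock (CLOCK_MONOTONIC), the signal (SIGRTMIN), the 1 ms period and all demo_* names are assumptions made for illustration, since the pseudocode above elides those details.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <time.h>

static timer_t demo_timer;                      /* hypothetical name */
static volatile sig_atomic_t demo_expiries;     /* handler invocations */
static volatile sig_atomic_t demo_overruns;     /* sum of reported overruns */

/* Arm with it_interval == it_value, as in the pseudocode above.
 * The 1 ms period is an assumed value; the original elides it. */
static int arm_timer(void)
{
    struct itimerspec timer;

    memset(&timer, 0, sizeof(timer));
    timer.it_value.tv_sec = 0;
    timer.it_value.tv_nsec = 1000000;
    timer.it_interval.tv_sec = timer.it_value.tv_sec;
    timer.it_interval.tv_nsec = timer.it_value.tv_nsec;
    return timer_settime(demo_timer, 0, &timer, NULL);
}

/* Signal handler: read the overrun count, then re-arm the timer even
 * though the interval already makes it periodic. */
static void demo_handler(int sig)
{
    int over;

    (void)sig;
    over = timer_getoverrun(demo_timer);
    if (over > 0)
        demo_overruns += over;
    demo_expiries++;
    arm_timer();
}

/* Install the handler, create the timer and arm it for the first time. */
static int demo_setup(void)
{
    struct sigaction sa;
    struct sigevent sev;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = demo_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGRTMIN, &sa, NULL) != 0)
        return -1;

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGRTMIN;
    if (timer_create(CLOCK_MONOTONIC, &sev, &demo_timer) != 0)
        return -1;

    return arm_timer();
}
```

Both timer_settime() and timer_getoverrun() are on the POSIX list of async-signal-safe functions, so calling them from the handler is permitted. The point of the sketch is only that every handler invocation cancels and restarts a timer which the kernel has already re-armed.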
So from the kernel POV this means:
   user space starts timer
     arm_timer()
       ....
       hrtimer_start()

   ...
   hrtimer_expire()
     raise_signal()

   signal_delivery()
     if (interval > 0)
#1     hrtimer_start()

   user space signal_handler()
     arm_timer()
       hrtimer_cancel();
#2     clear pending and overrun
       hrtimer_start();
So it's exactly doing what user space asks for.
Older kernels accounted for overruns and pending signals which might
have accumulated between #1 and #2, which is undefined behaviour as user
space can no longer differentiate to which arming the expiry or the
overruns belong.
So clearing it when rearmed is the obviously correct thing to do because
it makes it consistent, no?
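For contrast, a periodic timer can be armed once and left alone, so that the re-arm at #1 is the only one and every overrun belongs to a single arming. The following is a sketch under assumptions, not a statement about what any particular workload should do. The tick_* names, CLOCK_MONOTONIC and SIGRTMIN are chosen for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <time.h>

static timer_t tick_timer;                    /* hypothetical name */
static volatile sig_atomic_t tick_count;      /* delivered expiries */
static volatile sig_atomic_t tick_missed;     /* overruns reported by the kernel */

/* The handler only accounts; it never calls timer_settime(). The
 * interval set in tick_start() keeps the timer periodic. */
static void tick_handler(int sig)
{
    int over;

    (void)sig;
    over = timer_getoverrun(tick_timer);
    tick_count++;
    if (over > 0)
        tick_missed += over;
}

/* Create a periodic timer with the given period and arm it exactly once. */
static int tick_start(long period_ns)
{
    struct sigaction sa;
    struct sigevent sev;
    struct itimerspec its;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = tick_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGRTMIN, &sa, NULL) != 0)
        return -1;

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGRTMIN;
    if (timer_create(CLOCK_MONOTONIC, &sev, &tick_timer) != 0)
        return -1;

    memset(&its, 0, sizeof(its));
    its.it_value.tv_sec = period_ns / 1000000000L;
    its.it_value.tv_nsec = period_ns % 1000000000L;
    its.it_interval = its.it_value;
    return timer_settime(tick_timer, 0, &its, NULL);
}

/* Tear the timer down again. */
static int tick_stop(void)
{
    return timer_delete(tick_timer);
}
```

With this shape, tick_count plus tick_missed counts the periods elapsed up to the most recent delivery, and there is no window between #1 and #2 in which an expiry could be attributed to the wrong arming.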
The same applies for the disarm scenario:
   arm_timer()
   ...
   expires()
     raise_signal()

   disarm_timer()
   ...
     discard signal
Older kernels did not discard it, but that makes zero sense because
after disarming the timer both the signal and the overrun become
immediately meaningless, no?
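The disarm case can be probed from user space with a sketch like the one below. It blocks the timer signal, lets a one-shot timer expire, disarms the timer with an all-zero itimerspec, and then polls for the signal with sigtimedwait(). The function name, the signal number (SIGRTMIN + 1) and the delays are assumptions for illustration. Going by the description above, one would expect 0 on kernels with this commit and 1 on older ones, but the sketch makes no claim about any specific version.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/* Returns 1 if the expiry signal is still delivered after the timer was
 * disarmed, 0 if it was discarded, -1 on failure. */
static int signal_survives_disarm(void)
{
    const int signo = SIGRTMIN + 1;             /* assumed signal number */
    struct timespec settle = { 0, 20000000 };   /* 20 ms, assumed to be enough */
    struct timespec nowait = { 0, 0 };
    struct itimerspec its;
    struct sigevent sev;
    sigset_t set, old;
    timer_t t;
    int result = -1;

    /* Keep the signal blocked so the expiry stays queued. */
    sigemptyset(&set);
    sigaddset(&set, signo);
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = signo;
    if (timer_create(CLOCK_MONOTONIC, &sev, &t) != 0)
        goto out;

    /* arm_timer(): one-shot, 1 ms; then wait until it has expired */
    memset(&its, 0, sizeof(its));
    its.it_value.tv_nsec = 1000000;
    if (timer_settime(t, 0, &its, NULL) != 0)
        goto out_delete;
    nanosleep(&settle, NULL);

    /* disarm_timer(): a zero it_value stops the timer */
    memset(&its, 0, sizeof(its));
    if (timer_settime(t, 0, &its, NULL) != 0)
        goto out_delete;

    /* Poll without waiting: is the old expiry still delivered? */
    if (sigtimedwait(&set, NULL, &nowait) == signo)
        result = 1;
    else if (errno == EAGAIN)
        result = 0;

out_delete:
    timer_delete(t);
out:
    /* Drain any leftover so that unblocking cannot terminate the process. */
    sigtimedwait(&set, NULL, &nowait);
    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```

The probe uses sigtimedwait() rather than sigpending() on the assumption that a stale timer signal may only be dropped when it is dequeued, in which case it could still show up in the pending set.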
And this has nothing to do with timer migration or whatever, that's just
a matter of correctness.
If you can point me to a real world workload, which uses timers
correctly and does not just do random stuff with them, I'm happy to look
into it.
But this stress-ng thing is just made up nonsense which created bogus
statistics forever. So comparing bogus numbers is not an indicator for
a real regression.
Thanks,
tglx