Message-ID: <20250511131922.145736-1-daniel@quora.org>
Date: Sun, 11 May 2025 21:19:21 +0800
From: Daniel J Blueman <daniel@...ra.org>
To: John Stultz <jstultz@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>,
Stephen Boyd <sboyd@...nel.org>
Cc: linux-kernel@...r.kernel.org,
stable@...nel.org,
Daniel J Blueman <daniel@...ra.org>,
Scott Hamilton <scott.hamilton@...den.com>
Subject: [PATCH] Prevent unexpected TSC to HPET clocksource fallback on many-socket systems
On systems with many sockets, kernel timekeeping may quietly fall back from
using the inexpensive core-level TSCs to the expensive legacy socket HPET,
notably impacting application performance until the system is rebooted.
This may be triggered by adverse workloads generating considerable
coherency or processor mesh congestion.
This manifests in the kernel log as:
clocksource: timekeeping watchdog on CPU1750: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource: 'hpet' wd_nsec: 503029760 wd_now: 48a38f74 wd_last: 47e3ab74 mask: ffffffff
clocksource: 'tsc' cs_nsec: 503466648 cs_now: 3224653e7bd cs_last: 3220d4f8d57 mask: ffffffffffffffff
clocksource: Clocksource 'tsc' skewed 436888 ns (0 ms) over watchdog 'hpet' interval of 503029760 ns (503 ms)
clocksource: 'tsc' is current clocksource.
tsc: Marking TSC unstable due to clocksource watchdog
TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
sched_clock: Marking unstable (882011139159, 1572951254)<-(913395032446, -29810979023)
clocksource: Checking clocksource tsc synchronization from CPU 1800 to CPUs 0,187,336,434,495,644,1719,1792.
clocksource: Switched to clocksource hpet
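As a reading aid for the log above, the reported skew is the difference between the two measured intervals (cs_nsec minus wd_nsec), which can also be expressed in parts per million of the watchdog interval. The following is a minimal user-space sketch of that arithmetic only; the helper names skew_ns() and skew_ppm() are hypothetical and are not kernel functions.

```c
#include <stdint.h>

/*
 * Hypothetical helper: absolute difference between the interval measured
 * by the clocksource under test (cs_nsec) and by the watchdog (wd_nsec).
 */
static int64_t skew_ns(int64_t cs_nsec, int64_t wd_nsec)
{
	int64_t delta = cs_nsec - wd_nsec;

	return delta < 0 ? -delta : delta;
}

/*
 * Hypothetical helper: the same skew as parts per million of the
 * watchdog interval, rounded down. Returns 0 for a zero interval.
 */
static int64_t skew_ppm(int64_t cs_nsec, int64_t wd_nsec)
{
	if (wd_nsec == 0)
		return 0;

	return skew_ns(cs_nsec, wd_nsec) * 1000000 / wd_nsec;
}
```

With the values logged above, skew_ns(503466648, 503029760) gives the reported 436888 ns, and skew_ppm() gives 868 over the 503 ms interval.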
Scale the default timekeeping watchdog uncertainty margin by the log2 of
the number of online NUMA nodes; this allows a more appropriate margin
from embedded systems to many-socket systems.
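To illustrate the scaling, here is a minimal user-space sketch of the expression the patch uses for MAX_SKEW_USEC, that is, the Kconfig value multiplied by max(ilog2(nr_online_nodes), 1). The names floor_log2() and scaled_max_skew_us() are hypothetical stand-ins for illustration; the kernel itself uses ilog2() and nr_online_nodes.

```c
/*
 * Floor of log2(n) for n >= 1; a user-space stand-in for the kernel's
 * ilog2(). Returns 0 for n == 0 or n == 1.
 */
static int floor_log2(unsigned int n)
{
	int r = 0;

	while (n >>= 1)
		r++;

	return r;
}

/*
 * Hypothetical helper mirroring the patch's expression:
 *   CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US * max(ilog2(nr_online_nodes), 1)
 */
static unsigned int scaled_max_skew_us(unsigned int base_us,
				       unsigned int online_nodes)
{
	int factor = floor_log2(online_nodes);

	if (factor < 1)
		factor = 1;

	return base_us * (unsigned int)factor;
}
```

With the new default of 50 us, one or two online nodes give 50 us, four nodes give 100 us, and sixteen nodes give 200 us. Because ilog2() rounds down, twelve nodes give 150 us. How many NUMA nodes a given system exposes depends on its configuration, so these node counts are illustrative only.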
This fix successfully prevents HPET fallback on Eviden 12 socket/1440
thread SH120 and 16 socket/1920 thread SH160 Intel SPR systems with
Numascale XNC node controllers.
Reviewed-by: Scott Hamilton <scott.hamilton@...den.com>
Signed-off-by: Daniel J Blueman <daniel@...ra.org>
---
kernel/time/Kconfig | 8 +++++---
kernel/time/clocksource.c | 9 ++++++++-
2 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index b0b97a60aaa6..48dd517bc0b3 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -200,10 +200,12 @@ config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
int "Clocksource watchdog maximum allowable skew (in microseconds)"
depends on CLOCKSOURCE_WATCHDOG
range 50 1000
- default 125
+ default 50
help
- Specify the maximum amount of allowable watchdog skew in
- microseconds before reporting the clocksource to be unstable.
+ Specify the maximum allowable watchdog skew in microseconds, scaled
+ by the log2 of the number of online NUMA nodes to track system
+ latency, before reporting the clocksource to be unstable.
+
The default is based on a half-second clocksource watchdog
interval and NTP's maximum frequency drift of 500 parts
per million. If the clocksource is good enough for NTP,
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index bb48498ebb5a..43e2e9cc086a 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -10,7 +10,9 @@
#include <linux/device.h>
#include <linux/clocksource.h>
#include <linux/init.h>
+#include <linux/log2.h>
#include <linux/module.h>
+#include <linux/nodemask.h>
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>
#include <linux/kthread.h>
@@ -133,9 +135,12 @@ static u64 suspend_start;
* under test is not permitted to go below the 500ppm minimum defined
* by MAX_SKEW_USEC. This 500ppm minimum may be overridden using the
* CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option.
+ *
+ * If overridden, linearly scale this value by the log2 of the number of
+ * online NUMA nodes for a reasonable upper bound on system latency.
*/
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
-#define MAX_SKEW_USEC CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
+#define MAX_SKEW_USEC (CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US * max(ilog2(nr_online_nodes), 1))
#else
#define MAX_SKEW_USEC (125 * WATCHDOG_INTERVAL / HZ)
#endif
@@ -1195,6 +1200,8 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
* comment preceding CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US above.
*/
if (scale && freq && !cs->uncertainty_margin) {
+ pr_info("Using clocksource watchdog maximum skew of %uus\n", MAX_SKEW_USEC);
+
cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
--
2.48.1