Message-ID: <aHliA1T5yyuMVDNk@hpe.com>
Date: Thu, 17 Jul 2025 15:50:11 -0500
From: Dimitri Sivanich <sivanich@....com>
To: Jiri Wiesner <jwiesner@...e.de>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Jonathan Corbet <corbet@....net>, Steve Wahl <steve.wahl@....com>,
Justin Ernst <justin.ernst@....com>, Kyle Meyer <kyle.meyer@....com>,
Dimitri Sivanich <dimitri.sivanich@....com>,
Russ Anderson <russ.anderson@....com>,
Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH] x86: UV RTC: Add parameter to disable RTC clocksource

On Thu, Jul 17, 2025 at 05:44:45PM +0200, Jiri Wiesner wrote:
> Booting up an 8 NUMA node machine that has an UV RTC clocksource may
> result in the TSC being marked unstable by the clocksource watchdog due to
> time skew. The failures to verify the TSC happen soon after the current
> clocksource is switched to the TSC (usually the watchdog runs twice).
> Delaying the checks carried out by the clocksource watchdog after the
> system boots up does not make a difference.
>
> The clocksource watchdog compares two clocksources and it is assumed that
> it is always the clocksource being verified that has caused the time skew
> measured by the clocksource watchdog. To check the validity of this
> assumption, a debugging kernel was used. A third clocksource that was set
> to the HPET was added. The messages reported by the debugging kernel
> indicate that the time skew between the TSC and the HPET was only 22
> nanoseconds while the time skew between the TSC and sgi_rtc was 591659
> nanoseconds:
>
> clocksource: timekeeping watchdog on CPU176: Marking clocksource 'tsc' as unstable because the skew is too large:
> clocksource: 'sgi_rtc' wd_nsec: 479339803 wd_now: 1fab695e5a wd_last: 1f9e44dca0 mask: ffffffffffffff
> clocksource: 'hpet' wd2_nsec: 479931440 wd2_now: 90a1af85 wd2_last: 8fea9b37 mask: ffffffff
> clocksource: 'tsc' cs_nsec: 479931462 cs_now: 944e1c227d cs_last: 9412097879 mask: ffffffffffffffff
> clocksource: Clocksource 'tsc' skewed 591659 ns (0 ms) over watchdog 'sgi_rtc' interval of 479339803 ns (479 ms)
> clocksource: 'tsc' is current clocksource.
> tsc: Marking TSC unstable due to clocksource watchdog
> TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> sched_clock: Marking unstable (90731283360, -1108605523)<-(95136368481, -5513690634)
> clocksource: Checking clocksource tsc synchronization from CPU 501 to CPUs 0-500,502-767.
> clocksource: CPU 501 check durations 1446ns - 32908ns for clocksource tsc.
>
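As a quick cross-check of the numbers in the log above, the skew values can be recomputed from the logged interval lengths. This is only a user-space sketch: the function name interval_skew_ns() is made up for illustration and is not a kernel symbol, and the inputs are the wd_nsec, wd2_nsec and cs_nsec values copied from the messages.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): absolute difference between
 * the interval measured by the clocksource under test (cs_nsec) and the
 * interval measured by a reference clocksource (ref_nsec), in nanoseconds.
 */
int64_t interval_skew_ns(int64_t cs_nsec, int64_t ref_nsec)
{
	/* Interval lengths taken from the log are never negative. */
	assert(cs_nsec >= 0 && ref_nsec >= 0);

	return cs_nsec > ref_nsec ? cs_nsec - ref_nsec : ref_nsec - cs_nsec;
}
```

With the logged values, interval_skew_ns(479931462, 479339803) gives 591659 ns for tsc against sgi_rtc, and interval_skew_ns(479931462, 479931440) gives 22 ns for tsc against hpet, matching the figures quoted in the description.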
> This happened on CPU 176, which resides on NUMA node 3. The interval was
> computed from timestamps from CPU 176 and from CPU 175, which also resides
> on NUMA node 3. Since the time skew was reported between CPUs residing on
> the same NUMA node, it is unlikely that the TSC would experience time skew.
>
> The debugging kernel printed out the last message in
> clocksource_verify_percpu() unconditionally, and all CPUs were checked.
> None of the CPUs was reported as being behind or ahead of CPU 501. The
> last message provides a worst case estimate. The value of 2 * cs_nsec_max
> (2 * 32908 ns) is the maximum possible time skew between the TSCs of any
> two CPUs on the system, as measured by the TSC sync check. The cs_nsec_max
> value itself is an estimate because it includes delays incurred by
> executing and servicing an inter-processor interrupt synchronously, which
> has a non-negligible cost. The maximum possible time skew (of the TSC) of
> 66 microseconds does not even approach the size of the time skew measured
> by the clocksource watchdog.
>
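The worst-case argument above is simple arithmetic and can be sketched as follows. The function names are hypothetical and chosen for illustration only; the inputs are the cs_nsec_max value (32908 ns) and the reported skew (591659 ns) from the quoted log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: upper bound on the TSC skew between any two CPUs,
 * taken as twice the longest check duration reported by the sync check.
 */
int64_t tsc_sync_worst_case_ns(int64_t cs_nsec_max)
{
	assert(cs_nsec_max >= 0);

	return 2 * cs_nsec_max;
}

/*
 * Hypothetical helper: true if the skew reported by the watchdog is larger
 * than anything the TSC sync check would allow for the TSC alone.
 */
bool skew_exceeds_tsc_bound(int64_t watchdog_skew_ns, int64_t cs_nsec_max)
{
	return watchdog_skew_ns > tsc_sync_worst_case_ns(cs_nsec_max);
}
```

For cs_nsec_max = 32908 the bound is 65816 ns (about 66 microseconds), and the reported skew of 591659 ns is almost nine times larger, so skew_exceeds_tsc_bound(591659, 32908) is true.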
> Testing has shown that the HPET is more stable than sgi_rtc so the HPET is a
> better choice for verifying the TSC. Disabling the sgi_rtc clocksource was
> implemented as a workaround. The name of the parameter was inspired by
> 581f202bcd60 ("x86: UV RTC: Always enable RTC clocksource") and the fact
> that there also is a nohpet parameter and a notsc parameter. The uvrtcevt
> parameter has been documented.
>
On the face of it, the patch you're proposing looks OK to me, and it follows
the precedent set by other clocksources.

However, while the HPET may seem like a viable backup clocksource for purposes
of watchdog checking, it won't scale when assigned as an actual clocksource.
The UV RTC when used as an actual clocksource is more scalable than the HPET,
but it does have higher access latency than the TSC. The TSC provides the
low-latency clocksource needed by many applications.

HPE UV hardware is designed to have a reliable and synchronized TSC mechanism.
Comparing the TSC against these secondary clocksources can result in false
positives due to variable access latency caused by system traffic. The best
course of action against these false positives has been found to be simply
disabling watchdog checking of the TSC. Currently we recommend that customers
add 'tsc=nowatchdog' to the kernel command line. Note that this has been
enforced in the kernel for other platforms with the following commits:

commit b50db7095fe002fa3e16605546cba66bf1b68a3e
Author: Feng Tang <feng.79.tang@...il.com>
Date:   Wed Nov 17 10:37:51 2021 +0800

    x86/tsc: Disable clocksource watchdog for TSC on qualified platorms

commit 233756a640be811efae33763db718fe29753b1e9
Author: Feng Tang <feng.79.tang@...il.com>
Date:   Wed Jun 7 15:54:33 2023 +0800

    x86/tsc: Extend watchdog check exemption to 4-Sockets platform

commit b4bac279319d3082eb42f074799c7b18ba528c71
Author: Feng Tang <feng.79.tang@...il.com>
Date:   Mon Jul 29 10:12:02 2024 +0800

    x86/tsc: Use topology_max_packages() to get package number

Going forward, we will likely submit a patch that disables clocksource watchdog
checking for newer UV systems in the kernel as well.
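
For readers unfamiliar with those commits, the kind of qualification check they describe can be modeled roughly as below. This is a user-space sketch, not the kernel code: the struct, the function name and the constant are hypothetical, and the specific feature flags are an assumption based on a reading of the commits rather than anything stated in this thread. Only the package limit of four comes from the commit titles listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for what the kernel knows about a platform. */
struct platform_info {
	bool constant_tsc;	/* assumed: TSC ticks at a constant rate */
	bool nonstop_tsc;	/* assumed: TSC keeps running in deep idle */
	bool tsc_adjust;	/* assumed: TSC_ADJUST MSR is available */
	int max_packages;	/* number of CPU packages (sockets) */
};

/* Hypothetical constant mirroring the "4-Sockets" wording of the commit. */
#define MAX_EXEMPT_PACKAGES 4

/* Returns true if watchdog checking of the TSC would be skipped. */
bool tsc_watchdog_exempt(const struct platform_info *p)
{
	assert(p != NULL);

	return p->constant_tsc && p->nonstop_tsc && p->tsc_adjust &&
	       p->max_packages <= MAX_EXEMPT_PACKAGES;
}
```

Under this model a system with more than four packages never qualifies automatically, which would be consistent with large systems still needing 'tsc=nowatchdog' on the command line.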
> Signed-off-by: Jiri Wiesner <jwiesner@...e.de>
> ---
> Documentation/admin-guide/kernel-parameters.txt | 4 ++++
> arch/x86/platform/uv/uv_time.c | 11 ++++++++++-
> 2 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 07e22ba5bfe3..9839257181e3 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4302,6 +4302,8 @@
> This is required for the Braillex ib80-piezo Braille
> reader made by F.H. Papenmeier (Germany).
>
> + nouvrtc [X86] Disable the UV RTC clocksource (SGI RTC clock).
> +
> nosgx [X86-64,SGX,EARLY] Disables Intel SGX kernel support.
>
> nosmap [PPC,EARLY]
> @@ -7839,6 +7841,8 @@
> 16 - SIGBUS faults
> Example: user_debug=31
>
> + uvrtcevt [X86] Use UV RTC clock events (SGI RTC clock) for timers.
> +
> vdso= [X86,SH,SPARC]
> On X86_32, this is an alias for vdso32=. Otherwise:
>
> diff --git a/arch/x86/platform/uv/uv_time.c b/arch/x86/platform/uv/uv_time.c
> index 3712afc3534d..03d59b87c371 100644
> --- a/arch/x86/platform/uv/uv_time.c
> +++ b/arch/x86/platform/uv/uv_time.c
> @@ -61,6 +61,7 @@ struct uv_rtc_timer_head {
>   */
>  static struct uv_rtc_timer_head **blade_info __read_mostly;
>
> +static int uv_rtc_enable = 1;
>  static int uv_rtc_evt_enable;
>
>  /*
> @@ -321,6 +322,14 @@ static void uv_rtc_interrupt(void)
>  	ced->event_handler(ced);
>  }
>
> +static int __init uv_disable_rtc(char *str)
> +{
> +	uv_rtc_enable = 0;
> +
> +	return 1;
> +}
> +__setup("nouvrtc", uv_disable_rtc);
> +
>  static int __init uv_enable_evt_rtc(char *str)
>  {
>  	uv_rtc_evt_enable = 1;
> @@ -342,7 +351,7 @@ static __init int uv_rtc_setup_clock(void)
>  {
>  	int rc;
>
> -	if (!is_uv_system())
> +	if (!uv_rtc_enable || !is_uv_system())
>  		return -ENODEV;
>
>  	rc = clocksource_register_hz(&clocksource_uv, sn_rtc_cycles_per_second);
> --
> 2.43.0
>
>
> --
> Jiri Wiesner
> SUSE Labs