linux-kernel - Re: [PATCH RESEND] Prevent unexpected TSC to HPET clocksource fallback on many-socket systems

Message-ID: <CANDhNCqn__w4kGE2N6P5MndR4=2KwJnrb9=+eMo0=j5ToP6UZQ@mail.gmail.com>
Date: Mon, 2 Jun 2025 18:35:40 -0700
From: John Stultz <jstultz@...gle.com>
To: Daniel J Blueman <daniel@...ra.org>
Cc: Thomas Gleixner <tglx@...utronix.de>, Stephen Boyd <sboyd@...nel.org>, linux-kernel@...r.kernel.org, 
	stable@...nel.org, Scott Hamilton <scott.hamilton@...den.com>
Subject: Re: [PATCH RESEND] Prevent unexpected TSC to HPET clocksource
 fallback on many-socket systems

On Mon, Jun 2, 2025 at 3:34 PM Daniel J Blueman <daniel@...ra.org> wrote:
>
> On systems with many sockets, kernel timekeeping may quietly fall back from
> using the inexpensive core-level TSCs to the expensive legacy socket HPET,
> notably impacting application performance until the system is rebooted.
> This may be triggered by adverse workloads generating considerable
> coherency or processor mesh congestion.
>
> This manifests in the kernel log as:
>  clocksource: timekeeping watchdog on CPU1750: Marking clocksource 'tsc' as unstable because the skew is too large:
>  clocksource:                       'hpet' wd_nsec: 503029760 wd_now: 48a38f74 wd_last: 47e3ab74 mask: ffffffff
>  clocksource:                       'tsc' cs_nsec: 503466648 cs_now: 3224653e7bd cs_last: 3220d4f8d57 mask: ffffffffffffffff
>  clocksource:                       Clocksource 'tsc' skewed 436888 ns (0 ms) over watchdog 'hpet' interval of 503029760 ns (503 ms)
>  clocksource:                       'tsc' is current clocksource.
>  tsc: Marking TSC unstable due to clocksource watchdog
>  TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>  sched_clock: Marking unstable (882011139159, 1572951254)<-(913395032446, -29810979023)
>  clocksource: Checking clocksource tsc synchronization from CPU 1800 to CPUs 0,187,336,434,495,644,1719,1792.
>  clocksource: Switched to clocksource hpet
>
> Scale the default timekeeping watchdog uncertainty margin by the log2 of
> the number of online NUMA nodes; this allows a more appropriate margin
> from embedded systems to many-socket systems.

So, missing context from the commit message:
* Why is it "appropriate" for the TSC and HPET to be further out of
sync on NUMA machines?
* Why is log2(NUMA nodes) the right metric to scale by?
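
To illustrate the second question, here is a minimal Python sketch of how
a log2-based factor grows with node count. The 250 us base margin and the
`base * (1 + ilog2(nodes))` form are assumptions made for this example and
are not taken from the patch; `ilog2` and `scaled_margin_ns` are
hypothetical helper names.

```python
# Hypothetical sketch: how a log2(nodes) scale factor would widen a margin.
# BASE_MARGIN_NS and the formula are assumptions, not the patch's values.

BASE_MARGIN_NS = 250_000  # assumed base uncertainty margin (250 us)


def ilog2(n: int) -> int:
    """Integer floor(log2(n)) for n >= 1."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return n.bit_length() - 1


def scaled_margin_ns(nodes: int, base_ns: int = BASE_MARGIN_NS) -> int:
    """Assumed scaling: the base margin on one node, growing by one
    base unit each time the node count doubles."""
    return base_ns * (1 + ilog2(nodes))


if __name__ == "__main__":
    for nodes in (1, 2, 4, 12, 16, 64):
        print(f"{nodes:3d} nodes -> margin {scaled_margin_ns(nodes)} ns")
```

Under these assumptions the margin is the same for 8 and 12 nodes and
grows by only one step from 8 to 16, which is why the choice of metric
needs justifying.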

> This fix successfully prevents HPET fallback on Eviden 12 socket/1440
> thread SH120 and 16 socket/1920 thread SH160 Intel SPR systems with
> Numascale XNC node controllers.

I recognize improperly falling back to HPET is costly and unwanted,
but given the history of bad TSCs, why is this loosening of the sanity
checks actually safe?

The skew you've highlighted above looks to be > 800ppm, which is well
beyond what NTP can correct for, so it might be good to better explain
why this skew is happening. You mention congestion: is the skew
consistent, or is it short term due to read latencies? If the latter,
would trying again or changing how we sample be more appropriate than
just growing the acceptable skew window?
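
The figures in the quoted log can be checked with a little arithmetic.
The Python sketch below uses only the wd_nsec, cs_nsec, wd_now and
wd_last values from the log above. The 500 ppm figure is the commonly
cited NTP frequency correction limit and is included here as an
assumption; the HPET tick period is inferred from the log, not stated in
it.

```python
# Values copied from the quoted watchdog log.
WD_NSEC = 503_029_760   # HPET (watchdog) interval in ns
CS_NSEC = 503_466_648   # TSC (clocksource) interval in ns
WD_NOW = 0x48A38F74     # HPET counter at the end of the interval
WD_LAST = 0x47E3AB74    # HPET counter at the start of the interval
WD_MASK = 0xFFFFFFFF    # HPET counter mask from the log

# Assumption: the usual limit on NTP frequency correction.
NTP_MAX_SLEW_PPM = 500


def skew_ppm(cs_nsec: int, wd_nsec: int) -> float:
    """Relative skew of the clocksource against the watchdog, in ppm."""
    return abs(cs_nsec - wd_nsec) / wd_nsec * 1e6


skew_ns = CS_NSEC - WD_NSEC
ppm = skew_ppm(CS_NSEC, WD_NSEC)

# Cross-check: HPET cycles elapsed and the implied ns per HPET tick.
hpet_cycles = (WD_NOW - WD_LAST) & WD_MASK
ns_per_hpet_tick = WD_NSEC / hpet_cycles

if __name__ == "__main__":
    print(f"skew: {skew_ns} ns over {WD_NSEC} ns = {ppm:.1f} ppm")
    print(f"exceeds assumed NTP limit: {ppm > NTP_MAX_SLEW_PPM}")
    print(f"HPET cycles: {hpet_cycles}, ns per tick: {ns_per_hpet_tick}")
```

The difference of the two intervals matches the 436888 ns skew reported
in the log, and the ratio comes out at roughly 868 ppm for this single
503 ms sample.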

These sorts of checks were important before as NUMA systems might have
separate crystals on different nodes, so the TSCs (and HPETs) could
drift relative to each other, and ignoring such a problem could result
in visible TSC inconsistencies.  So I just want to make sure this
isn't solving an issue for you but opening a problem for someone else.

thanks
-john