Message-ID: <CAMVG2ssyhp2zFAOu74FUQtWn55XwUtiSp49FqNwy6Pj-f6sgQw@mail.gmail.com>
Date: Wed, 4 Jun 2025 16:01:31 +0800
From: Daniel J Blueman <daniel@...ra.org>
To: John Stultz <jstultz@...gle.com>
Cc: Thomas Gleixner <tglx@...utronix.de>, Stephen Boyd <sboyd@...nel.org>, linux-kernel@...r.kernel.org,
stable@...nel.org, Scott Hamilton <scott.hamilton@...den.com>
Subject: Re: [PATCH RESEND] Prevent unexpected TSC to HPET clocksource
fallback on many-socket systems
On Tue, 3 Jun 2025 at 09:35, John Stultz <jstultz@...gle.com> wrote:
>
> On Mon, Jun 2, 2025 at 3:34 PM Daniel J Blueman <daniel@...ra.org> wrote:
> >
> > On systems with many sockets, kernel timekeeping may quietly fall back
> > from the inexpensive core-level TSCs to the expensive legacy socket HPET,
> > notably impacting application performance until the system is rebooted.
> > This may be triggered by adverse workloads generating considerable
> > coherency or processor-mesh congestion.
> >
> > This manifests in the kernel log as:
> > clocksource: timekeeping watchdog on CPU1750: Marking clocksource 'tsc' as unstable because the skew is too large:
> > clocksource: 'hpet' wd_nsec: 503029760 wd_now: 48a38f74 wd_last: 47e3ab74 mask: ffffffff
> > clocksource: 'tsc' cs_nsec: 503466648 cs_now: 3224653e7bd cs_last: 3220d4f8d57 mask: ffffffffffffffff
> > clocksource: Clocksource 'tsc' skewed 436888 ns (0 ms) over watchdog 'hpet' interval of 503029760 ns (503 ms)
> > clocksource: 'tsc' is current clocksource.
> > tsc: Marking TSC unstable due to clocksource watchdog
> > TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > sched_clock: Marking unstable (882011139159, 1572951254)<-(913395032446, -29810979023)
> > clocksource: Checking clocksource tsc synchronization from CPU 1800 to CPUs 0,187,336,434,495,644,1719,1792.
> > clocksource: Switched to clocksource hpet
> >
> > Scale the default timekeeping watchdog uncertainty margin by the log2 of
> > the number of online NUMA nodes; this yields an appropriate margin
> > across the range from embedded systems to many-socket systems.
>
> So, missing context from the commit message:
> * Why is it "appropriate" for the TSC and HPET to be further out of
> sync on numa machines?
I absolutely agree TSC skew is inappropriate. The TSCs here are kept
in sync by a single low-jitter base clock shared across all modules,
so this is an observability problem rather than genuine drift.
> * Why is log2(numa nodes) the right metric to scale by?
This is the simplest strategy I found that models the latency arising
from congestion in the underlying cache-coherency mesh, and it fits
previous and future processor architectures well.
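For illustration, the scaling could look roughly like this (a sketch
only, not the exact patch; the base margin, clamping and rounding in
the real change may differ):

#include <linux/log2.h>
#include <linux/minmax.h>
#include <linux/nodemask.h>

/*
 * Sketch: widen the clocksource watchdog uncertainty margin by log2
 * of the online NUMA node count. ilog2(1) == 0, so clamp the factor
 * to at least 1 to keep a sane margin on single-node systems.
 */
static u32 scaled_uncertainty_margin(u32 base_margin)
{
	int factor = max(1, ilog2(num_online_nodes()));

	return base_margin * factor;
}

So a 1-2 node system keeps the base margin, an 8-node system gets 3x,
and a 16-socket system such as the SH160 gets 4x.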
> > This fix successfully prevents HPET fallback on Eviden 12 socket/1440
> > thread SH120 and 16 socket/1920 thread SH160 Intel SPR systems with
> > Numascale XNC node controllers.
>
> I recognize improperly falling back to HPET is costly and unwanted,
> but given the history of bad TSCs, why is this loosening of the sanity
> checks actually safe?
The current approach fails on large systems, so the interconnect
market leaders shipping these 12-16 socket systems require users to
boot with "tsc=nowatchdog", which disables the check entirely.
Since this change introduces scaling, it conservatively tightens the
margin for 1-2 NUMA node systems, where these values have been
historically appropriate.
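For context, the check being relaxed is roughly of this shape (a
paraphrase of the skew test in kernel/time/clocksource.c's
clocksource_watchdog(), not verbatim):

/*
 * The clocksource is marked unstable only when the measured interval
 * disagreement exceeds the combined uncertainty margins of the
 * clocksource and its watchdog, so scaling the margin by node count
 * widens exactly this window and nothing else.
 */
static bool skew_exceeds_margin(s64 cs_nsec, s64 wd_nsec,
				u32 cs_margin, u32 wd_margin)
{
	s64 md = (s64)cs_margin + wd_margin;

	return abs(cs_nsec - wd_nsec) > md;
}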
> The skew you've highlighted above looks to be > 800ppm, which is well
> beyond what NTP can correct for, so it might be good to better explain
> why this skew is happening (you mention congestion, so is the skew
> consistent, or short term due to read latencies? if so would trying
> again or changing how we sample be more appropriate than just growing
> the acceptable skew window?).
Indeed, 436888 ns over the 503029760 ns interval is ~868 ppm. For the
workloads I instrumented, the read latencies aren't consistently high
enough to trip HPET fallback if there were further retrying, so
characterising the read latencies as 'bursty' is reasonable.
Ultimately, this reflects complex dependency patterns in inter- and
intra-socket coherency queuing, so there is some higher baseline
latency.
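A sketch of the retry-style filtering you allude to, in the spirit of
the retries cs_watchdog_read() already performs (the helper name and
retry bound here are hypothetical):

#include <linux/clocksource.h>

#define WD_READ_RETRIES	8	/* hypothetical bound */

/*
 * Hypothetical sketch: re-read the watchdog/clocksource pair and only
 * accept a sample whose watchdog round-trip stayed short, so a bursty
 * mesh-congestion spike doesn't masquerade as TSC skew.
 */
static bool wd_read_filtered(struct clocksource *cs, struct clocksource *wd,
			     u64 *csnow, u64 *wdnow, u64 max_delay_ns)
{
	u64 wd_end, delta, wd_delay;
	int i;

	for (i = 0; i < WD_READ_RETRIES; i++) {
		*wdnow = wd->read(wd);
		*csnow = cs->read(cs);
		wd_end = wd->read(wd);

		/* Watchdog round-trip, converted to nanoseconds */
		delta = (wd_end - *wdnow) & wd->mask;
		wd_delay = clocksource_cyc2ns(delta, wd->mult, wd->shift);
		if (wd_delay < max_delay_ns)
			return true;	/* clean sample */
	}
	return false;	/* persistently congested; skip this pass */
}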
> These sorts of checks were important before as NUMA systems might have
> separate crystals on different nodes, so the TSCs (and HPETs) could
> drift relative to each other, and ignoring such a problem could result
> in visible TSC inconsistencies. So I just want to make sure this
> isn't solving an issue for you but opening a problem for someone else.
Yes, early cache-coherent interconnects didn't have a shared
inter-module base clock. The hierarchical software clocksource
mechanism I developed for those closed most of the gap to TSC
performance, though at higher jitter of course.
Definitely agreed that we want to detect systematic TSC skew; I am
happy to prepare an alternative approach if preferred.
Many thanks for the discussion on this John,
Dan
--
Daniel J Blueman