Message-ID: <CANDhNCqn__w4kGE2N6P5MndR4=2KwJnrb9=+eMo0=j5ToP6UZQ@mail.gmail.com>
Date: Mon, 2 Jun 2025 18:35:40 -0700
From: John Stultz <jstultz@...gle.com>
To: Daniel J Blueman <daniel@...ra.org>
Cc: Thomas Gleixner <tglx@...utronix.de>, Stephen Boyd <sboyd@...nel.org>, linux-kernel@...r.kernel.org, 
	stable@...nel.org, Scott Hamilton <scott.hamilton@...den.com>
Subject: Re: [PATCH RESEND] Prevent unexpected TSC to HPET clocksource
 fallback on many-socket systems

On Mon, Jun 2, 2025 at 3:34 PM Daniel J Blueman <daniel@...ra.org> wrote:
>
> On systems with many sockets, kernel timekeeping may quietly fall back from
> using the inexpensive core-level TSCs to the expensive legacy socket HPET,
> notably impacting application performance until the system is rebooted.
> This may be triggered by adverse workloads generating considerable
> coherency or processor mesh congestion.
>
> This manifests in the kernel log as:
>  clocksource: timekeeping watchdog on CPU1750: Marking clocksource 'tsc' as unstable because the skew is too large:
>  clocksource:                       'hpet' wd_nsec: 503029760 wd_now: 48a38f74 wd_last: 47e3ab74 mask: ffffffff
>  clocksource:                       'tsc' cs_nsec: 503466648 cs_now: 3224653e7bd cs_last: 3220d4f8d57 mask: ffffffffffffffff
>  clocksource:                       Clocksource 'tsc' skewed 436888 ns (0 ms) over watchdog 'hpet' interval of 503029760 ns (503 ms)
>  clocksource:                       'tsc' is current clocksource.
>  tsc: Marking TSC unstable due to clocksource watchdog
>  TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>  sched_clock: Marking unstable (882011139159, 1572951254)<-(913395032446, -29810979023)
>  clocksource: Checking clocksource tsc synchronization from CPU 1800 to CPUs 0,187,336,434,495,644,1719,1792.
>  clocksource: Switched to clocksource hpet
>
> Scale the default timekeeping watchdog uncertainty margin by the log2 of
> the number of online NUMA nodes; this allows a more appropriate margin
> from embedded systems to many-socket systems.

So, missing context from the commit message:
* Why is it "appropriate" for the TSC and HPET to be further out of
sync on NUMA machines?
* Why is log2(numa nodes) the right metric to scale by?
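
For reference on the second question, here is a minimal sketch of how a
log2 factor grows with node count. The quoted message does not include the
patch itself, so the function name `log2_scale`, the floor of 1 for small
systems, and the node counts below are assumptions for illustration only,
not the patch's actual code.

```python
import math


def log2_scale(online_nodes: int) -> float:
    """Hypothetical scale factor: log2 of the online NUMA node count.

    Assumption: the factor is floored at 1 so that one- and two-node
    systems keep the unscaled margin. The real patch may differ.
    """
    if online_nodes < 1:
        raise ValueError("need at least one online node")
    return max(1.0, math.log2(online_nodes))


if __name__ == "__main__":
    # Illustrative node counts; not taken from the quoted systems.
    for nodes in (1, 2, 4, 8, 16):
        print(f"{nodes:2d} nodes -> margin x{log2_scale(nodes):.2f}")
```

Under these assumptions the margin only quadruples between 2 and 16 nodes,
which is why the choice of log2 over a linear factor needs a stated
justification.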

> This fix successfully prevents HPET fallback on Eviden 12 socket/1440
> thread SH120 and 16 socket/1920 thread SH160 Intel SPR systems with
> Numascale XNC node controllers.

I recognize improperly falling back to HPET is costly and unwanted,
but given the history of bad TSCs, why is this loosening of the sanity
checks actually safe?

The skew you've highlighted above looks to be > 800ppm (436888 ns over a
503029760 ns watchdog interval), which is well beyond what NTP can
correct for, so it might be good to better explain why this skew is
happening. You mention congestion, so is the skew consistent, or short
term due to read latencies? If the latter, would trying again or changing
how we sample be more appropriate than just growing the acceptable skew
window?
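
As a rough check of the ppm figure, the skew can be recomputed from the
wd_nsec and cs_nsec values in the log excerpt quoted earlier. This is a
minimal sketch: the helper name `skew_ppm` is made up for illustration,
and the 500 ppm constant is an assumption standing in for the usual limit
on NTP frequency correction.

```python
# Values copied from the quoted kernel log, both in nanoseconds.
WD_NSEC = 503_029_760  # interval as measured by the 'hpet' watchdog
CS_NSEC = 503_466_648  # same interval as measured by 'tsc'

# Assumption: the usual upper bound on NTP frequency correction.
NTP_MAX_PPM = 500.0


def skew_ppm(cs_nsec: int, wd_nsec: int) -> float:
    """Skew of the clocksource relative to the watchdog, in parts per million."""
    return abs(cs_nsec - wd_nsec) / wd_nsec * 1e6


skew_ns = abs(CS_NSEC - WD_NSEC)
ppm = skew_ppm(CS_NSEC, WD_NSEC)

if __name__ == "__main__":
    print(f"skew: {skew_ns} ns over {WD_NSEC} ns = {ppm:.1f} ppm")
    print(f"exceeds assumed NTP limit of {NTP_MAX_PPM:.0f} ppm: {ppm > NTP_MAX_PPM}")
```

The absolute difference matches the 436888 ns reported in the log, and the
ratio comes out between 868 and 869 ppm.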

These sorts of checks were important before as NUMA systems might have
separate crystals on different nodes, so the TSCs (and HPETs) could
drift relative to each other, and ignoring such a problem could result
in visible TSC inconsistencies.  So I just want to make sure this
isn't solving an issue for you but opening a problem for someone else.

thanks
-john
