Message-ID: <alpine.DEB.2.21.1908292225000.1938@nanos.tec.linutronix.de>
Date:   Thu, 29 Aug 2019 23:38:11 +0200 (CEST)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Kai-Heng Feng <kai.heng.feng@...onical.com>
cc:     Ingo Molnar <mingo@...nel.org>, Borislav Petkov <bp@...en8.de>,
        "H. Peter Anvin" <hpa@...or.com>, harry.pan@...el.com,
        x86@...nel.org, LKML <linux-kernel@...r.kernel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Daniel Drake <drake@...lessm.com>,
        Dan Williams <dan.j.williams@...el.com>,
        "Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
        Len Brown <lenb@...nel.org>,
        Tom Lendacky <thomas.lendacky@....com>, Pu Wen <puwen@...on.cn>
Subject: [RFD] x86/tsc: Loosen the requirements for watchdog - (was x86/hpet:
 Disable HPET on Intel Coffe Lake)

On Thu, 29 Aug 2019, Thomas Gleixner wrote:
> On Thu, 29 Aug 2019, Kai-Heng Feng wrote:
> > I know we should find the root cause rather than stopping at "it's a firmware
> > bug", but users are already affected by this issue [1].
> > Is there any better short-term workaround?
> 
> Not really. And if Intel stays silent, I'm just going to apply it as is
> along with a stable tag.

Summary for those who are new on CC:

   Coffee Lake machines have an HPET which is wrecked by the C10 state. This
   causes the TSC clocksource watchdog to misbehave, which is not surprising
   as that's like trying to monitor an atomic clock with a sun-dial.

   So the intention is to disable HPET on those machines, which also affects
   Kaby Lake CPUs as they share the model number and just differ in the
   stepping. Unless we get precise information from Intel about which
   steppings are affected and that these are the only ones, we won't go down
   the stepping road as that is going to be an endless whack-a-mole game.
   Tried that before and got burned...
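
To illustrate why a broken watchdog makes a healthy TSC look bad, here is a
minimal sketch of the kind of comparison a clocksource watchdog performs. The
function name and the tolerance value are assumptions made for illustration,
not the kernel's actual code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative tolerance only; the real threshold is a kernel detail. */
#define WD_THRESHOLD_NS 62500000LL

/*
 * Compare the time elapsed according to the monitored clocksource (TSC)
 * with the time elapsed according to the watchdog (HPET) over the same
 * interval. A large disagreement is blamed on the monitored clocksource,
 * even when the watchdog is the one that is wrong.
 */
bool watchdog_flags_unstable(int64_t cs_delta_ns, int64_t wd_delta_ns)
{
	int64_t skew = cs_delta_ns - wd_delta_ns;

	if (skew < 0)
		skew = -skew;
	return skew > WD_THRESHOLD_NS;
}
```

If the HPET loses time in C10 and reports, say, 100ms for an interval in
which the TSC correctly counted 500ms, this check flags the TSC as unstable.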

While disabling HPET sounds trivial, this can have side effects.

If the HPET is not available for whatever reason, the kernel will use
ACPI_PMTIMER as the fallback clocksource for monitoring the TSC, provided
the affected systems actually advertise it. If they do not, that will
effectively disable NOHZ and high resolution timers. Disabling NOHZ is a
pain for power consumption, and those machines are mostly laptops, I assume.
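
The fallback order described above can be sketched as follows. This is a
simplified model under stated assumptions: the struct, enum and function
names are hypothetical and do not correspond to kernel symbols.

```c
#include <stdbool.h>

/* Hypothetical description of the timer hardware a machine exposes. */
struct timer_hw {
	bool hpet_usable;     /* false once the HPET is disabled by a quirk */
	bool acpi_pm_present; /* firmware advertises the ACPI PM timer */
};

enum watchdog_src { WD_HPET, WD_ACPI_PM, WD_NONE };

/* Pick the clocksource used to monitor the TSC: HPET first, then PM timer. */
enum watchdog_src pick_tsc_watchdog(const struct timer_hw *hw)
{
	if (hw->hpet_usable)
		return WD_HPET;
	if (hw->acpi_pm_present)
		return WD_ACPI_PM;
	return WD_NONE;
}

/*
 * Without any watchdog the TSC cannot be verified, so in this model NOHZ
 * and high resolution timers are effectively unavailable.
 */
bool nohz_and_hrtimers_available(const struct timer_hw *hw)
{
	return pick_tsc_watchdog(hw) != WD_NONE;
}
```

A machine with the HPET disabled and no ACPI PM timer ends up with WD_NONE,
which is the power consumption problem mentioned above.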

Now there is something we can consider doing:

These CPUs finally have a working and usable TSC - knock on wood!

Just for the record: That's 20+ years after we started asking for it!

The TSC has constant frequency and does not stop in deeper C-states. Aside
from that, these CPUs have the TSC_ADJUST MSR, which allows us to figure out
when the BIOS/SMM manages to wreck the TSC on a CPU by writing to it for
completely wrong reasons.
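
A minimal sketch of how TSC_ADJUST makes such writes visible: a write to the
TSC shows up as a matching change in TSC_ADJUST, so comparing the MSR with
the value recorded earlier reveals the tampering. The MSR is simulated with
a plain struct field here, and all names are hypothetical; this is not the
kernel's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state; the field stands in for rdmsr/wrmsr. */
struct cpu_tsc_state {
	int64_t msr_tsc_adjust;  /* simulated TSC_ADJUST content */
	int64_t expected_adjust; /* value recorded when the CPU came up */
	unsigned int warnings;   /* number of detected modifications */
};

/* Model of firmware writing the TSC: the delta lands in TSC_ADJUST too. */
void firmware_writes_tsc(struct cpu_tsc_state *cpu, int64_t delta)
{
	cpu->msr_tsc_adjust += delta;
}

/* Returns true if a modification was detected and undone. */
bool verify_tsc_adjust(struct cpu_tsc_state *cpu)
{
	if (cpu->msr_tsc_adjust == cpu->expected_adjust)
		return false;

	cpu->warnings++;
	/* Restore the recorded value to undo the firmware's change. */
	cpu->msr_tsc_adjust = cpu->expected_adjust;
	return true;
}
```

This only covers per-CPU tampering. It cannot see sockets drifting apart,
because nothing is written to the TSC in that case.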

So we could finally start to trust the TSC, at least on single socket systems.

Multi-socket is a different story as the sockets might drift apart for
reasons which I really don't want to discuss in this context for CoC's
sake. So we definitely want a watchdog there as TSC_ADJUST is not able to
catch those issues.

So if we have to disable the HPET on Kaby Lake altogether unless Intel
comes up with a clever fix, i.e. poking at the right registers, then I
think we should also lift the TSC watchdog restrictions on these machines
if they are single socket, which they are, as the affected CPUs so far are
mobile and client types.
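
The proposed relaxation could be expressed roughly like this. It is a sketch
of the conditions named in this mail, with hypothetical struct and function
names, not a patch.

```c
#include <stdbool.h>

/* Hypothetical summary of the properties discussed in this mail. */
struct tsc_caps {
	bool constant_tsc;    /* TSC runs at a constant frequency */
	bool nonstop_tsc;     /* TSC keeps running in deeper C-states */
	bool has_tsc_adjust;  /* TSC_ADJUST MSR is available */
	unsigned int sockets; /* number of CPU sockets in the system */
};

/* Decide whether the TSC still needs a watchdog clocksource. */
bool tsc_needs_watchdog(const struct tsc_caps *caps)
{
	/* Without all three properties the TSC cannot be trusted on its own. */
	if (!caps->constant_tsc || !caps->nonstop_tsc || !caps->has_tsc_adjust)
		return true;

	/* Sockets may drift apart and TSC_ADJUST cannot catch that. */
	return caps->sockets > 1;
}
```

With these conditions, a single socket machine with a constant, nonstop TSC
and TSC_ADJUST would run without a watchdog, so a missing HPET would no
longer cost it NOHZ and high resolution timers.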

Also given the fact that we get more and more 'reduced' hardware exposed
via ACPI and we have already dealt with quite some fallout from various
related issues due to that, I fear we need to bite this bullet soon anyway.

But TBH, 20+ years of exposure to subtly wrecked timer hardware has left
quite a few scars.

I put the AMD/HYGON folks on CC as well, as they will run into similar
problems sooner rather than later and their CPUs still do not have the
TSC_ADJUST MSR, which is paramount for loosening the watchdog restrictions.
Hint, hint, hint...

Thoughts?

Thanks,

	tglx
