Message-ID: <0A46F54F-CEF5-42EE-8A95-F442FAD7A05D@amazon.de>
Date:   Thu, 7 Dec 2023 09:34:42 +0000
From:   "Sironi, Filippo" <sironi@...zon.de>
To:     Borislav Petkov <bp@...en8.de>
CC:     Tony Luck <tony.luck@...el.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "hpa@...or.com" <hpa@...or.com>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>
Subject: Re: [PATCH] x86/MCE: Get microcode revision from cpu_data instead of
 boot_cpu_data

> > Boris, I just took a quick look and I might be missing something. If cores
> > fail to load the microcode or timeout, we taint the kernel, print an error
> > message, and then bubble up an error to userspace via:
> >
> > load_late_stop_cpus
> > load_late_locked
> > reload_store
> >
> > Right?
> 
> Yap.
> 
> > We would take servers that fail out of production;
> 
> And I'd like to hear about such issues. We added this failure checking
> only recently because something might go wrong and we want to warn. But
> it all updates fine here so kinda hard to test.

In a very large fleet, let's say that we see a handful of DPM (defects per
million) when counting whole processors, which means that the defect rate
counted per core is much lower still.

What we've seen in these cases is that early loading through the BIOS is
successful (I have never actually tried via the hypervisor), while late loading
consistently fails. When it fails, we've seen two cases: 1/ the core still
reports the old microcode version or 2/ the core reports a bogus microcode
version (0xfffffffe is quite common, at least on Intel).
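
For illustration only, the two failure modes above could be spotted from
userspace roughly as follows. This is a minimal sketch, not what we run in
production: it assumes the per-CPU "processor" and "microcode" fields that
x86 kernels print in /proc/cpuinfo, and the function names, the BOGUS_REVISIONS
set and the "mismatch" bucket are made up for the example. A core that still
reports the old revision simply lands in "mismatch", since the sketch only
knows the expected revision.

```python
# Hypothetical helper: classify per-CPU microcode revisions after a late load.
# Only 0xfffffffe is taken from the mail above; other bogus values may exist.
BOGUS_REVISIONS = {0xFFFFFFFE}


def parse_microcode_revisions(cpuinfo_text):
    """Return {cpu_number: revision} from /proc/cpuinfo-style text."""
    revisions = {}
    cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between CPU blocks
        key = key.strip()
        value = value.strip()
        if key == "processor":
            cpu = int(value)
        elif key == "microcode" and cpu is not None:
            # The field is printed as hex, e.g. "0x2b000461".
            revisions[cpu] = int(value, 16)
    return revisions


def classify(revisions, expected):
    """Split CPUs into ok / mismatch / bogus relative to the expected revision."""
    result = {"ok": [], "mismatch": [], "bogus": []}
    for cpu, rev in sorted(revisions.items()):
        if rev in BOGUS_REVISIONS:
            result["bogus"].append(cpu)
        elif rev == expected:
            result["ok"].append(cpu)
        else:
            # Includes the "core still reports the old revision" case.
            result["mismatch"].append(cpu)
    return result
```

One would feed it the contents of /proc/cpuinfo and the revision the late load
was supposed to install, e.g.
classify(parse_microcode_revisions(open("/proc/cpuinfo").read()), expected),
and take the server out of production if "mismatch" or "bogus" is non-empty.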

> My expectation is that if microcode fails loading on a subset of
> machines, the machine would more or less freeze. Depending, ofc, on what
> the microcode is updating...

It's bi-modal. We've seen servers that keep running until we take them out of
production, as well as servers that fail with an MCE of some sort, likely
leading to a CATERR/IERR.

> > however, for others it might be interesting to have the correct
> > information. The patch - with a reworked commit message - might still
> > be useful to a few.
> 
> 
> https://lore.kernel.org/r/20231118193248.1296798-3-yazen.ghannam@amd.com
> 
> 
> :)

:looking:




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879

