Message-ID: <366fc78e7b8c4474958b289eec31ed25@intel.com>
Date:   Thu, 7 Jan 2021 00:26:19 +0000
From:   "Luck, Tony" <tony.luck@...el.com>
To:     "paulmck@...nel.org" <paulmck@...nel.org>
CC:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "hpa@...or.com" <hpa@...or.com>,
        "kernel-team@...com" <kernel-team@...com>
Subject: RE: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

> Please see below for an updated patch.

Yes. That worked:

[   78.946069] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
[   78.946151] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
[   78.946153] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler

I guess that more than one CPU hit the timeout, so your new message was printed twice
before the panic code took over?

Once again, the whole of socket 1 is MIA rather than just the pair of threads on one of the cores there.
But that's a useful improvement (eliminating the other three sockets on this system).
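
For anyone reading the holdout list by hand, here is a minimal sketch (not part of
the patch) that expands the range notation shown in the log above and counts the
CPUs. The helper name parse_cpulist is hypothetical; the only input is the
"24-47,120-143" string from the message.

```python
def parse_cpulist(text):
    """Expand a range list such as "24-47,120-143" into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty fields, e.g. a trailing comma
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return sorted(cpus)


# The holdout list reported by mce_timed_out() in the log above.
holdouts = parse_cpulist("24-47,120-143")
print(len(holdouts))  # 48 CPUs in total: 24 in each of the two ranges
```

Two ranges of 24 CPUs each fit a whole socket being reported, where a single
misbehaving core would show up as only two CPU numbers.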

Tested-by: Tony Luck <tony.luck@...el.com>

-Tony