linux-kernel - RE: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

Message-ID: <3908561D78D1C84285E8C5FCA982C28F32950618@ORSMSX114.amr.corp.intel.com>
Date:	Fri, 21 Nov 2014 21:59:49 +0000
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Borislav Petkov <bp@...en8.de>
CC:	rui wang <ruiv.wang@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"gong.chen@...ux.intel.com" <gong.chen@...ux.intel.com>,
	"Wang, Rui Y" <rui.y.wang@...el.com>
Subject: RE: [PATCH v3] x86/mce: Try printing all machine check banks known
 before panic

>> That means there were no VALID=1, EN=1, S=1 errors anywhere.  But there
>> might be some other things logged that would help us understand.
>
> By "other things" you mean other MCEs?

Logs with EN=0 and/or S=0.  They may have interesting information, and have
a good chance of being useful (especially if they are from some functional
unit that isn't part of the buggy behavior).  Bad data flowing through multiple
functional units can leave a trail of logged entries (as many as four units
may see and log a single error).  Only one of them should signal the machine
check (to avoid a shutdown caused by a nested machine check).
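
To make the distinction concrete, here is a rough sketch (not the patch itself) of
sorting a bank's MCi_STATUS value into "empty", "logged only" and "signaled".  The
bit positions are the architectural ones (VAL=63, EN=60, S=56); the names
classify_bank() and count_logged_only() are made up for illustration, and the
sketch assumes the S bit is meaningful on the CPU in question (i.e. software
error recovery is supported).

```c
#include <stdint.h>

/* Architectural MCi_STATUS flag bits used in this discussion. */
#define STATUS_VAL (1ULL << 63)  /* bank holds a valid log */
#define STATUS_EN  (1ULL << 60)  /* error reporting was enabled */
#define STATUS_S   (1ULL << 56)  /* this error signaled the machine check */

/* Hypothetical classification, for illustration only. */
enum bank_class { BANK_EMPTY, BANK_LOGGED_ONLY, BANK_SIGNALED };

enum bank_class classify_bank(uint64_t status)
{
    if (!(status & STATUS_VAL))
        return BANK_EMPTY;        /* nothing logged here */
    if ((status & STATUS_EN) && (status & STATUS_S))
        return BANK_SIGNALED;     /* VALID=1, EN=1, S=1 */
    return BANK_LOGGED_ONLY;      /* valid, but EN=0 and/or S=0 */
}

/* Count the valid banks that did not signal the machine check. */
int count_logged_only(const uint64_t *status, int nbanks)
{
    int i, n = 0;

    for (i = 0; i < nbanks; i++)
        if (classify_bank(status[i]) == BANK_LOGGED_ONLY)
            n++;
    return n;
}
```

The BANK_LOGGED_ONLY entries are the ones I'm suggesting we print before panic
when no bank comes back as BANK_SIGNALED.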

> Oh, cpu errata. So this would mean that we can't even rely on the
> contents of the MCA banks, can we?
>
> In any case, is any of the information in the MCA banks in such cases
> even usable then? Because if not, we're definitely barking up the wrong
> tree...

See above.  Even if a bug in the core means the right bits aren't being set
in the MCi_STATUS register, I think we could still get good data from
devices out in the uncore.

-Tony