Message-ID: <3908561D78D1C84285E8C5FCA982C28F32950618@ORSMSX114.amr.corp.intel.com>
Date: Fri, 21 Nov 2014 21:59:49 +0000
From: "Luck, Tony" <tony.luck@...el.com>
To: Borislav Petkov <bp@...en8.de>
CC: rui wang <ruiv.wang@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"gong.chen@...ux.intel.com" <gong.chen@...ux.intel.com>,
"Wang, Rui Y" <rui.y.wang@...el.com>
Subject: RE: [PATCH v3] x86/mce: Try printing all machine check banks known
before panic
>> That means there were no VALID=1, EN=1, S=1 errors anywhere. But there
>> might be some other things logged that would help us understand.
>
> By "other things" you mean other MCEs?
Logs with EN=0 and/or S=0. They may have interesting information, and have
a good chance of being useful (especially if they are from some functional
unit that isn't part of the buggy behavior). Bad data flowing through multiple
functional units can leave a trail of logged entries (perhaps as many as four
units may see and log a single error). Only one of them should signal the machine
check (to avoid a shutdown caused by a nested machine check).
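
To make the distinction concrete, here is a rough sketch of how a bank's status value could be sorted into "signaled the machine check" versus "logged something else worth printing". The bit positions are the architectural IA32_MCi_STATUS ones (VAL = bit 63, EN = bit 60, S = bit 56); the macro and helper names are made up for illustration and are not taken from the patch under discussion.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural IA32_MCi_STATUS bits. The S bit is only meaningful on
 * CPUs that advertise software error recovery (MCG_CAP.SER_P). */
#define STATUS_VAL (1ULL << 63)  /* bank holds a valid log */
#define STATUS_EN  (1ULL << 60)  /* error reporting was enabled */
#define STATUS_S   (1ULL << 56)  /* error was signaled via machine check */

/* Hypothetical helper: true if this bank is the one that raised the
 * machine check, i.e. VALID=1, EN=1, S=1. */
static bool bank_signaled_mce(uint64_t status)
{
	const uint64_t mask = STATUS_VAL | STATUS_EN | STATUS_S;

	return (status & mask) == mask;
}

/* Hypothetical helper: true if the bank holds a valid log that did not
 * signal (EN=0 and/or S=0). These are the extra entries that may still
 * be worth printing before panic. */
static bool bank_has_extra_log(uint64_t status)
{
	return (status & STATUS_VAL) && !bank_signaled_mce(status);
}
```

A dump loop could print every bank for which either helper returns true, rather than stopping at the banks that signaled.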
> Oh, cpu errata. So this would mean that we can't even rely on the
> contents of the MCA banks, can we?
>
> In any case, is any of the information in the MCA banks in such cases
> even usable then? Because if not, we're definitely barking up the wrong
> tree...
See above - I think even if there is a bug in the core that isn't setting the
right bits in the MCi_STATUS register - we could get good data from
devices out in the uncore.
-Tony