Date:	Tue, 11 Nov 2014 18:44:17 +0000
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Borislav Petkov <bp@...en8.de>
CC:	Aravind Gopalakrishnan <aravind.gopalakrishnan@....com>,
	Chen Yucong <slaoub@...il.com>,
	"ak@...ux.intel.com" <ak@...ux.intel.com>,
	"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v3 1/2] x86, mce, severity: extend the the mce_severity
 mechanism to handle UCNA/DEFERRED error

>> The bank 7 error reported as severity 0 because EN=0 ... so we took no action for it.
>
> How come EN is 0? Bank7 error reporting is not enabled? Why? Or the
> error injection thing doesn't do it?

The "EN" bit is poorly named, and not well documented.  Here's a clip from the SDM:

One of the bullets in 15.10.4.1 "Machine-Check Exception Handler for Error Recovery":

 When the EN flag is zero but the VAL and UC flags are one in the
 IA32_MCi_STATUS register, the reported uncorrected error in this bank
 is not enabled. As uncorrected errors with the EN flag = 0 are not the
 source of machine check exceptions, the MCE handler should log and clear
 non-enabled errors when the S bit is set and should continue searching
 for enabled errors from the other IA32_MCi_STATUS registers. Note that
 when IA32_MCG_CAP [24] is 0, any uncorrected error condition (VAL =1
 and UC=1) including the one with the EN flag cleared are fatal and the
 handler must signal the operating system to reset the system. For the
 errors that do not generate machine check exceptions, the EN flag has
 no meaning. See Chapter 19: Table 19-15 to find the errors that do not
 generate machine check exceptions.

Unfortunately the reference to chapter 19 is stale (that is now all about
performance monitoring - I'll log a bug with the SDM editor to find the
right reference and fix this).

What this is trying to say is that the "EN" bit enables signaling of
machine checks, so it only has meaning when checking banks from the
machine check handler.  Errors that are logged but not signaled, or signaled
as CMCI, will have MCi_STATUS.EN=0.
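
To make the SDM bullet concrete, here is a rough sketch of how a machine check handler might sort one bank according to that paragraph. It is not the kernel's code: the enum, the function name, and the BANK_EXAMINE catch-all are made up for illustration. The bit positions (VAL=63, UC=61, EN=60, S=56 in MCi_STATUS; bit 24 of IA32_MCG_CAP) follow the SDM layout.

```c
#include <stdint.h>

/* IA32_MCi_STATUS bits used by the quoted SDM paragraph. */
#define MCI_STATUS_VAL  (1ULL << 63)  /* register holds a valid error */
#define MCI_STATUS_UC   (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN   (1ULL << 60)  /* signaling as #MC was enabled */
#define MCI_STATUS_S    (1ULL << 56)  /* signaled via machine check */

/* IA32_MCG_CAP[24]: software error recovery is supported. */
#define MCG_SER_P       (1ULL << 24)

/* Hypothetical outcomes, for illustration only. */
enum bank_action {
	BANK_SKIP,          /* nothing valid logged in this bank */
	BANK_LOG_AND_CLEAR, /* not the source of this #MC; log, clear, keep looking */
	BANK_FATAL,         /* no recovery support: any UC error means reset */
	BANK_EXAMINE        /* not decided by the quoted paragraph; grade further */
};

/* Assumes we are inside the #MC handler, where EN is meaningful. */
static enum bank_action classify_bank(uint64_t mcg_cap, uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return BANK_SKIP;

	/* The paragraph only talks about uncorrected errors. */
	if (!(status & MCI_STATUS_UC))
		return BANK_EXAMINE;

	/* MCG_CAP[24] == 0: every UC error is fatal, EN or not. */
	if (!(mcg_cap & MCG_SER_P))
		return BANK_FATAL;

	/* EN=0 with S=1: not the source of the exception. */
	if (!(status & MCI_STATUS_EN) && (status & MCI_STATUS_S))
		return BANK_LOG_AND_CLEAR;

	return BANK_EXAMINE;
}
```

The same EN test makes no sense when polling banks or handling CMCI, because EN is only defined for errors that could raise a machine check.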


>> The bank 3 error got past that hurdle, then through the next BIT(8) set indicates a
>> cache error. Fell at the last check because ADDRV=0.
>
> I guess you could tweak the injection path to write in a default address
> so that that check gets bypassed...

I don't think this is an injection artifact. I think on this processor the mid-level-cache
just isn't providing an address in this case.  It doesn't help to make one up - our whole
game plan is to offline a page with a UC error - and we must have an address to know
which page to offline.

Perhaps the severity table entries for UCNA and DEFERRED errors should look to see
if ADDRV is set - if not, don't report this as UCNA/DEFERRED?
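
Something along these lines, as a sketch only. This is plain C rather than severity table entries, the function and enum names are invented, and the DEFERRED bit position (44, AMD) is quoted from memory of the kernel's mce.h, so treat it as an assumption:

```c
#include <stdint.h>

#define MCI_STATUS_VAL      (1ULL << 63)
#define MCI_STATUS_UC       (1ULL << 61)
#define MCI_STATUS_ADDRV    (1ULL << 58)  /* MCi_ADDR holds a valid address */
#define MCI_STATUS_PCC      (1ULL << 57)  /* processor context corrupt */
#define MCI_STATUS_S        (1ULL << 56)
#define MCI_STATUS_AR       (1ULL << 55)  /* action required */
#define MCI_STATUS_DEFERRED (1ULL << 44)  /* assumed position of the AMD deferred bit */

/* Hypothetical grades, for illustration only. */
enum soft_sev { SEV_NONE, SEV_UCNA, SEV_DEFERRED };

/*
 * Report UCNA/DEFERRED only when the bank supplies an address:
 * without one there is no page to offline, so the grade is useless.
 */
static enum soft_sev grade_ucna_deferred(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return SEV_NONE;

	if (!(status & MCI_STATUS_ADDRV))
		return SEV_NONE;

	if (status & MCI_STATUS_DEFERRED)
		return SEV_DEFERRED;

	/* UCNA: uncorrected, but PCC, S and AR are all clear. */
	if ((status & MCI_STATUS_UC) &&
	    !(status & (MCI_STATUS_PCC | MCI_STATUS_S | MCI_STATUS_AR)))
		return SEV_UCNA;

	return SEV_NONE;
}
```

With that ordering the bank 3 case above (cache error, ADDRV=0) would fall out as SEV_NONE at the second check rather than at the end.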

-Tony
