Message-ID: <3908561D78D1C84285E8C5FCA982C28F3294198E@ORSMSX114.amr.corp.intel.com>
Date:	Wed, 19 Nov 2014 23:34:10 +0000
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Borislav Petkov <bp@...en8.de>,
	"ruiv.wang@...il.com" <ruiv.wang@...il.com>
CC:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"gong.chen@...ux.intel.com" <gong.chen@...ux.intel.com>,
	"Wang, Rui Y" <rui.y.wang@...el.com>
Subject: RE: [PATCH v3] x86/mce: Try printing all machine check banks known
 before panic

>> No information besides that it is a machine check. This happens in two cases:
>> 1) The CPU logs the error with the MCi_STATUS.EN bit set to zero, and Linux
>>    ignores EN=0 entries (as it should).

> Well, I guess we shouldn't anymore. Apparently hw forgets to set the
> bit when raising an MCE so then we should ignore it too in mce-severity
> and delete that piece or grade it as higher severity based on, I dunno,
> b0rked hardware family/model/stepping or whatever bit we set...
>
>        MCESEV(
>                NO, "Not enabled",
>                BITCLR(MCI_STATUS_EN)
>                ),

The SDM has this to say about EN=0 (in section 15.10.4.1 of volume 3B):

   When the EN flag is zero but the VAL and UC flags are one in
   the IA32_MCi_STATUS register, the reported uncorrected error
   in this bank is not enabled. As uncorrected errors with the
   EN flag = 0 are not the source of machine check exceptions,
   the MCE handler should log and clear non-enabled errors when
   the S bit is set and should continue searching for enabled
   errors from the other IA32_MCi_STATUS registers. Note that
   when IA32_MCG_CAP [24] is 0, any uncorrected error condition
   (VAL =1 and UC=1) including the one with the EN flag cleared
   are fatal and the handler must signal the operating system to
   reset the system. For the errors that do not generate machine
   check exceptions, the EN flag has no meaning.

Note the "should log and clear".  We currently only clear, so we just need to
shuffle some code in mce.c to add the logging.
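
As a rough illustration of the decision the SDM paragraph describes (not the
actual mce.c or mce-severity.c code), here is a minimal standalone sketch. The
bit positions are the architectural ones; the enum, the function names and the
choice to hand the "EN=0, S=0" case to normal grading are assumptions made for
the example, since the quoted text does not cover that case.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural bit positions from the SDM. */
#define MCG_SER_P       (1ULL << 24)  /* IA32_MCG_CAP[24] */
#define MCI_STATUS_VAL  (1ULL << 63)  /* register contents valid */
#define MCI_STATUS_UC   (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN   (1ULL << 60)  /* error reporting enabled */
#define MCI_STATUS_S    (1ULL << 56)  /* signaled */

/* Hypothetical result codes, for illustration only. */
enum bank_action {
	BANK_SKIP,           /* VAL=0: nothing logged in this bank */
	BANK_LOG_AND_CLEAR,  /* non-enabled UC error: log, clear, keep scanning */
	BANK_FATAL,          /* handler must ask the OS to reset */
	BANK_GRADE,          /* pass on to the normal severity grading */
};

/* Hypothetical helper: apply the quoted SDM rules to one bank. */
static enum bank_action classify_bank(uint64_t mcg_cap, uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return BANK_SKIP;

	/* EN has no meaning for errors that do not raise an MCE. */
	if (!(status & MCI_STATUS_UC))
		return BANK_GRADE;

	/* Without MCG_CAP[24], any VAL=1/UC=1 error is fatal, EN or not. */
	if (!(mcg_cap & MCG_SER_P))
		return BANK_FATAL;

	/* EN=0 with S=1: not the source of this MCE, so log and clear. */
	if (!(status & MCI_STATUS_EN) && (status & MCI_STATUS_S))
		return BANK_LOG_AND_CLEAR;

	/* Assumption: everything else goes through normal grading. */
	return BANK_GRADE;
}

/* Quick self-check of the two cases the quoted paragraph spells out. */
static void classify_bank_selftest(void)
{
	uint64_t st = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_S;

	assert(classify_bank(MCG_SER_P, st) == BANK_LOG_AND_CLEAR);
	assert(classify_bank(0, st) == BANK_FATAL);
}
```

The point of the sketch is only that the EN=0 branch ends in "log and clear",
where today we effectively stop at "clear".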

But we still need something like Rui's patch - calling mcelog() doesn't ensure that
we see something on the console about the possible cause of the problem.

-Tony
