Message-ID: <3908561D78D1C84285E8C5FCA982C28F2DA884F0@ORSMSX106.amr.corp.intel.com>
Date:	Wed, 19 Jun 2013 21:28:50 +0000
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Borislav Petkov <bp@...en8.de>
CC:	"Naveen N. Rao" <naveen.n.rao@...ux.vnet.ibm.com>,
	"ananth@...ibm.com" <ananth@...ibm.com>,
	"masbock@...ux.vnet.ibm.com" <masbock@...ux.vnet.ibm.com>,
	"lcm@...ux.vnet.ibm.com" <lcm@...ux.vnet.ibm.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
	"Huang, Ying" <ying.huang@...el.com>,
	"Robert Richter" <rric@...nel.org>
Subject: RE: [PATCH v2 2/2] mce: acpi/apei: Add a boot option to disable ff
 mode for corrected errors

> Ok, where is that semantics? What in a CPER record does say "this error
> should tell you that you need to offline the containing page and I'm
> telling you this exactly only once"? Error Severity 0, i.e. Recoverable?

Naveen - this one is for you (or for your BIOS team).  Can you get us a sample
CPER that you plan to provide when the BIOS decides that its threshold has
been exceeded?  How will it be different from what old WSM-EX platforms
were sending to us?  Hopefully the answer is encoded in the CPER record
and not in some code we have to put in Linux to say
"if (IBMplatform) do_thing_1(); else ... "

> Ok, we're talking about the S in RAS now. Do we have error recovery
> strategies specified anywhere? Are they per-platform or generic? Is this
> CPER strategy above, for example, only valid for some platforms or for
> all APEI-using hardware?

The mcelog(8) daemon has been doing this for years ... but it used the "predictive
failure analysis" buzzwords that were popular way back then (today the
marketing people seem to prefer "self healing"). Whatever the name, the
concept is the same ... take some set of corrected event reports and infer
from them that something worse may happen soon, and use that information
to try to avoid the (possibly) impending crash.
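
As a rough illustration of that concept (not mcelog's actual code), here is a
minimal sketch in C of a per-page corrected-error counter that asks for a page
to be offlined exactly once, when a threshold is crossed inside a time window.
All names (ce_tracker, ce_report, CE_THRESHOLD, CE_WINDOW_SECS) and the
threshold and window values are hypothetical, chosen only for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tuning values, picked only for illustration. */
#define MAX_PAGES      64
#define CE_THRESHOLD   3      /* corrected errors tolerated per window */
#define CE_WINDOW_SECS 86400  /* length of the counting window */

struct page_ce {
    uint64_t pfn;           /* page frame number the errors were seen on */
    uint64_t window_start;  /* timestamp of first error in this window */
    unsigned count;         /* corrected errors seen in this window */
    bool used;              /* slot is tracking a page */
    bool offlined;          /* offline request was already issued */
};

struct ce_tracker {
    struct page_ce pages[MAX_PAGES];
};

static void ce_tracker_init(struct ce_tracker *t)
{
    memset(t, 0, sizeof(*t));
}

/* Find the slot for pfn, claiming a free one if the page is new. */
static struct page_ce *ce_lookup(struct ce_tracker *t, uint64_t pfn)
{
    struct page_ce *free_slot = NULL;

    for (size_t i = 0; i < MAX_PAGES; i++) {
        struct page_ce *p = &t->pages[i];

        if (p->used && p->pfn == pfn)
            return p;
        if (!p->used && !free_slot)
            free_slot = p;
    }
    if (free_slot) {
        free_slot->used = true;
        free_slot->pfn = pfn;
    }
    return free_slot;  /* NULL when the table is full */
}

/*
 * Record one corrected error on pfn at time now (seconds).
 * Returns true exactly once per page: on the report that reaches the
 * threshold. Later reports for that page return false.
 */
static bool ce_report(struct ce_tracker *t, uint64_t pfn, uint64_t now)
{
    struct page_ce *p = ce_lookup(t, pfn);

    if (!p || p->offlined)
        return false;

    /* Start a fresh window on the first error or after expiry. */
    if (p->count == 0 || now - p->window_start > CE_WINDOW_SECS) {
        p->window_start = now;
        p->count = 0;
    }

    p->count++;
    if (p->count >= CE_THRESHOLD) {
        p->offlined = true;
        return true;
    }
    return false;
}
```

A caller would feed each corrected memory error report to ce_report() and act
only when it returns true. The policy lives in one place here, regardless of
whether the counting is done by firmware or by the OS.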

> Questions over questions...

Questions are good - they help fill out gaps.

-Tony
