linux-kernel - Re: [NAK] Re: [PATCH -v2 9/9] ACPI, APEI, Generic Hardware Error Source POLL/IRQ/NMI notification type support

Message-ID: <AANLkTi=DEmss+3uX_5uw=VJDY9+cFPayjQkCQHa+r+Ye@mail.gmail.com>
Date:	Mon, 25 Oct 2010 14:23:12 -0700
From:	Tony Luck <tony.luck@...il.com>
To:	Borislav Petkov <bp@...en8.de>, Tony Luck <tony.luck@...il.com>,
	Ingo Molnar <mingo@...e.hu>, Huang Ying <ying.huang@...el.com>,
	Len Brown <lenb@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Andi Kleen <andi@...stfloor.org>,
	"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
	Borislav Petkov <petkovbb@...glemail.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>, Don Zickus <dzickus@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mauro Carvalho Chehab <mchehab@...hat.com>
Subject: Re: [NAK] Re: [PATCH -v2 9/9] ACPI, APEI, Generic Hardware Error
 Source POLL/IRQ/NMI notification type support

On Mon, Oct 25, 2010 at 1:23 PM, Borislav Petkov <bp@...en8.de> wrote:

> You may be right but what we actually want is a consistent RAS
> infrastructure. Didn't you point out at the last edac meeting in Boston
> that, concerning RAS, Linux was in the stone ages? (at least this is what
> I remember reading).

That meeting was in San Francisco - but your recollection is correct.
Right now we have ways to count errors, and to attribute them to
specific hardware components (if we are lucky). This is only the
beginning of the feature set that is needed to be "advanced RAS".

> What we should do is put all that post-system-reset error info, ECC
> errors mapping to DRAM devices, L3 cache index manipulation based on
> excessive errors - you name it - together and stick it in ras/ or
> drivers/ras or whatever. And all with a nice and easy to use userspace
> tool on top.

This is what we should be working towards.  I don't think we have
a clear picture of what that high level infrastructure looks like. It
needs to be very flexible to take input from all sorts of platform
specific "driver" code that collects data.  The "perf events"
mechanism looks plausible as a transport mechanism for
reporting corrected (or otherwise non-fatal) events. But the
errors that didn't kill the system are only part of the RAS picture.

> Now it looks like a wart on arch/x86/ which truly doesn't belong there.
> And I don't buy all that crap that it can't be done right.

Of course it is a wart ... look up ACPI in any dictionary and you'll
find a picture of a stereotypical Halloween witch :-) I don't see other
architectures lining up to support ACPI ... but we shouldn't just
ignore it in x86.  The APEI pieces that were added to ACPI 4.0
have some interesting and useful features.  Most of them are
already implemented on shipping platforms because the APEI
bits were simply documenting WHEA (Windows Hardware Error
Architecture) features.  Look for this stuff in dmesg:

ACPI: HEST 000000007fb1c000 000A8 (v01 INTEL    SFC4UR 00000001 INTL 00000001)
ACPI: BERT 000000007fb1b000 00030 (v01 INTEL    SFC4UR 00000001 INTL 00000001)
ACPI: ERST 000000007fb1a000 00230 (v01 INTEL    SFC4UR 00000001 INTL 00000001)
ACPI: EINJ 000000007fb19000 00130 (v01 INTEL    SFC4UR 00000001 INTL 00000001)
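
For anyone who wants to check a machine for these tables, here is a
minimal sketch in Python. It assumes the boot log uses the line layout
quoted above (signature, address, length, then a parenthesised
revision and OEM block). The function name find_apei_tables and the
regular expression are illustrative and not part of any kernel tool.
The optional "0x" prefix on the address is an assumption to tolerate
logs that print it that way.

```python
import re

# APEI table signatures named in the message above.
APEI_TABLES = ("HEST", "BERT", "ERST", "EINJ")

# Assumed layout: "ACPI: <SIG> <address> <length> (v<rev> ..."
LINE_RE = re.compile(
    r"ACPI:\s+(?P<sig>[A-Z0-9]{4})\s+"
    r"(?:0x)?(?P<addr>[0-9A-Fa-f]+)\s+"
    r"(?P<length>[0-9A-Fa-f]+)\s+"
    r"\(v(?P<rev>\d+)"
)


def find_apei_tables(dmesg_text):
    """Return {signature: {address, length, revision}} for APEI tables."""
    found = {}
    for line in dmesg_text.splitlines():
        match = LINE_RE.search(line)
        if match and match.group("sig") in APEI_TABLES:
            found[match.group("sig")] = {
                "address": int(match.group("addr"), 16),
                "length": int(match.group("length"), 16),
                "revision": int(match.group("rev")),
            }
    return found


# The four lines quoted in the message, used as sample input.
SAMPLE = """\
ACPI: HEST 000000007fb1c000 000A8 (v01 INTEL    SFC4UR 00000001 INTL 00000001)
ACPI: BERT 000000007fb1b000 00030 (v01 INTEL    SFC4UR 00000001 INTL 00000001)
ACPI: ERST 000000007fb1a000 00230 (v01 INTEL    SFC4UR 00000001 INTL 00000001)
ACPI: EINJ 000000007fb19000 00130 (v01 INTEL    SFC4UR 00000001 INTL 00000001)
"""

tables = find_apei_tables(SAMPLE)
for sig in APEI_TABLES:
    status = "present" if sig in tables else "missing"
    print(f"{sig}: {status}")
```

To check a live system, pass the saved output of dmesg to
find_apei_tables() in place of SAMPLE. A table missing from the log
only means the firmware did not advertise it.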

-Tony