linux-kernel - Re: [NAK] Re: [PATCH -v2 9/9] ACPI, APEI, Generic Hardware Error Source POLL/IRQ/NMI notification type support

Message-ID: <20101026062653.GA27411@liondog.tnic>
Date:	Tue, 26 Oct 2010 08:26:53 +0200
From:	Borislav Petkov <bp@...en8.de>
To:	Tony Luck <tony.luck@...il.com>
Cc:	Ingo Molnar <mingo@...e.hu>, Huang Ying <ying.huang@...el.com>,
	Len Brown <lenb@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Andi Kleen <andi@...stfloor.org>,
	"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
	Borislav Petkov <petkovbb@...glemail.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>, Don Zickus <dzickus@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mauro Carvalho Chehab <mchehab@...hat.com>
Subject: Re: [NAK] Re: [PATCH -v2 9/9] ACPI, APEI, Generic Hardware Error
 Source POLL/IRQ/NMI notification type support

On Mon, Oct 25, 2010 at 04:35:43PM -0700, Tony Luck wrote:
> On Mon, Oct 25, 2010 at 2:51 PM, Borislav Petkov <bp@...en8.de> wrote:
> > Concerning fatal errors, take a look at drivers/edac/mce_amd.(c|h)¹ -
> > this is not in arch/x86/ and still decodes MCEs in the kernel. And it
> > works fine - it even helped in several cases where people simply read
> > their serial console/dmesg and didn't have to collect it first and run
> > it through some tool to understand which functional unit in the CPU is
> > mchecking.
> 
> That looks neat ... but end-users seem to have some conflicting requirements
> here. Your users seem to like it but the LLNL folks at the S.F. meeting said
> that solutions that involved looking at console logs from thousands
> of machines in a cluster were not acceptable.
> 
> I doubt very much if any end-user cares which unit *within* a cpu
> failed (their replaceable unit is the whole of the cpu). So much of
> your driver could be replaced with: printk("CPU%d is bad\n", cpu);

Yeah, nobody said this is finished. The next step is using the perf
infrastructure to convey those decoded errors to userspace, say, to a
RAS daemon or similar which can do all sorts of reporting, statistics,
policy decisions, injection, graphing, whatever...

I sent out two patchsets as RFCs already and am working
on the third one, so we're getting there. Here's the latest one:
http://kerneltrap.org/mailarchive/linux-kernel/2010/8/6/4603847

Also, I'm open to all suggestions on how to make it more usable and
user-friendly.

Thanks.

-- 
Regards/Gruss,
    Boris.