Message-ID: <20101026062653.GA27411@liondog.tnic>
Date: Tue, 26 Oct 2010 08:26:53 +0200
From: Borislav Petkov <bp@...en8.de>
To: Tony Luck <tony.luck@...il.com>
Cc: Ingo Molnar <mingo@...e.hu>, Huang Ying <ying.huang@...el.com>,
Len Brown <lenb@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Andi Kleen <andi@...stfloor.org>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
Borislav Petkov <petkovbb@...glemail.com>,
Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>, Don Zickus <dzickus@...hat.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Mauro Carvalho Chehab <mchehab@...hat.com>
Subject: Re: [NAK] Re: [PATCH -v2 9/9] ACPI, APEI, Generic Hardware Error
Source POLL/IRQ/NMI notification type support

On Mon, Oct 25, 2010 at 04:35:43PM -0700, Tony Luck wrote:
> On Mon, Oct 25, 2010 at 2:51 PM, Borislav Petkov <bp@...en8.de> wrote:
> > Concerning fatal errors, take a look at drivers/edac/mce_amd.(c|h)¹ -
> > this is not in arch/x86/ and still decodes MCEs in the kernel. And it
> > works fine - it even helped in several cases where people simply read
> > their serial console/dmesg and didn't have to collect it first and run
> > it through some tool to understand which functional unit in the CPU is
> > mchecking.
>
> That looks neat ... but end-users seem to have some conflicting requirements
> here. Your users seem to like it but the LLNL folks at the S.F. meeting said
> that solutions that involved looking at console logs from thousands
> of machines in a cluster were not acceptable.
>
> I doubt very much if any end-user cares which unit *within* a cpu
> failed (their replaceable unit is the whole of the cpu). So much of
> your driver could be replaced with: printk("CPU%d is bad\n", cpu);

Yeah, nobody said this is finished. The next step is using perf
infrastructure to convey those decoded errors to userspace, say, to a
ras daemon or similar which can do all sorts of reporting, statistics,
policy decisions, injection, paint graphs, whatever...
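
To make the ras daemon idea a bit more concrete, here is a rough userspace
sketch of the accounting and policy side. It is an illustration only and
not code from the patchsets: the record layout (struct decoded_err), the
helper names (ras_init, ras_account, ras_format) and the threshold of
three corrected errors are all assumptions, and how the records actually
arrive from the kernel is left out entirely.

```c
/* Hypothetical userspace sketch; none of these names exist in the kernel. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CPUS     4096
#define CE_THRESHOLD 3	/* assumed policy: warn after 3 corrected errors */

/* Assumed shape of one decoded error record as seen by the daemon. */
struct decoded_err {
	unsigned int cpu;	/* CPU that reported the error */
	unsigned int bank;	/* MCA bank number (assumed to identify the unit) */
	int fatal;		/* non-zero for an uncorrected error */
};

/* Per-CPU counters kept by the daemon. */
struct ras_stats {
	unsigned int corrected[MAX_CPUS];
	unsigned int fatal[MAX_CPUS];
};

enum ras_action { RAS_IGNORE, RAS_WARN, RAS_REPLACE };

static void ras_init(struct ras_stats *s)
{
	assert(s);
	memset(s, 0, sizeof(*s));
}

/* Account one record and return what the (assumed) policy says to do. */
static enum ras_action ras_account(struct ras_stats *s,
				   const struct decoded_err *e)
{
	assert(s && e);

	if (e->cpu >= MAX_CPUS)
		return RAS_IGNORE;

	if (e->fatal) {
		s->fatal[e->cpu]++;
		return RAS_REPLACE;
	}

	if (++s->corrected[e->cpu] >= CE_THRESHOLD)
		return RAS_WARN;

	return RAS_IGNORE;
}

/* One-line report, e.g. for a cluster-wide collector instead of dmesg. */
static int ras_format(char *buf, size_t len, const struct decoded_err *e,
		      enum ras_action a)
{
	static const char *const names[] = { "ignore", "warn", "replace" };

	return snprintf(buf, len, "cpu=%u bank=%u action=%s",
			e->cpu, e->bank, names[a]);
}
```

The point of the split is that the kernel only decodes and hands over
records, while thresholds and reporting live in userspace where a site
can change them without touching the kernel.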
I have already sent out two patchsets as an RFC and am working
on the third one, so we're getting there. Here's the latest one:
http://kerneltrap.org/mailarchive/linux-kernel/2010/8/6/4603847
Also, I'm open to all suggestions on how to make it more usable and
user-friendly.
Thanks.
--
Regards/Gruss,
Boris.