Message-ID: <20120227155426.GD3970@aftab>
Date: Mon, 27 Feb 2012 16:54:26 +0100
From: Borislav Petkov <bp@...64.org>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: Steven Rostedt <rostedt@...dmis.org>,
Mauro Carvalho Chehab <mchehab@...hat.com>,
Ingo Molnar <mingo@...e.hu>,
edac-devel <linux-edac@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: RAS trace event proto
Hi Tony,
On Wed, Feb 22, 2012 at 04:59:48PM +0100, Borislav Petkov wrote:
> On Wed, Feb 22, 2012 at 11:43:24AM +0100, Borislav Petkov wrote:
> > This will keep the bloat level to a minimum, keep the TPs apart and
> > hopefully make all of us happy :).
>
> Btw, here's how the rough MCE TP trace_mce_record() looks like:
>
> mcegen.py-2715 [001] .N.. 1049.818840: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[Over|UE|-|PCC|AddrV|UECC]: 0xf604a00006080a41
> [Hardware Error]: MC4_ADDR: 0xbabedeaddeadbeef
> [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected
> (CPU: 0, MCGc/s: 0/0, MC4: f604a00006080a41, ADDR/MISC: babedeaddeadbeef/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 0:0, TIME: 0, SOCKET: 0, APIC: 0)
>
> Basically, the userspace daemon will consume the error string (after
> it's been massaged into looking prettier and smaller :-)) (1st arg)
> and dump it to some logs, and use some of the MCE fields to do error
> collection and thresholding/ratelimiting/whatever.
>
> While at it, I'm also looking very critically at the fields SOCKET,
> APIC, TSC (we have walltime) for I'd like to drop them. Also, MC4 should
> be MC4_STATUS btw.
>
> To be continued...
New week, new stuff.
Here's what the MCE TP looks like with a couple of MCEs injected:
mcegen.py-2318 [001] .N.. 580.902409: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|PCC|AddrV|CECC]: 0xd604c00006080a41 MC4_ADDR: 0x0000000000000016
[Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
[Hardware Error]: ERR_ADDR: 0x16 row: 0, channel: 0
[Hardware Error]: cache level: L1, mem/io: MEM, mem-tx: DWR, part-proc: RES (no timeout)
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: d604c00006080a41, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)
mcegen.py-2326 [001] .N.. 598.795494: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[Over|UE|MiscV|PCC|-|UECC]: 0xfa002000001c011b[Hardware Error]: Northbridge Error (node 0): L3 ECC data cache error.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: fa002000001c011b, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)
mcegen.py-2343 [013] .N.. 619.620698: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[-|UE|MiscV|PCC|-|UECC]: 0xba002100000f001b[Hardware Error]: Northbridge Error (node 0): GART Table Walk data error.
[Hardware Error]: cache level: L3/GEN, tx: GEN
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: ba002100000f001b, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)
Basically, all lines of each record except the last one are the string
message generated by the decoding code and collected into the RAS decode
buffer using ras_printk. Btw, the buffer enlarges itself on demand when
the decoding info comes close to filling it up.
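To illustrate the "enlarges itself on demand" part, here is a toy userspace model in Python. It is only a sketch of the idea, not the kernel code: the class name, the doubling policy and the headroom threshold are all assumptions made for the example.

```python
class DecodeBuffer:
    """Toy model of a decode buffer that grows before it fills up.

    Hypothetical illustration only; names and sizes are made up.
    """

    def __init__(self, capacity=64, headroom=16):
        self.buf = bytearray(capacity)  # fixed-size backing store
        self.used = 0                   # number of bytes written so far
        self.headroom = headroom        # "close to full" threshold

    def append(self, text):
        data = text.encode("ascii")
        # Double the backing store until the new data fits with
        # at least `headroom` bytes left over.
        while self.used + len(data) + self.headroom > len(self.buf):
            self.buf.extend(bytes(len(self.buf)))
        self.buf[self.used:self.used + len(data)] = data
        self.used += len(data)

    def contents(self):
        return self.buf[:self.used].decode("ascii")


buf = DecodeBuffer(capacity=32, headroom=8)
buf.append("[Hardware Error]: Northbridge Error (node 0): ")
buf.append("DRAM ECC error detected on the NB.\n")
```

The caller never checks for space; the buffer takes care of growing while the decoder keeps appending.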
The last line is the MCE TP itself with the IMO useless fields removed
(PROCESSOR, SOCKET and APIC are gone compared to last week's output); it
is what the RAS daemon in userspace will use.
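As a rough idea of how a userspace consumer could pick that last line apart, here is a Python sketch. The regex is derived from the sample output above only, so the field formats (hex vs. decimal) are guesses, and parse_record() and decode_status() are hypothetical names. The status bit positions are the usual MCi_STATUS ones (Over=62, UC=61, MiscV=59, AddrV=58, PCC=57), with bits 46:45 taken as the CECC/UECC indication shown in the AMD samples.

```python
import re

# Pattern guessed from the sample trace lines; not an official format.
FIELDS_RE = re.compile(
    r"CPU: (?P<cpu>\d+), "
    r"MCGc/s: (?P<mcgcap>[0-9a-f]+)/(?P<mcgstatus>[0-9a-f]+), "
    r"MC(?P<bank>\d+): (?P<status>[0-9a-f]+), "
    r"ADDR/MISC: (?P<addr>[0-9a-f]+)/(?P<misc>[0-9a-f]+), "
    r"RIP: (?P<cs>[0-9a-f]+):<(?P<ip>[0-9a-f]+)>, "
    r"TSC: (?P<tsc>[0-9a-f]+), "
    r"TIME: (?P<time>\d+)"
)

DECIMAL_FIELDS = {"cpu", "bank", "time"}

STATUS_BITS = {"Over": 62, "UC": 61, "MiscV": 59, "AddrV": 58, "PCC": 57}


def decode_status(status):
    """Return the flags printed in the MCx_STATUS[...] summary."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}
    ecc = (status >> 45) & 0x3  # bit 46: CECC, bit 45: UECC
    flags["ecc"] = {1: "UECC", 2: "CECC"}.get(ecc)
    return flags


def parse_record(line):
    """Parse the field line of one record; return None if it does not match."""
    m = FIELDS_RE.search(line)
    if m is None:
        return None
    rec = {
        key: int(val, 10 if key in DECIMAL_FIELDS else 16)
        for key, val in m.groupdict().items()
    }
    rec["flags"] = decode_status(rec["status"])
    return rec


sample = ("[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: d604c00006080a41, "
          "ADDR/MISC: 0000000000000016/dead57ac1ba0babe, "
          "RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)")
rec = parse_record(sample)
```

For the first injected error this gives bank 4, address 0x16 and the flags Over, AddrV, PCC and CECC set, matching the MC4_STATUS[Over|CE|-|PCC|AddrV|CECC] summary. Something like rec["bank"] and rec["flags"] would then be the natural keys for error collection and thresholding.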
I'll be splitting the single patch into multiple, more digestible chunks
for review now.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551