Message-ID: <20101105134658.GA24828@aftab>
Date: Fri, 5 Nov 2010 14:46:58 +0100
From: Borislav Petkov <bp@...64.org>
To: Mauro Carvalho Chehab <mchehab@...radead.org>
Cc: "acme@...radead.org" <acme@...radead.org>,
"fweisbec@...il.com" <fweisbec@...il.com>,
"mingo@...e.hu" <mingo@...e.hu>,
"peterz@...radead.org" <peterz@...radead.org>,
"rostedt@...dmis.org" <rostedt@...dmis.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 00/20] RAS daemon v3
On Fri, Nov 05, 2010 at 08:02:34AM -0400, Mauro Carvalho Chehab wrote:
> I tried to apply your patches here, but they didn't apply. I suspect
> that Steven added some patches there in the meantime, as two patches
> in your series are already in his tree. IMO, it would be better if
> you could create a temporary tree or branch so that we can review
> it more easily.
Sure:
git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git ras-v3
> This example looks quite ugly to me. I doubt anyone without a
> datasheet and after a very careful inspection would know what
> 0x9c00410000010016 magic number means.
Right, this was only a hands-on example of what a script would otherwise
do. I wanted to show what happens in detail.
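
For readers without the datasheet at hand, the flag bits at the top of
such a value can be pulled apart with a few lines of script. This is
only a minimal sketch: decode_mci_status() is a name made up for
illustration, it covers just the architecturally defined flag bits
63..57 and the low 16-bit error code, and it leaves the vendor- and
bank-specific fields in between undecoded.

```python
# Sketch: split an MCi_STATUS value into its architecturally defined flags.
# decode_mci_status() is a hypothetical helper, not an existing kernel or
# tool interface.

FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # error overflow (other errors lost)
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MCi_MISC contents valid
    "ADDRV": 58,  # MCi_ADDR contents valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Return the flag bits and the low 16-bit error code as a dict."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    fields["error_code"] = status & 0xFFFF
    return fields


if __name__ == "__main__":
    for name, value in decode_mci_status(0x9c00410000010016).items():
        print(f"{name}: {value}")
```

For the value quoted above, this reports VAL, EN, MISCV and ADDRV as
set, UC and PCC as clear, and an error code of 0x16.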
> I suspect that writing a wrong magic number will also produce a
> completely undesired result.
That's not a problem since this is software-only injection. It actually
makes sense to be able to inject crap so that you can test the decoding
code:
[81953.494078] [Hardware Error]: MC5_STATUS: Uncorrected error, other errors lost: no, CPU context corrupt: yes, UECC Error
[81953.505714] [Hardware Error]: Corrupted FR MCE info?
[81953.505718] [Hardware Error]: Transaction: GEN (GEN), no timeout, Cache Level: L3/GEN, Participating Processor: GEN
> So, it would be better to keep the MCE code
> internal to the driver.
>
> Also, writing a magic number to a node named "status" seems weird to me.
>
> IMO, instead, it should be something like:
>
> echo 1 >/sys/devices/system/edac/mce/error_inject
Well, this way you inject a random error. But you want to control the
error types you inject and set not just one but several of the MCi_*
bank MSRs. That way, you can also inject the address at which a
certain MCE happens, and so on.
So, basically, the long term goal is to have a tool which could do all
that. Maybe something like this:
perf inject --mce --functional-unit DC --uncorrectable --pcc-corrupt --virtual-address 0xdeadbeef ...
or
perf inject --mce --functional-unit IC --random --correctable --ecc
(I have long options so that it's clear what we do - we can make them
shorter in the actual case.) But you get the idea. This way, you can
inject all kinds of stuff and also in a human-readable form.
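
To make the idea a bit more concrete, here is a rough sketch of how such
human-readable options could be mapped onto MCi_STATUS flag bits. None
of this is an existing perf feature: the option names simply mirror the
hypothetical command lines above, build_parser() and status_from_args()
are made-up names, and setting VAL and EN unconditionally is an
assumption of the sketch. The address is only parsed and handed back, it
is not translated in any way.

```python
# Sketch: translate human-readable injection options into MCi_STATUS bits.
# All option and function names are hypothetical; no real tool is driven.
import argparse

VAL = 1 << 63    # register holds a valid error
UC = 1 << 61     # uncorrected error
EN = 1 << 60     # error reporting enabled
ADDRV = 1 << 58  # MCi_ADDR contents valid
PCC = 1 << 57    # processor context corrupt


def build_parser():
    parser = argparse.ArgumentParser(prog="mce-inject-sketch")
    kind = parser.add_mutually_exclusive_group(required=True)
    kind.add_argument("--uncorrectable", action="store_true")
    kind.add_argument("--correctable", action="store_true")
    parser.add_argument("--pcc-corrupt", action="store_true")
    # int(s, 0) accepts both decimal and 0x-prefixed input.
    parser.add_argument("--virtual-address", type=lambda s: int(s, 0))
    return parser


def status_from_args(args):
    """Build the status value; VAL and EN are always set (assumption)."""
    status = VAL | EN
    if args.uncorrectable:
        status |= UC
    if args.pcc_corrupt:
        status |= PCC
    if args.virtual_address is not None:
        status |= ADDRV
    return status


# Example invocation with an explicit argument list instead of sys.argv.
example = build_parser().parse_args(
    ["--uncorrectable", "--pcc-corrupt", "--virtual-address", "0xdeadbeef"]
)
print(f"status = {status_from_args(example):#018x}")
print(f"addr   = {example.virtual_address:#018x}")
```

The point of the mutually exclusive group is that a contradictory
request such as --correctable together with --uncorrectable is rejected
up front, while raw values remain available for testing the decoder with
deliberately bogus input.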
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632