Message-ID: <4F35270F.1020402@redhat.com>
Date: Fri, 10 Feb 2012 12:17:51 -0200
From: Mauro Carvalho Chehab <mchehab@...hat.com>
To: Borislav Petkov <bp@...64.org>
CC: Linux Edac Mailing List <linux-edac@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3 01/31] events/hw_event: Create a Hardware Events Report
Mecanism (HERM)
On 10-02-2012 11:41, Borislav Petkov wrote:
> On Thu, Feb 09, 2012 at 10:01:00PM -0200, Mauro Carvalho Chehab wrote:
>> In order to provide a proper hardware event subsystem, let's
>> encapsulate hardware events into a common trace facility, and
>> make both edac and mce drivers to use it. After that, common
>> facilities can be moved into a new core for hardware events
>> reporting subsystem. This patch is the first of a series, and just
>> touches at mce.
>
> I think it would work too if you had only one event:
>
> * trace_hw_error(...)
>
> which would have as an argument a string describing it, like
> "Uncorrected Memory Read Error", "Memory Read Error (out of range)" "TLB
> Multimatch Error" etc., followed by the rest of the error info.
>
> Currently, you're introducing at least 5 trace_* calls _only_ for memory
> errors. What about the remaining couples of tens of errors which haven't
> been addressed yet?
Good point.
The way I see it:
- a non-memory-related, non-parsed MCE event would generate an "mce_record" trace
(we need an additional patch to disable it when the error is parsed;
I'll address that after finishing the tests on a few other platforms).
As more MCE parsers are added to the core, the situations where this event is
generated will become rarer, and should eventually disappear in the long term;
- a non-x86 event (or an x86 event for a memory controller that is not addressed
by MCE events) will use "mc_error";
- an x86 event generated via MCE will use "mc_error_mce".
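To make the selection above concrete, here is a minimal sketch in plain userspace C. It is not code from the patch series: struct hw_err, its two flags, the enum and pick_trace_event() are hypothetical names used only for illustration, and "parsed" simply stands for "a core parser decoded this error" (at this stage, only memory errors).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of an incoming error; not a kernel structure. */
struct hw_err {
	bool from_mce;	/* reported through the x86 MCE path */
	bool parsed;	/* a core parser managed to decode it */
};

/* Stand-ins for the three trace events named in this mail. */
enum hw_trace_event {
	EV_MCE_RECORD,		/* "mce_record" */
	EV_MC_ERROR,		/* "mc_error" */
	EV_MC_ERROR_MCE,	/* "mc_error_mce" */
};

/* Pick the event according to the three rules listed above. */
enum hw_trace_event pick_trace_event(const struct hw_err *e)
{
	assert(e != NULL);

	if (!e->from_mce)
		return EV_MC_ERROR;	/* non-x86, or not addressed by MCE */
	if (!e->parsed)
		return EV_MCE_RECORD;	/* raw MCE, nothing decoded it */
	return EV_MC_ERROR_MCE;		/* MCE that was decoded */
}
```

As more parsers are added, fewer errors would take the EV_MCE_RECORD branch, which matches the expectation that "mce_record" fades out over time.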
There are two special events defined for when there's a memory error _and_ a driver
bug:
"mc_out_of_range_mce" and "mc_out_of_range".
While their names and one of their parameters are memory-controller specific,
it should be easy to make them generic enough to be used by other types of errors.
The previous EDAC logic was to generate an out-of-range printk and return. With
the changes I made, it is possible to let EDAC provide the information it
parsed, discarding only the badly parsed value. That's the approach I took, as the
other information there may be useful. With this approach, the MCE information
will be shown by the "mc_error_mce" trace. So, we can remove "mc_out_of_range_mce"
without losing any information.
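A rough userspace sketch of that behaviour, under my own assumptions: the event log, emit() and report_mem_error() are hypothetical stand-ins for the real trace_*() calls, and using -1 to mark a discarded value is only an illustrative convention.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 4

/* Stand-ins for the two trace events involved here. */
enum hw_trace_event {
	EV_MC_OUT_OF_RANGE,	/* "mc_out_of_range": driver bug report */
	EV_MC_ERROR_MCE,	/* "mc_error_mce": the error itself */
};

/* Hypothetical sink recording what would have been traced. */
struct event_log {
	enum hw_trace_event events[MAX_EVENTS];
	size_t count;
};

static void emit(struct event_log *log, enum hw_trace_event ev)
{
	if (log->count < MAX_EVENTS)
		log->events[log->count++] = ev;
}

/*
 * Report a memory error whose parsed position may be bogus.
 * Instead of printing and returning early, flag the driver bug,
 * discard only the bad value, and still report everything else.
 * Returns the position actually reported, or -1 if it was discarded.
 */
int report_mem_error(struct event_log *log, int pos, int num_positions)
{
	assert(log != NULL);

	if (pos < 0 || pos >= num_positions) {
		emit(log, EV_MC_OUT_OF_RANGE);
		pos = -1;	/* drop the badly parsed value only */
	}
	emit(log, EV_MC_ERROR_MCE);	/* MCE info is always reported */
	return pos;
}
```

A badly parsed error therefore produces two events, and a well-parsed one produces only the second, so no separate *_mce out-of-range event is needed to carry the MCE data.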
In any case, we can't merge the *_mce events with their non-MCE variants, as the mce.h
header is arch-specific and doesn't exist on the PPC and Tilera architectures.
So, the only event that we can actually remove is "mc_out_of_range_mce", if we let
the core generate two events for badly parsed errors. What do you think?
Regards,
Mauro
>
> Thanks.
>