[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120515163855.GE27806@aftab.osrc.amd.com>
Date: Tue, 15 May 2012 18:38:55 +0200
From: Borislav Petkov <bp@...64.org>
To: Mauro Carvalho Chehab <mchehab@...hat.com>
Cc: "Luck, Tony" <tony.luck@...el.com>,
Linux Edac Mailing List <linux-edac@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Doug Thompson <norsk5@...oo.com>,
Steven Rostedt <rostedt@...dmis.org>,
Frederic Weisbecker <fweisbec@...il.com>,
Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH v22] edac, ras/hw_event.h: use events to handle hw issues
On Tue, May 15, 2012 at 01:05:48PM -0300, Mauro Carvalho Chehab wrote:
> > Here's what an error looks like on my system here:
> >
> > mcegen.py-2868 [007] .N.. 178.261607: mc_event: Corrected error:amd64_edac on memory stick "unknown memory" (mc:0 csrow:3 channel:1 page:0x5bac7 offset:0x388 grain:0 syndrome:0x34ed )
> >
> > There's still this trailing " " at the end of the error line which
> > shouldn't be there and also two spaces between "channel" and "page".
>
> If you take a look at the trace printk:
>
> + TP_printk("%s error:%s on memory stick \"%s\" (mc:%d %s %s %s)",
> + (__entry->err_type == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
> + ((__entry->err_type == HW_EVENT_ERR_FATAL) ?
> + "Fatal" : "Uncorrected"),
> + __get_str(msg),
> + __get_str(label),
> + __entry->mc_index,
> + __get_str(location),
> + __get_str(detail),
> + __get_str(driver_detail))
>
> There are not extra spaces there. The first extra space is probably because
> there is an extra space at the label string. This should be easy to fix.
>
> The other extra space at the end is because amd64 currently doesn't provide
> driver_detail information.
Remind me again why do we need two strings: detail and driver_detail?
Because they could very well be lumped together with a single "%s"
format - "(mc:%d %s)" - and be printed.
And detail will always contain something which is not the empty string,
so problem solved.
> > Also, according to the output above "amd64_edac" is supposed to be
> > [error msg] which is strange.
> >
> > I believe this comes from this call in f1x_map_sysaddr_to_csrow():
> >
> > edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
> > page, offset, syndrome,
> > csrow, chan, -1,
> > EDAC_MOD_STR, "", NULL);
> >
> > I guess you want to do the following instead:
> >
> > mcegen.py-2868 [007] .N.. 178.261607: mc_event: amd64_edac: corrected error on memory stick "unknown memory" (mc:0 csrow:3 channel:1 page:0x5bac7 offset:0x388 grain:0 syndrome:0x34ed)
> >
> > maybe concatenate EDAC_MOD_STR with the proper string it reports, i.e.
> > corrected/uncorrected error?
>
> The issue here is because amd64_edac (just like a few other drivers) use
> its driver name (EDAC_MOD_STR) as the error message, instead of using
> something meaningful, like "read error" or "ECC error".
No, the issue is here that edac_mc_handle_ce() used to say "CE..."
and edac_mc_handle_ue() used to say "UE.. " and yours don't say that
anymore. In other words, you need to add the "CE/UE" thing to the string
based on the HW_EVENT_ERR_* flag or something to that effect.
[ … ]
> >> Of course, any userspace tools meant to handle errors should not parse
> >> the above data. They should, instead, use the binary fields provided by
> >> the tracepoint, mapping them directly into their MIBs.
> >
> > What is a MIB?
>
> Management Information Base. This is how anyone that works with Element
> Management calls the model of information that represents each management
> property. It is generally written using ITU-T ASN.1 syntax. Almost all
> management software use that.
>
> [1] http://en.wikipedia.org/wiki/Management_information_base
That looks like an ACPI or some other idiotic spec speak, pls remove it.
[ … ]
> >> + * edac_mc_handle_error - reports a memory event to userspace
> >> + *
> >> + * @type: severity of the error (CE/UE/Fatal)
> >> + * @mci: a struct mem_ctl_info pointer
> >> + * @page_frame_number: mem page where the error occurred
> >> + * @offset_in_page: offset of the error inside the page
> >> + * @syndrome: ECC syndrome
> >> + * @layer0: Memory layer0 position
> >> + * @layer1: Memory layer2 position
> >> + * @layer2: Memory layer3 position
> >> + * @msg: Message meaningful to the end users that
> >> + * explains the event
> >> + * @other_detail: Technical details about the event that
> >> + * may help hardware manufacturers and
> >> + * EDAC developers to analyse the event
> >
> > analyze it.
>
> Analyse is the same as analyze [2].
I know that. What I meant is
s/EDAC developers to analyse the event/EDAC developers to analyse it/
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists