[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4FA7F725.6070207@redhat.com>
Date: Mon, 07 May 2012 13:24:05 -0300
From: Mauro Carvalho Chehab <mchehab@...hat.com>
To: Borislav Petkov <bp@...64.org>
CC: Linux Edac Mailing List <linux-edac@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Doug Thompson <norsk5@...oo.com>,
Borislav Petkov <borislav.petkov@....com>
Subject: Re: [EDAC_ABI PATCH v13 01/26] amd64_edac: convert driver to use
the new edac ABI
Em 07-05-2012 13:12, Mauro Carvalho Chehab escreveu:
> Em 07-05-2012 11:31, Borislav Petkov escreveu:
>> Pasting latest version here:
>>
>>> From bdd1d4a73e48676e1aaab0dc41fca81e42d5e644 Mon Sep 17 00:00:00 2001
>>> From: Mauro Carvalho Chehab <mchehab@...hat.com>
>>> Date: Mon, 16 Apr 2012 15:03:50 -0300
>>> Subject: [PATCH] amd64_edac: convert driver to use the new edac ABI
>>>
>>> The legacy edac ABI is going to be removed. Port the driver to use
>>> and benefit from the new API functionality.
>>>
>>> Cc: Doug Thompson <norsk5@...oo.com>
>>> Cc: Borislav Petkov <borislav.petkov@....com>
>>> Signed-off-by: Mauro Carvalho Chehab <mchehab@...hat.com>
>>
>> Btw,
>>
>> now when I inject a correctable ECC error, I get:
>>
>> [ 2377.733708] [Hardware Error]: CPU:0 MC4_STATUS[-|CE|-|-|AddrV|CECC]: 0x945f4000b1080a13
>> [ 2377.742143] [Hardware Error]: MC4_ADDR: 0x000000005bac7388
>> [ 2377.742151] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
>> [ 2377.742164] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x5bac7388
>> [ 2377.742172] EDAC DEBUG: f1x_match_to_this_node: (range 0) SystemAddr= 0x5bac7388 Limit=0x437ffffff
>> [ 2377.742183] EDAC DEBUG: f1x_match_to_this_node: Normalized DCT addr: 0x2dd63980
>> [ 2377.742190] EDAC DEBUG: f1x_lookup_addr_in_dct: input addr: 0x2dd63980, DCT: 1
>> [ 2377.742199] EDAC DEBUG: f1x_lookup_addr_in_dct: CSROW=0 CSBase=0x0 CSMask=0xffffffe1fff9ffff
>> [ 2377.742206] EDAC DEBUG: f1x_lookup_addr_in_dct: (InputAddr & ~CSMask)=0x60000 (CSBase & ~CSMask)=0x0
>> [ 2377.742215] EDAC DEBUG: f1x_lookup_addr_in_dct: CSROW=1 CSBase=0x20000 CSMask=0xffffffe1fff9ffff
>> [ 2377.742223] EDAC DEBUG: f1x_lookup_addr_in_dct: (InputAddr & ~CSMask)=0x60000 (CSBase & ~CSMask)=0x20000
>> [ 2377.742231] EDAC DEBUG: f1x_lookup_addr_in_dct: CSROW=2 CSBase=0x40000 CSMask=0xffffffe1fff9ffff
>> [ 2377.742239] EDAC DEBUG: f1x_lookup_addr_in_dct: (InputAddr & ~CSMask)=0x60000 (CSBase & ~CSMask)=0x40000
>> [ 2377.742247] EDAC DEBUG: f1x_lookup_addr_in_dct: CSROW=3 CSBase=0x60000 CSMask=0xffffffe1fff9ffff
>> [ 2377.742255] EDAC DEBUG: f1x_lookup_addr_in_dct: (InputAddr & ~CSMask)=0x60000 (CSBase & ~CSMask)=0x60000
>> [ 2377.742262] EDAC DEBUG: f1x_lookup_addr_in_dct: MATCH csrow=3
>> [ 2377.742279] EDAC MC0: CE amd64_edac on unknown memory (csrow 3 channel 1 page 0x5bac7 offset 0x388 grain 0 syndrome 0xb1be )
>> ^^^^^^^^^^^^^^
>>
>> which is misleading. I think that removing the line from
>> edac_mc_handle_error() is better:
>>
>> - if (p == label)
>> - strcpy(label, "unknown memory");
>>
>> because this way we don't puzzle the user even more with EDAC-internal
>> details of how we represent DIMMs and ranks etc.
>
> That happens because the EDAC core couldn't find any EDAC labels for the affected
> memory.
>
> There are two reasons for seeing a "unknown memory":
> 1) there's no label information. This is fixed by:
> http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commit;h=b14dbb9b8220f8e07634125bf0e315f739cbf501
> 2) there's no info about the label e. g. something like the old
> edac_mc_handle_ce_no_info().
>
> As csrow and channel info is filled on your log, it is very likely to be
> caused by (1) and that you didn't load the labels for this system with edac-ctl.
>
> If you had a table with the labels for your motherboard at /etc/edac/labels.db
> and run "edac-ctl --load" during your system init (that's the default on RHEL/Fedora,
> not sure what other distros do), the message would be like:
>
> EDAC MC0: CE amd64_edac on DIMM_4B (csrow 3 channel 1 page 0x5bac7 offset 0x388 grain 0 syndrome 0xb1be )
>
>> IOW, simply having:
>>
>> [ 2377.742279] EDAC MC0: CE amd64_edac (csrow:3 channel:1 page:0x5bac7 offset:0x388 grain:0 syndrome:0xb1be)
>>
>> is much better IMO.
>>
>> Also, formatting that info with ":" makes the data even easily
>> understandable and parseable.
>
> Ok. I'll write a patch to replace it at the end of the series.
>
>> Also, you have a trailing space at the end: "... syndrome 0xb1be <---HERE)
>>
>> which needs to be removed.
>
> I'll do it also at the printk cleanup patch at the end of the series.
edac: Improve EDAC handle error logs
From: Mauro Carvalho Chehab <mchehab@...hat.com>
As suggested by Borislav:
- Instead of using space to split between a field and its value,
use ':';
- when no driver-specific error details are provided via
"other_detail" field, don't add an extra space.
Reported-by: Borislav Petkov <bp@...64.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@...hat.com>
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 4f4953c..0adbfa9 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -951,11 +951,18 @@ static void edac_ce_error(struct mem_ctl_info *mci,
{
unsigned long remapped_page;
- if (edac_mc_get_log_ce())
- edac_mc_printk(mci, KERN_WARNING,
- "CE %s on %s (%s%s %s)\n",
- msg, label, location,
- detail, other_detail);
+ if (edac_mc_get_log_ce()) {
+ if (other_detail && *other_detail)
+ edac_mc_printk(mci, KERN_WARNING,
+ "CE %s on %s (%s%s - %s)\n",
+ msg, label, location,
+ detail, other_detail);
+ else
+ edac_mc_printk(mci, KERN_WARNING,
+ "CE %s on %s (%s%s)\n",
+ msg, label, location,
+ detail);
+ }
edac_inc_ce_error(mci, enable_per_layer_report, pos);
if (mci->scrub_mode & SCRUB_SW_SRC) {
@@ -988,14 +995,26 @@ static void edac_ue_error(struct mem_ctl_info *mci,
const char *other_detail,
const bool enable_per_layer_report)
{
- if (edac_mc_get_log_ue())
- edac_mc_printk(mci, KERN_WARNING,
- "UE %s on %s (%s%s %s)\n",
- msg, label, location, detail, other_detail);
+ if (edac_mc_get_log_ue()) {
+ if (other_detail && *other_detail)
+ edac_mc_printk(mci, KERN_WARNING,
+ "UE %s on %s (%s%s - %s)\n",
+ msg, label, location, detail,
+ other_detail);
+ else
+ edac_mc_printk(mci, KERN_WARNING,
+ "UE %s on %s (%s%s)\n",
+ msg, label, location, detail);
+ }
- if (edac_mc_get_panic_on_ue())
- panic("UE %s on %s (%s%s %s)\n",
- msg, label, location, detail, other_detail);
+ if (edac_mc_get_panic_on_ue()) {
+ if (other_detail && *other_detail)
+ panic("UE %s on %s (%s%s - %s)\n",
+ msg, label, location, detail, other_detail);
+ else
+ panic("UE %s on %s (%s%s)\n",
+ msg, label, location, detail);
+ }
edac_inc_ue_error(mci, enable_per_layer_report, pos);
}
@@ -1139,7 +1158,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
if (pos[i] < 0)
continue;
- p += sprintf(p, "%s %d ",
+ p += sprintf(p, "%s:%d ",
edac_layer_name[mci->layers[i].type],
pos[i]);
}
@@ -1147,12 +1166,12 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
/* Memory type dependent details about the error */
if (type == HW_EVENT_ERR_CORRECTED)
snprintf(detail, sizeof(detail),
- "page 0x%lx offset 0x%lx grain %d syndrome 0x%lx",
+ "page:0x%lx offset:0x%lx grain:%d syndrome:0x%lx",
page_frame_number, offset_in_page,
grain, syndrome);
else
snprintf(detail, sizeof(detail),
- "page 0x%lx offset 0x%lx grain %d",
+ "page:0x%lx offset:0x%lx grain:%d",
page_frame_number, offset_in_page, grain);
/* Report the error via the trace interface */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists