linux-kernel - Re: [PATCH v22] edac, ras/hw_event.h: use events to handle hw issues

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120511172430.GB17299@aftab.osrc.amd.com>
Date:	Fri, 11 May 2012 19:24:30 +0200
From:	Borislav Petkov <bp@...64.org>
To:	Mauro Carvalho Chehab <mchehab@...hat.com>
Cc:	"Luck, Tony" <tony.luck@...el.com>,
	Linux Edac Mailing List <linux-edac@...r.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Doug Thompson <norsk5@...oo.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH v22] edac, ras/hw_event.h: use events to handle hw issues

On Fri, May 11, 2012 at 09:37:48AM -0300, Mauro Carvalho Chehab wrote:
> > HW_ERR is the "official" prefix used by the MCE code in the kernel.
> > Maybe we can shorten it but it is needed to raise attention when staring
> > at dmesg output.
> > 
> > Now, since this tracepoint is not dmesg, we don't need it there at all
> > since we know that trace_mc_error reports memory errors.
> 
> IMO, we can get rid of it on trace, keeping it at dmesg.
> 
> > "mc_error" is also not needed.
> 
> Some name is required there. This parameter affects:
> 	- the name of the tracepoint function;
> 	- the TP_printk() output;
> 	- the trace filter name.
> 
> mc_event is probably a good name.

If this tracepoint is going to report memory errors and we drop HW_ERR,
then I guess "mc_error" is the most fitting one because it says what the
message is.

> >> The error count msg ("1 error(s)") could be replaced by "count:1".
> > 
> > Is there even a possibility to report more than one error when invoking
> > trace_mc_error once? If not, simply drop the count completely.
> 
> Good point. It makes sense to add a count parameter to the error handling core, 
> in order to handle "count". AFAIKT, the only two drivers that currently reports
> error counts are sb_edac and i7core_edac.

What does that mean? You report multiple errors with one tracepoint
invocation? How do you report error details then?

If you only report single errors, error count will always be 1 so drop
it.

> With regards to sb_edac, it also needs another logic to be handled by
> the core: channel_mask. This is required to handle lockstep mode and
> mirror mode.

What does "handle lockstep mode" mean and how is that important for me
staring at the tracepoint output.

[ … ]

> > Here's how it should look:
> > 
> > kworker/u:6-201   [007] .N..   161.136624: [Hardware Error]: memory read on memory stick "DIMM_A1" (type: corrected socket:0 mc:0 channel:0 slot:0 rank:1 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 channel_mask:1)
> 
> As already explained: the error_msg is not consistent among the drivers.
> 
> So, "memory read on ..." will work fine on sb_edac/i7core_edac but it _won't_
> work on other drivers.
> 
> Changing this field on each driver requires deep knowledge of each memory
> controller, and not all datasheets are publicly available. 
> 
> For example, this is the code for Freescale MPC85xx EDAC driver:
> 
> 	if (err_detect & DDR_EDE_SBE)
> 		edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
> 				     pfn, err_addr & ~PAGE_MASK, syndrome,
> 				     row_index, 0, -1,
> 				     mci->ctl_name, "", NULL);
> 
> 	if (err_detect & DDR_EDE_MBE)
> 		edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
> 				     pfn, err_addr & ~PAGE_MASK, syndrome,
> 				     row_index, 0, -1,
> 				     mci->ctl_name, "", NULL);
> 
> It uses a blank value for the error message. putting the error_msg at the beginning,
> as you proposed would print: 
> 
> [Hardware Error]:  on memory stick "DIMM_A1" (type: corrected socket:0 mc:0 channel:0 slot:0 )

that's fine since we know the tracepoint purpose:

 "mc_error:  on memory stick "DIMM_A1" (type: corrected socket:0 mc:0 channel:0 slot:0)"

is pretty clear to me, IMHO.

> > * count is gone
> 
> While count is not properly addressed, I'll keep it on sb_edac/i7core_edac.
> 
> The right thing here seems to move it to a core property and increment the error counters
> according with the error count.

Incrementing the error counters shouldn't have anything to do with the
error reporting.

> > * MC-drivers shouldn't say "error" when reporting an error
> > * UE/CE moves into the brackets
> 
> It seems that both Tony and me agrees that UE/CE/Fatal should be outside the brackets.

Why?

> > * socket moves earlier in the brackets, and keep the whole deal hierarchical.
> 
> Socket doesn't bring a very relevant information. It provides the CPU socket with the core that
> were running the code when an error was generated, not the CPU that were managing the memory.

Then drop it if it's not relevant.

> > * drop "err_code" what is that?
> 
> In this case:
> 	u32 errcode = GET_BITFIELD(m->status, 0, 15);

Either decode it like I do in amd_decode_err_code() or remove it
completely - no naked bit values.

> > * drop second "socket"
> > * drop "area" Area "DRAM" - are there other?
> 
> Yes. there are 4 types of area at sb_edac.

And they are?

> > * what is "channel_mask"?
> 
> It is a bitmask mask showing all channels affected by an error, on sb_edac.
> Handing it is on my TODO list.

What can the user do with it when it sees it reported?

> > * move "rank" to earlier
> 
> Why? This is the least relevant information provided by the driver-specific logic:
> 
> 	snprintf(msg, sizeof(msg),
> 		 "count:%d%s%s area:%s err_code:%04x:%04x socket:%d channel_mask:%ld rank:%d",
> 		 core_err_cnt,
> 		 overflow ? " OVERFLOW" : "",
> 		 (uncorrected_error && recoverable) ? " recoverable" : "",
> 		 area_type,
> 		 mscod, errcode,
> 		 socket,
> 		 channel_mask,
> 		 rank);
> 
> Error count, overflow, recoverable bit, etc are more relevant than it, as it actually affects
> the user.
> 
> Rank, on the other hand, might only help someone interested on replacing a DRAM chip. Still,
> it is doubtful that this would be used, in practice, as companies that replaces DIMM chip likely
> have some specific equipments to test the DIMMs.
> 
> So, I almost dropped it. The only reason for me to not drop is that it allows me to compare
> the driver results with some other testing tools I'm using on driver's development.

Then drop it and keep a local version for your testing needs only.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/