Message-ID: <20120511102527.GI8913@aftab.osrc.amd.com>
Date:	Fri, 11 May 2012 12:25:27 +0200
From:	Borislav Petkov <bp@...64.org>
To:	Mauro Carvalho Chehab <mchehab@...hat.com>
Cc:	"Luck, Tony" <tony.luck@...el.com>,
	Linux Edac Mailing List <linux-edac@...r.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Doug Thompson <norsk5@...oo.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH v22] edac, ras/hw_event.h: use events to handle hw issues

On Thu, May 10, 2012 at 10:48:40PM -0300, Mauro Carvalho Chehab wrote:
> On 10-05-2012 19:37, Luck, Tony wrote:
> >      kworker/u:6-201   [007] .N..   186.197280: mc_error: [Hardware Error]: mem_ctl#0: Corrected error memory read error on memory stick "DIMM_A1" (channel:0 slot:1  page:0x2f1eb3 offset:0x446 grain:32 syndrome:0x0 1 error(s): Unknown: Err=0001:0090 socket=0 channel=0/mask=1 rank=5)
> >      
> > The word "error" appears *five* times on this line (once with a capital E).
> > I feel beaten, bruised and ready to give up on this machine with just one
> > actual error reported :-)
> 
> :)
> 
> Several of them come from the driver-provided details.
> 
> The edac-mc core contributes "mc_error", "[Hardware Error]" and "Corrected error".
> The sb-edac driver contributes "memory read error" and "1 error(s)".
> 
> We can easily get rid of "[Hardware Error]" by removing HW_ERR from:
> 
> 	TP_printk(HW_ERR "mem_ctl#%d: %s error %s on memory stick \"%s\" (%s %s %s)",
> 
> Replacing mc_error with something else is not hard, but this is the name of the trace call:
> 
> TRACE_EVENT(mc_error,
> ...
> 
> Maybe the better option is to do s/mc_error/mc_event/g.

HW_ERR is the "official" prefix used by the MCE code in the kernel.
Maybe we can shorten it but it is needed to raise attention when staring
at dmesg output.

Now, since this tracepoint is not dmesg, we don't need it there at all
since we know that trace_mc_error reports memory errors.
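
To illustrate the point about the prefix, here is a minimal userspace sketch
of the quoted format string with and without HW_ERR. It is not EDAC code:
HW_ERR is redefined locally as a stand-in for the kernel macro, and
mc_format(), MC_FMT, the with_prefix flag and the names of the three
bracketed arguments are assumptions made up for the example.

```c
#include <stdio.h>

/* Userspace stand-in for the kernel's HW_ERR prefix (redefined here as an assumption). */
#define HW_ERR "[Hardware Error]: "

/* The format from the quoted TP_printk() line, without any prefix. */
#define MC_FMT "mem_ctl#%d: %s error %s on memory stick \"%s\" (%s %s %s)"

/*
 * Hypothetical helper: with_prefix != 0 gives the dmesg-style line,
 * with_prefix == 0 gives the line without HW_ERR.
 * The last three argument names are guesses at what the three
 * bracketed %s fields carry.
 */
int mc_format(char *buf, size_t len, int with_prefix, int mc, const char *type,
              const char *msg, const char *label, const char *location,
              const char *detail, const char *driver_detail)
{
	if (with_prefix)
		/* Adjacent string literals are concatenated at compile time. */
		return snprintf(buf, len, HW_ERR MC_FMT, mc, type, msg, label,
				location, detail, driver_detail);

	return snprintf(buf, len, MC_FMT, mc, type, msg, label,
			location, detail, driver_detail);
}
```

Since HW_ERR is only glued onto the front by literal concatenation, dropping
it leaves the rest of the line untouched.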

"mc_error" is also not needed.

> The error count msg ("1 error(s)") could be replaced by "count:1".

Is there even a possibility to report more than one error when invoking
trace_mc_error once? If not, simply drop the count completely.

> > We could get rid of one by:
> >  s/Corrected error memory read error/Corrected memory read error/
> 
> This is the hardest possible solution ;) Changing it will cause weird messages
> all over EDAC drivers ;)

I agree with Tony here - repeating "error" a gazillion times in a single
report is a "naaah!"

Here's how it should look:

kworker/u:6-201   [007] .N..   161.136624: [Hardware Error]: memory read on memory stick "DIMM_A1" (type: corrected socket:0 mc:0 channel:0 slot:0 rank:1 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 channel_mask:1)

* count is gone
* MC-drivers shouldn't say "error" when reporting an error
* UE/CE moves into the brackets
* socket moves earlier in the brackets, to keep the whole deal hierarchical.
* drop "err_code" - what is that?
* drop second "socket"
* drop "area": area "DRAM" - are there others?
* what is "channel_mask"?
* move "rank" earlier

Now this is an output format I can get on board with.
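
A rough userspace sketch of that layout, for illustration only. struct
mc_report and mc_report_format() are hypothetical names, not kernel
interfaces, and the field types are guesses. The sketch formats only the part
of the sample line that follows the prefix, with the fields in the
hierarchical order listed above.

```c
#include <stdio.h>

/* Hypothetical container for the fields in the sample line; not a kernel struct. */
struct mc_report {
	const char *msg;		/* e.g. "memory read" */
	const char *label;		/* DIMM label, e.g. "DIMM_A1" */
	const char *type;		/* e.g. "corrected" */
	int socket, mc, channel, slot, rank;
	unsigned long page, offset, syndrome;
	unsigned int grain;
	unsigned int channel_mask;
};

/* Emit the fields from the widest scope (socket) down to the narrowest. */
int mc_report_format(char *buf, size_t len, const struct mc_report *r)
{
	return snprintf(buf, len,
			"%s on memory stick \"%s\" (type: %s socket:%d mc:%d "
			"channel:%d slot:%d rank:%d page:0x%lx offset:0x%lx "
			"grain:%u syndrome:0x%lx channel_mask:%u)",
			r->msg, r->label, r->type, r->socket, r->mc,
			r->channel, r->slot, r->rank, r->page, r->offset,
			r->grain, r->syndrome, r->channel_mask);
}
```

Filling the struct with the values from the sample above should reproduce
that line, minus the trace header and the prefix.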

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551