Date:	Tue, 15 Oct 2013 11:28:51 +0200
From:	Borislav Petkov <bp@...en8.de>
To:	Chen Gong <gong.chen@...ux.intel.com>
Cc:	tony.luck@...el.com, linux-kernel@...r.kernel.org,
	linux-acpi@...r.kernel.org
Subject: Re: Extended H/W error log driver

On Tue, Oct 15, 2013 at 12:07:31AM -0400, Chen Gong wrote:
> Some errors have multiple sub sections like below:
> 
> [ 1442.070522] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> [ 1442.070528] {2}[Hardware Error]: event severity: corrected
> [ 1442.070531] {2}[Hardware Error]: sub_event[0], severity: corrected
> [ 1442.070534] {2}[Hardware Error]: section_type: memory error
> [ 1442.070537] {2}[Hardware Error]: error_status: 0x0000000000000000
> [ 1442.070539] {2}[Hardware Error]: sub_event[1], severity: corrected
> [ 1442.070541] {2}[Hardware Error]: section_type: memory error
> [ 1442.070543] {2}[Hardware Error]: error_status: 0x0000000000000000

Right, and what do those sub sections mean to the user? Did we have
multiple errors?

It looks like this because each sub section has the memory error section
type, but that alone is not very telling. How about:


[ 1442.070522] {2}[Hardware Error]: APEI GHES id 0: Hardware errors logged
[ 1442.070528] {2}[Hardware Error]: event severity: corrected
[ 1442.070534] {2}[Hardware Error]:  Error 0, type: corrected memory error.
[ 1442.070537] {2}[Hardware Error]:   error_status: 0x0000000000000000
[ 1442.070539] {2}[Hardware Error]:  Error 1, type: corrected memory error.
[ 1442.070543] {2}[Hardware Error]:   error_status: 0x0000000000000000

I think this is much more human readable and understandable :-)

We can even add a hint for the user like:

	"Above errors have been corrected by the hardware and require no further action."

Btw, this is valid for both dmesg and trace event output.
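
To make the proposed layout concrete, here is a minimal user-space sketch in Python, not kernel code. The function name, its parameters and the tuple layout are hypothetical, chosen only for illustration. It also assumes the hint line carries the same "{N}[Hardware Error]: " prefix as the other lines and is printed only when every logged error is corrected; neither point is settled above. Timestamps are left out because printk adds them.

```python
# User-space mock-up of the proposed log layout. This is NOT the kernel
# implementation; all names here are hypothetical.

HINT = ("Above errors have been corrected by the hardware "
        "and require no further action.")


def format_ghes_report(source_id, event_severity, errors, seqno=2):
    """Return the proposed log lines for one GHES event.

    errors is a list of (severity, section_type, error_status) tuples,
    one per sub section of the error record.
    """
    prefix = "{%d}[Hardware Error]: " % seqno
    lines = [
        prefix + "APEI GHES id %d: Hardware errors logged" % source_id,
        prefix + "event severity: %s" % event_severity,
    ]
    for idx, (severity, section_type, status) in enumerate(errors):
        # One summary line per error, then its details indented below it.
        lines.append(prefix + " Error %d, type: %s %s."
                     % (idx, severity, section_type))
        lines.append(prefix + "  error_status: 0x%016x" % status)
    # Assumption: show the hint only if every error was corrected.
    if errors and all(sev == "corrected" for sev, _, _ in errors):
        lines.append(prefix + HINT)
    return lines


if __name__ == "__main__":
    report = format_ghes_report(
        0, "corrected",
        [("corrected", "memory error", 0),
         ("corrected", "memory error", 0)],
    )
    print("\n".join(report))
```

Putting severity and section type on one line per error keeps each error to a single summary line, which is the point of the proposal: the reader sees "corrected memory error" before anything else.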

In my experience so far, people just scream: "Look, I just had
an MCE" without even reading what it says. And this just upsets support
people for no valid reason at all.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.