Message-ID: <20110324173259.64a30b0b@pedra>
Date:	Thu, 24 Mar 2011 17:32:59 -0300
From:	Mauro Carvalho Chehab <mchehab@...hat.com>
To:	unlisted-recipients:; (no To-header on input)
Cc:	Linux Edac Mailing List <linux-edac@...r.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Tony Luck <tony.luck@...el.com>,
	Borislav Petkov <borislav.petkov@....com>
Subject: [PATCH RFC 0/2] Hardware Anomaly Report Mechanism (HARM)

These RFC patches work toward unifying the several hardware event
mechanisms found in the Linux kernel into one. Specifically, they
introduce a replacement mechanism that reports the errors covered by
both the EDAC and MCE log event mechanisms in a unified way via the
perf/trace subsystem.

It is the first concrete result of the EDAC/MCE mini-summit and the 
Hardware Error report BoF that happened during LPC/2010.

For now, only the EDAC traces were mapped, as a proof of concept. If
this approach is OK, Tony should start working on the MCE part, for
Intel devices.

The AMD MCE driver already reports MCE errors as events, but it just
replicates what mcelog does. So, I think we'll need further discussion
in order to migrate the trace events into something more palatable to
end users (e.g. decoding the error events inside the kernel).

As a general rule, all events provide a log like:

mce#0: Corrected Error <foo> at label "bar" (some tech info)

The information before the parentheses specifies the type of the error
and the silk-screen label of the affected device (like "DIMM 1"). So,
to recover a machine that has too many errors, all the system admin
needs to do is replace DIMM 1.

The information inside the parentheses is what has meaning to the OEM
provider (grain, syndrome, row, channel, etc.).
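
As a rough illustration of that split, a userspace tool could separate
the admin-facing part of a line from the OEM-facing part. The following
is a minimal sketch in Python, assuming the line layout shown in the
example above; EVENT_RE, HwEvent and parse_event are hypothetical names
and are not part of these patches.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, taken from the example line in this message:
#   <source>: <severity> Error <description> at label "<label>" (<details>)
EVENT_RE = re.compile(
    r'^(?P<source>\S+): '
    r'(?P<severity>\w+) Error (?P<description>.*?) '
    r'at label "(?P<label>[^"]*)" '
    r'\((?P<details>.*)\)$'
)


class HwEvent(NamedTuple):
    source: str       # e.g. "mce#0"
    severity: str     # e.g. "Corrected"
    description: str  # error type, for the system admin
    label: str        # silk-screen label, for the system admin
    details: str      # technical info, for the OEM provider


def parse_event(line: str) -> Optional[HwEvent]:
    """Return the parsed event, or None if the line has another layout."""
    m = EVENT_RE.match(line.strip())
    if m is None:
        return None
    return HwEvent(**m.groupdict())


event = parse_event('mce#0: Corrected Error <foo> at label "bar" (some tech info)')
print(event.label)    # bar
print(event.details)  # some tech info
```

Only the label is needed to decide which part to replace; the details
field can be passed along untouched to whoever needs it.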

TODO:

- Use the same mechanism for MCE;

- Have some userspace daemon to collect those events and distribute
  them to syslog, remote consoles, network management systems, etc;

- Have persistence to avoid losing events between the start of
  collection and the start of something monitoring them.
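
To make the daemon item a bit more concrete, here is a minimal sketch in
Python of the forwarding step only, assuming events arrive as text lines
in the format shown earlier. Where the lines come from is left open. The
names PRIORITIES and forward_events are hypothetical, and the severity
words other than "Corrected" are assumptions, since this message only
shows a corrected error.

```python
import syslog
from typing import Callable, Iterable

# Assumed mapping from the severity word in an event line to a syslog
# priority. Only "Corrected" appears in this message; the others are guesses.
PRIORITIES = {
    "Corrected": syslog.LOG_WARNING,
    "Uncorrected": syslog.LOG_ERR,
    "Fatal": syslog.LOG_CRIT,
}


def forward_events(
    lines: Iterable[str],
    emit: Callable[[int, str], None] = syslog.syslog,
) -> int:
    """Forward every '<source>: <severity> Error ...' line; return the count."""
    forwarded = 0
    for line in lines:
        text = line.strip()
        if " Error " not in text:
            continue  # not a hardware event line
        # The severity is the last word before " Error ".
        severity = text.split(" Error ", 1)[0].rsplit(" ", 1)[-1]
        emit(PRIORITIES.get(severity, syslog.LOG_NOTICE), text)
        forwarded += 1
    return forwarded
```

Passing a different emit callable would let the same loop feed a remote
console or a network management system instead of syslog.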

These patches compile fine, but I was not able to test the event
collection in the second patch, as I'm currently having trouble
injecting errors on my hardware, probably due to a BIOS upgrade. I'm
working on it, so I'll post a version 2 if needed, after testing it.

It makes sense to apply the first patch as soon as possible and send it
upstream, as it just moves some EDAC structures to include/linux/edac.h,
where they can also be used by the HARM mechanism. There are no
functional changes in it, and not applying it would mean having to
rebase it if the EDAC MCI structures change.

Mauro Carvalho Chehab (2):
  edac: Move edac main structs to include/linux/edac.h
  events/hw_event: Create a Hardware Anomaly Report Mecanism (HARM)

 drivers/edac/edac_core.h        |  354 +--------------------------------------
 drivers/edac/edac_mc.c          |   32 ++++
 include/linux/edac.h            |  354 +++++++++++++++++++++++++++++++++++++++
 include/trace/events/hw_event.h |  322 +++++++++++++++++++++++++++++++++++
 4 files changed, 709 insertions(+), 353 deletions(-)
 create mode 100644 include/trace/events/hw_event.h
