Date:	Wed, 22 Feb 2012 11:43:24 +0100
From:	Borislav Petkov <bp@...64.org>
To:	"Luck, Tony" <tony.luck@...el.com>
Cc:	Borislav Petkov <bp@...64.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Mauro Carvalho Chehab <mchehab@...hat.com>,
	Ingo Molnar <mingo@...e.hu>,
	edac-devel <linux-edac@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: RAS trace event proto

On Wed, Feb 22, 2012 at 12:58:37AM +0000, Luck, Tony wrote:
> I'm also struggling to understand an end-user use case where you would
> want filtering.  Mauro - can you expand a bit on why someone would just
> want to see the errors from memory controller 1?
> 
> My mental model of the world is that large systems have some background
> noise - a trickle of corrected errors that happen in normal operation.
> User shouldn't care about these errors unless they breach some threshold.
> 
> When something goes wrong, you may see a storm of corrected errors, or
> some uncorrected errors. In either case you'd like to get as much information
> as possible to identify the component that is at fault. I'd definitely like
> to see some structure to the error reporting, so that mining for data patterns
> in a storm isn't hideously platform dependent.

Yep, I'm on the same page here.

> It might be easier to evaluate the competing ideas here with some sample
> output in addition to the code.

Well, to clarify:

When you get a decoded error, you get the same format as what you get in
dmesg, for example:

[ 2666.646070] [Hardware Error]: CPU:64   MC1_STATUS[-|CE|MiscV|PCC|-|CECC]: 0x9a05c00007010011
[ 2666.655003] [Hardware Error]: Instruction Cache Error: L1 TLB multimatch.
[ 2666.655008] [Hardware Error]: cache level: L1, tx: INSN
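For illustration, the bracketed flag list in that dmesg line can be rebuilt from the raw MCi_STATUS value. This is only a sketch: it uses the architectural MCA status bit positions (OVER=62, UC=61, MISCV=59, ADDRV=58, PCC=57) and assumes the AMD-specific CECC/UECC bits at 46/45; the real decoder lives in the EDAC MCE decoding code.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch: rebuild the "[-|CE|MiscV|PCC|-|CECC]" flag list from a raw
 * MCi_STATUS value. Bit positions per the MCA architecture; CECC/UECC
 * (bits 46/45) are AMD-specific and assumed here, not authoritative. */
static void mci_status_flags(uint64_t status, char *buf, size_t len)
{
	snprintf(buf, len, "%s|%s|%s|%s|%s|%s",
		 (status >> 62) & 1 ? "Over"  : "-",	/* OVER  */
		 (status >> 61) & 1 ? "UE"    : "CE",	/* UC    */
		 (status >> 59) & 1 ? "MiscV" : "-",	/* MISCV */
		 (status >> 57) & 1 ? "PCC"   : "-",	/* PCC   */
		 (status >> 58) & 1 ? "AddrV" : "-",	/* ADDRV */
		 (status >> 46) & 1 ? "CECC"  :		/* AMD CECC */
		 (status >> 45) & 1 ? "UECC"  : "-");	/* AMD UECC */
}
```

Feeding it 0x9a05c00007010011 reproduces the flag list from the dmesg line above.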

And with the decoded string tracepoint, that whole report above is
emitted as a single string. If you use trace_mce_record(), you
additionally get the individual MCE fields which we carry to userspace
from struct mce. The hypothetical problem is that userspace cannot use
the tracepoint format to parse the reported fields easily and
unambiguously; instead, it gets a single string which, I admit, is not
that pretty.
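The difference between the two consumption paths can be sketched in userspace C. The names below are illustrative, not the real tracepoint ABI: a structured record is tested with a bit mask, while the decoded-string event forces substring matching that breaks the moment the format changes.

```c
#include <stdint.h>
#include <string.h>

/* Sketch only: a subset of what struct mce carries to userspace.
 * Field names are illustrative, not the actual record layout. */
struct mce_fields {
	uint64_t status;	/* raw MCi_STATUS */
	uint32_t cpu;		/* CPU that caught the error */
	uint32_t bank;		/* MC bank number */
};

/* structured record: one mask test (UC is bit 61 in MCi_STATUS) */
static int corrected_from_fields(const struct mce_fields *m)
{
	return !(m->status & (1ULL << 61));
}

/* decoded-string event: fragile substring matching */
static int corrected_from_string(const char *decoded)
{
	return strstr(decoded, "|CE|") != NULL;
}
```

Both report the sample error above as corrected, but only the first survives a change in the decoded output format.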

Now, the problem with using a single tracepoint for all errors is that
hardly any fields can be shared across them, except maybe the TSC stamp
of when the error happened, the CPU that caught it, and similar less
important details.

IOW, the error format is different for almost every error type, and
there's no marrying them. OTOH, if we start adding a tracepoint for
each error type, we'll hit the other extreme - bloat. So that's also a
no-no.

Maybe the compromise would be to define a single tracepoint per
_hardware_ error reporting scheme. That is, MCA gets its own
tracepoint, PCIe AER gets its own error reporting tracepoint, then
there's a !x86 EDAC one for drivers which don't use MCA for reporting,
and also any other scheme a hw vendor might come up with...
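A rough sketch of what "one tracepoint per reporting scheme" would mean for the record layouts, written as plain structs; every name and field here is illustrative, not a proposed kernel ABI. The point is that each scheme keeps its own small, unambiguous record instead of one bloated super-record:

```c
#include <stdint.h>

/* Hypothetical per-scheme records; names/fields are illustrative. */
enum ras_scheme { RAS_MCA, RAS_AER, RAS_EDAC };

struct mca_record  { uint64_t status, addr, misc; uint32_t cpu, bank; };
struct aer_record  { uint32_t seg, bus, devfn, severity; };
struct edac_record { uint32_t mc_index, csrow, channel; };

/* one tracepoint name per scheme, so userspace knows which layout
 * it is parsing without sniffing the payload */
static const char *ras_scheme_name(enum ras_scheme s)
{
	switch (s) {
	case RAS_MCA:	return "mca_record";
	case RAS_AER:	return "aer_event";
	case RAS_EDAC:	return "edac_event";
	}
	return "unknown";
}
```

Userspace tooling would then subscribe per scheme and get a fixed, parseable format for each.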

This will keep the bloat level to a minimum, keep the TPs apart and
hopefully make all of us happy :).

Opinions?


-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
