linux-kernel - Re: x86/mce merge, integration hickup + crash, design thoughts

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b3ece790901121402k4e84f8e7k7993b5fda90456cb@mail.gmail.com>
Date:	Mon, 12 Jan 2009 14:02:51 -0800
From:	"Tim Hockin" <thockin@...il.com>
To:	"Ingo Molnar" <mingo@...e.hu>
Cc:	"Andi Kleen" <ak@...ux.intel.com>,
	"Thomas Gleixner" <tglx@...utronix.de>,
	linux-kernel@...r.kernel.org, "H. Peter Anvin" <hpa@...or.com>,
	priyankag@...gle.com, "Aaron Durbin" <adurbin@...il.com>,
	"Duncan Laurie" <dlaurie@...gle.com>
Subject: Re: x86/mce merge, integration hickup + crash, design thoughts

On Sat, Dec 27, 2008 at 7:50 AM, Ingo Molnar <mingo@...e.hu> wrote:
>
> We really need to get rid of /dev/mcelog. It's a quirky binary logging
> facility not available on 32-bit on current kernels and it has a couple of
> limitations:

It's not quirky, it's structured.  Until and unless there is a better,
structured interface, this needs to stay.

>  - it squeezes all MCE errors from the whole system into a small,
>    32-entry ringbuffer.

Yes 32 is small, but not really a big problem in practice.  The MCE daemon
(http://mcedaemon.googlecode.com) goes a long way towards making this
a non-issue.  Every distro should ship mced.

>  - it puts all the MCE logging info into an intermediary binary log
>    record format: 'struct mce' - just for userspace to in essence
>    printf out those entries with minimal post-processing. The fact that
>    we squeeze all information into a fixed-size binary record makes it
>    hard to extend and complicates the code needlessly.

This is not true, we do EXTENSIVE post-processing of MCEs.  Yes it is hard
to extend, but that's not the same as saying it is useless.

>  - these design aspects are also quite harmful to usability: by
>    default all MCEs are fatal currently (pre-Nehalem anyway), so
>    /dev/mcelog will only be used if a user goes out on a limb to
>    configure it and sets the tolerant flag.

Not true.  The vast vast vast majority of MCEs are corrected, as per our
experience, and I suspect we've got more experience with MCEs than just
about any other consumer.

> A far more useful design for handling MCE events would be to feed them
> into printk logging. So instead of printing such rather cryptic error
> messages:
>
>   MCE 0
>   HARDWARE ERROR. This is *NOT* a software problem!
>   Please contact your hardware vendor
>   CPU 0 BANK 6 MISC 202d ADDR ffeef740
>   This is not a software problem!
>   Run through mcelog --ascii to decode and contact your hardware vendor
>
> and expecting people to run mcelog, we should print plain-text something
> like:
>
>   MCE 0
>   HARDWARE ERROR. This is *NOT* a software problem!
>   Please contact your hardware vendor
>   CPU 1 4 northbridge TSC 89a560bb249
>   ADDR 1dfa49690
>     Northbridge Chipkill ECC error
>     Chipkill ECC syndrome = 2021
>          bit46 = corrected ecc error
>     bus error 'local node response, request didn't time out
>         generic read mem transaction
>         memory access, level generic'
>   STATUS 9410c00020080a13 MCGSTATUS 0

NO!  This moves in the completely wrong direction.  This output is just
about impossible to parse sanely.  And don't get me started about how much
of a pain it is to encode all of the vendor-centric and per-platform
specifics so that it decodes all the useful bits.

I promise you that there are things that are useful to us that the kernel
will not parse and print.

> straight from the kernel. This means that the MCEs will make a lot more
> sense at a glance - and the user can figure out the suspected trouble
> area, without having to find some other box to run mcelog on, etc. We can
> eliminate the user-space mcelog utility/daemon component altogether - it
> buys us little but needless complexity and inflexibility.

I disagree totally.  Putting the logic for decoding MCEs into the kernel
adds a large amount of complex, hard to maintain code precisely where it
is most destructive.  Any sort of decoding of this data in-kernel is
probably a mistake (yes, I am looking at you EDAC).

It would be much cleaner to instead make the kernel really good at
publishing errors and let userspace decode them.  A naive end user won't
know the difference - they will see a message on the syslog or the
console.  Power users will be able to do better things.

> If we want to enable userspace to capture MCE events, then it must be done
> in a way that benefits the whole kernel, not just x86: a structured
> logging facility that is in essence a printk variant and is ASCII driven.

I would be fine with this (though I disagree that it *must* be ASCII).  I
don't care how the raw data comes out of the kernel, just that it is the
raw data, not some half-assedly parsed data.

> Such event sources should be discoverable, and only 'aware' printouts
> should go into this new facility (not all printks). Demultiplexing should
> be easy and well-defined.

I agree 1000% with this - it is something I have wanted for a long time,
and that some of my team are starting to look at again.

> I.e. we could use this opportunity of the MCE code unification to bring
> the code to the next level - and not prolongue to broken concepts of the
> past.

Don't dicard the past until the present can stand on it's own.  Please.

> I'd be glad to help out with any portion of this, it should be easy to
> solve and it will clearly improve the code. For .29 we could just do a raw
> printk based approach with no decoding just yet, and layer smart decoding
> and structured logging for .30.

Please leave /dev/mcelog as a CONFIG option until such time as an
appropriate replacement is *done*.  We rely on it.

Tim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/