Message-ID: <20081227155019.GA15493@elte.hu>
Date: Sat, 27 Dec 2008 16:50:19 +0100
From: Ingo Molnar <mingo@...e.hu>
To: Andi Kleen <ak@...ux.intel.com>,
Thomas Gleixner <tglx@...utronix.de>
Cc: linux-kernel@...r.kernel.org, "H. Peter Anvin" <hpa@...or.com>
Subject: x86/mce merge, integration hickup + crash, design thoughts
hi,
today i (belatedly ...) started looking into the status of the tip/x86/mce
branch, and merged it into tip/master as a first step.
firstly, there's a small complication - the merged tree triggers this crash
with the attached config:
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
RCU-based detection of stalled CPUs is enabled.
------------[ cut here ]------------
Kernel BUG at c0b2a7e8 [verbose debug info unavailable]
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
last sysfs file:
Modules linked in:
Pid: 0, comm: swapper Not tainted (2.6.28-tip #12389)
EIP: 0060:[<c0b2a7e8>] EFLAGS: 00010002 CPU: 0
EIP is at native_init_IRQ+0x3a8/0x3d0
EAX: c0108e00 EBX: c0b1ffb8 ECX: c0ba8500 EDX: 00603b1c
ESI: c0b1ffb4 EDI: c0b14000 EBP: c0b1ffc4 ESP: c0b1ff64
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 0, ti=c0b1e000 task=c0a703c0 task.ti=c0b1e000)
Stack:
c0a74d20 c0b1ff70 c015547b c0b1ff94 00603b1c c0108e00 00603b50 c0108e00
00603ae8 c0108e00 00603a84 c0108e00 00603a50 c0108e00 00603a1c c0108e00
006039e8 c0108e00 006039b4 c0108e00 00603978 c0108e00 00000001 c0b673a0
Call Trace:
[<c015547b>] ? trace_hardirqs_off+0xb/0x10
[<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
[<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
[<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
[<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
[<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
[<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
[<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
[<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
[<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
[<c0b266ef>] ? start_kernel+0x1af/0x300
[<c0b26220>] ? unknown_bootoption+0x0/0x220
[<c0b2606b>] ? i386_start_kernel+0x6b/0x80
Code: 18 f6 05 98 85 b1 c0 01 75 0f ba 20 23 a7 c0 b8 0d 00 00 00 e8 fa 4d 64 ff 83 c4 58 5b 5e 5d c3 8d 76 00 0f 0b eb fe 0f 0b eb fe <0f> 0b eb fe 0f 0b 89 f6 eb fc 0f 0b eb fe 0f 0b eb fe 0f 0b 8d
EIP: [<c0b2a7e8>] native_init_IRQ+0x3a8/0x3d0 SS:ESP 0068:c0b1ff64
---[ end trace 4eaa2a86a8e2da22 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
------------[ cut here ]------------
i've pushed out the broken branch here:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git tmp.x86.mce.broken
and have attached the bad config. I think this must be some integration
artifact - recent changes in the x86 tree interacting with the x86/mce
changes. Haven't had time to track it down to a specific commit yet.
If this is fixed i'll merge it into tip/master and will push it out to
linux-next as well as the code looks good otherwise (with a few
reservations, see below).
I also looked at the code itself, and generally it's pretty nice. I have a
few general observations though that need to be addressed - i think these
can all be solved in this (elongated) merge window so that we can merge it
all into v2.6.29:
We really need to get rid of /dev/mcelog. It's a quirky binary logging
facility not available on 32-bit on current kernels and it has a couple of
limitations:
- it squeezes all MCE errors from the whole system into a small,
32-entry ringbuffer.
- it puts all the MCE logging info into an intermediary binary log
record format: 'struct mce' - just for userspace to in essence
printf out those entries with minimal post-processing. The fact that
we squeeze all information into a fixed-size binary record makes it
hard to extend and complicates the code needlessly.
- these design aspects are also quite harmful to usability: by
default all MCEs are fatal currently (pre-Nehalem anyway), so
/dev/mcelog will only be used if a user goes out on a limb to
configure it and sets the tolerant flag.
A far more useful design for handling MCE events would be to feed them
into printk logging. So instead of printing such rather cryptic error
messages:
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 6 MISC 202d ADDR ffeef740
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
and expecting people to run mcelog, we should print something like this in
plain text:
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 89a560bb249
ADDR 1dfa49690
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 2021
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9410c00020080a13 MCGSTATUS 0
straight from the kernel. This means that the MCEs will make a lot more
sense at a glance - and the user can figure out the suspected trouble
area, without having to find some other box to run mcelog on, etc. We can
eliminate the user-space mcelog utility/daemon component altogether - it
buys us little but needless complexity and inflexibility.
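To make the idea concrete, here is a rough userspace sketch of what decoding a status word straight into text could look like. It is not the kernel's code: struct mce_sample and mce_format() are made-up names for illustration, and only the architectural flag bits at the top of MCi_STATUS are decoded (bank-specific details such as the ECC syndrome are left out). In the kernel the snprintf() calls would be printk()s.

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural flag bits at the top of an MCi_STATUS value. */
#define MCI_STATUS_VAL   (1ULL << 63)   /* record is valid */
#define MCI_STATUS_OVER  (1ULL << 62)   /* an earlier error was overwritten */
#define MCI_STATUS_UC    (1ULL << 61)   /* error was not corrected */
#define MCI_STATUS_ADDRV (1ULL << 58)   /* address field is valid */
#define MCI_STATUS_PCC   (1ULL << 57)   /* processor context corrupt */

/* Hypothetical record, for illustration only (not the kernel's struct mce). */
struct mce_sample {
    unsigned int cpu;
    unsigned int bank;
    uint64_t status;
    uint64_t addr;
};

/*
 * Format one event as a single plain-text line.
 * Returns the number of characters written, 0 if the record is not valid,
 * or the snprintf() result if the first part did not fit.
 */
int mce_format(char *buf, size_t len, const struct mce_sample *m)
{
    int n;

    if (len == 0)
        return 0;
    buf[0] = '\0';
    if (!(m->status & MCI_STATUS_VAL))
        return 0;

    n = snprintf(buf, len, "CPU %u BANK %u: %s error%s%s, STATUS %016llx",
                 m->cpu, m->bank,
                 (m->status & MCI_STATUS_UC) ? "uncorrected" : "corrected",
                 (m->status & MCI_STATUS_OVER) ? ", overflow" : "",
                 (m->status & MCI_STATUS_PCC) ? ", processor context corrupt" : "",
                 (unsigned long long)m->status);
    if (n < 0 || (size_t)n >= len)
        return n;

    /* Only print the address when the hardware says it is meaningful. */
    if (m->status & MCI_STATUS_ADDRV)
        n += snprintf(buf + n, len - n, " ADDR %llx",
                      (unsigned long long)m->addr);
    return n;
}
```

Fed the STATUS and ADDR values from the example above, this reports a corrected error with a valid address, which matches the "corrected ecc error" line in the decoded output.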
If we want to enable userspace to capture MCE events, then it must be done
in a way that benefits the whole kernel, not just x86: a structured
logging facility that is in essence a printk variant and is ASCII driven.
Such event sources should be discoverable, and only 'aware' printouts
should go into this new facility (not all printks). Demultiplexing should
be easy and well-defined.
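As a sketch of how simple the demultiplexing could be on the consumer side: assume, purely for illustration, that every 'aware' printout starts with a "<source>: " tag. Neither that line format nor the event_payload() helper exists anywhere today; both are assumptions made up for this example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the payload of `line` if it is tagged with exactly
 * "<source>: ", or NULL if the line belongs to some other event source.
 * The tag format is an assumption for this sketch.
 */
const char *event_payload(const char *line, const char *source)
{
    size_t n = strlen(source);

    if (strncmp(line, source, n) != 0)
        return NULL;
    /* Require the separator so that "mce" does not match "mcelog: ...". */
    if (line[n] != ':' || line[n + 1] != ' ')
        return NULL;
    return line + n + 2;
}
```

A log reader would call event_payload(line, "mce") on each line and hand the non-NULL results to whatever wants MCE events, ignoring the rest.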
I.e. we could use this opportunity of the MCE code unification to bring
the code to the next level - and not prolong the broken concepts of the
past.
I'd be glad to help out with any portion of this, it should be easy to
solve and it will clearly improve the code. For .29 we could just do a raw
printk based approach with no decoding just yet, and layer smart decoding
and structured logging on top of it for .30.
Hm?
Ingo
View attachment "config-Sat_Dec_27_12_47_38_CET_2008.bad" of type "text/plain" (60003 bytes)