linux-kernel - x86/mce merge, integration hickup + crash, design thoughts

Message-ID: <20081227155019.GA15493@elte.hu>
Date:	Sat, 27 Dec 2008 16:50:19 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Andi Kleen <ak@...ux.intel.com>,
	Thomas Gleixner <tglx@...utronix.de>
Cc:	linux-kernel@...r.kernel.org, "H. Peter Anvin" <hpa@...or.com>
Subject: x86/mce merge, integration hickup + crash, design thoughts


hi,

today i (belatedly ...) started looking into the status of the tip/x86/mce 
branch, and merged it into tip/master as a first step.

firstly there's a small complication, it triggers this crash with the 
attached config:

  Enabling fast FPU save and restore... done.
  Enabling unmasked SIMD FPU exception support... done.
  Initializing CPU#0
  RCU-based detection of stalled CPUs is enabled.
  ------------[ cut here ]------------
  Kernel BUG at c0b2a7e8 [verbose debug info unavailable]
  invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
  last sysfs file: 
  Modules linked in:
  
  Pid: 0, comm: swapper Not tainted (2.6.28-tip #12389) 
  EIP: 0060:[<c0b2a7e8>] EFLAGS: 00010002 CPU: 0
  EIP is at native_init_IRQ+0x3a8/0x3d0
  EAX: c0108e00 EBX: c0b1ffb8 ECX: c0ba8500 EDX: 00603b1c
  ESI: c0b1ffb4 EDI: c0b14000 EBP: c0b1ffc4 ESP: c0b1ff64
   DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
  Process swapper (pid: 0, ti=c0b1e000 task=c0a703c0 task.ti=c0b1e000)
  Stack:
   c0a74d20 c0b1ff70 c015547b c0b1ff94 00603b1c c0108e00 00603b50 c0108e00
   00603ae8 c0108e00 00603a84 c0108e00 00603a50 c0108e00 00603a1c c0108e00
   006039e8 c0108e00 006039b4 c0108e00 00603978 c0108e00 00000001 c0b673a0
  Call Trace:
   [<c015547b>] ? trace_hardirqs_off+0xb/0x10
   [<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
   [<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
   [<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
   [<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
   [<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
   [<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
   [<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
   [<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
   [<c0108e00>] ? mwait_idle_with_hints+0x30/0x60
   [<c0b266ef>] ? start_kernel+0x1af/0x300
   [<c0b26220>] ? unknown_bootoption+0x0/0x220
   [<c0b2606b>] ? i386_start_kernel+0x6b/0x80
  Code: 18 f6 05 98 85 b1 c0 01 75 0f ba 20 23 a7 c0 b8 0d 00 00 00 e8 fa 4d 64 ff 83 c4 58 5b 5e 5d c3 8d 76 00 0f 0b eb fe 0f 0b eb fe <0f> 0b eb fe 0f 0b 89 f6 eb fc 0f 0b eb fe 0f 0b eb fe 0f 0b 8d 
  EIP: [<c0b2a7e8>] native_init_IRQ+0x3a8/0x3d0 SS:ESP 0068:c0b1ff64
  ---[ end trace 4eaa2a86a8e2da22 ]---
  Kernel panic - not syncing: Attempted to kill the idle task!
  ------------[ cut here ]------------

i've pushed out the broken branch here:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git tmp.x86.mce.broken

and have attached the bad config. I think this must be some integration 
artifact - recent changes in the x86 tree interacting with the x86/mce 
changes. Haven't had time to track it down to a specific commit.

If this is fixed i'll merge it into tip/master and will push it out to 
linux-next as well, since the code looks good otherwise (with a few 
reservations, see below).

I also looked at the code itself, and generally it's pretty nice. I have a 
few general observations though that need to be addressed - i think these 
can all be solved in this (elongated) merge window so that we can merge it 
all into v2.6.29:

We really need to get rid of /dev/mcelog. It's a quirky binary logging 
facility not available on 32-bit on current kernels and it has a couple of 
limitations:

  - it squeezes all MCE errors from the whole system into a small,
    32-entry ringbuffer. 

  - it puts all the MCE logging info into an intermediary binary log 
    record format: 'struct mce' - just for userspace to in essence
    printf out those entries with minimal post-processing. The fact that 
    we squeeze all information into a fixed-size binary record makes it 
    hard to extend and complicates the code needlessly.

  - these design aspects are also quite harmful to usability: by
    default all MCEs are fatal currently (pre-Nehalem anyway), so 
    /dev/mcelog will only be used if a user goes out on a limb to 
    configure it and sets the tolerant flag.
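
To make the first limitation concrete, here is a toy model of a fixed 
32-entry log. It is only a sketch, not the kernel's code: the names 
toy_record, toy_log and toy_log_add are hypothetical, and dropping new 
records once the log is full (rather than overwriting old ones) is an 
assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_LEN 32	/* entry count quoted in the list above */

/* Hypothetical fixed-size record, standing in for 'struct mce'. */
struct toy_record {
	uint64_t status;
	uint64_t addr;
};

/* Hypothetical system-wide log shared by every CPU and bank. */
struct toy_log {
	struct toy_record entry[LOG_LEN];
	size_t next;		/* index of the next free slot */
	unsigned long dropped;	/* records lost because the log was full */
};

/* Store one record; return 0 on success, -1 if it had to be dropped. */
static int toy_log_add(struct toy_log *log, struct toy_record rec)
{
	assert(log != NULL);

	if (log->next >= LOG_LEN) {
		/* Assumption: a full log discards the new record. */
		log->dropped++;
		return -1;
	}
	log->entry[log->next++] = rec;
	return 0;
}
```

Feeding 40 records into an empty toy_log leaves 32 stored and 8 counted 
as dropped, no matter how many CPUs or banks produced them.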

A far more useful design for handling MCE events would be to feed them 
into printk logging. So instead of printing such rather cryptic error 
messages:

   MCE 0
   HARDWARE ERROR. This is *NOT* a software problem!
   Please contact your hardware vendor
   CPU 0 BANK 6 MISC 202d ADDR ffeef740
   This is not a software problem!
   Run through mcelog --ascii to decode and contact your hardware vendor

and expecting people to run mcelog, we should print plain-text something 
like:

   MCE 0
   HARDWARE ERROR. This is *NOT* a software problem!
   Please contact your hardware vendor
   CPU 1 4 northbridge TSC 89a560bb249
   ADDR 1dfa49690
     Northbridge Chipkill ECC error
     Chipkill ECC syndrome = 2021
          bit46 = corrected ecc error
     bus error 'local node response, request didn't time out
         generic read mem transaction
         memory access, level generic'
   STATUS 9410c00020080a13 MCGSTATUS 0

straight from the kernel. This means that the MCEs will make a lot more 
sense at a glance - and the user can figure out the suspected trouble 
area, without having to find some other box to run mcelog on, etc. We can 
eliminate the user-space mcelog utility/daemon component altogether - it 
buys us little but needless complexity and inflexibility.
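
As a rough illustration of how little is needed for a first level of 
decoding, the sketch below turns the top flag bits of a STATUS value into 
text. The bit positions follow the architectural MCi_STATUS layout; the 
function name mce_status_to_text and the wording of the output are 
hypothetical, and the sketch uses snprintf in userspace where kernel code 
would use printk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Architectural MCi_STATUS flag bits. */
#define MCI_STATUS_VAL		(1ULL << 63)	/* a valid error is logged */
#define MCI_STATUS_OVER		(1ULL << 62)	/* an earlier error was lost */
#define MCI_STATUS_UC		(1ULL << 61)	/* error was not corrected */
#define MCI_STATUS_EN		(1ULL << 60)	/* reporting was enabled */
#define MCI_STATUS_MISCV	(1ULL << 59)	/* MISC register is valid */
#define MCI_STATUS_ADDRV	(1ULL << 58)	/* ADDR register is valid */
#define MCI_STATUS_PCC		(1ULL << 57)	/* processor context corrupt */

/*
 * Hypothetical helper: write a one-line description of the flag bits
 * into buf. Returns what snprintf returns.
 */
static int mce_status_to_text(uint64_t status, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	if (!(status & MCI_STATUS_VAL))
		return snprintf(buf, len, "no valid error logged");

	return snprintf(buf, len, "%s error%s%s%s%s",
			(status & MCI_STATUS_UC) ? "uncorrected" : "corrected",
			(status & MCI_STATUS_OVER) ? ", overflow" : "",
			(status & MCI_STATUS_PCC) ? ", context corrupt" : "",
			(status & MCI_STATUS_ADDRV) ? ", ADDR valid" : "",
			(status & MCI_STATUS_MISCV) ? ", MISC valid" : "");
}
```

For the STATUS value in the example above, 9410c00020080a13, this yields 
"corrected error, ADDR valid". The bank-specific parts (Chipkill syndrome, 
bus error details) would need per-vendor tables on top of this.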

If we want to enable userspace to capture MCE events, then it must be done 
in a way that benefits the whole kernel, not just x86: a structured 
logging facility that is in essence a printk variant and is ASCII driven. 

Such event sources should be discoverable, and only 'aware' printouts 
should go into this new facility (not all printks). Demultiplexing should 
be easy and well-defined.
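
One possible shape for such a record, purely as a sketch: an ASCII line 
that starts with a marker and a source name, followed by key=value pairs. 
Nothing here is an existing kernel interface. The "event:" prefix, the 
field names and both functions are assumptions made for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical marker that separates 'aware' records from plain printks. */
#define EVENT_PREFIX "event:"

/* Format one MCE as a structured ASCII line (hypothetical layout). */
static int mce_event_format(char *buf, size_t len, unsigned int cpu,
			    unsigned int bank, uint64_t status, uint64_t addr)
{
	assert(buf != NULL && len > 0);

	return snprintf(buf, len,
			EVENT_PREFIX "mce cpu=%u bank=%u status=%016" PRIx64
			" addr=%" PRIx64, cpu, bank, status, addr);
}

/*
 * Demultiplexing step: copy the source name of a structured line into
 * out. Returns 0 on success, -1 if the line is an ordinary message or
 * the name does not fit.
 */
static int event_source(const char *line, char *out, size_t outlen)
{
	size_t n;

	assert(line != NULL && out != NULL && outlen > 0);

	if (strncmp(line, EVENT_PREFIX, strlen(EVENT_PREFIX)) != 0)
		return -1;

	line += strlen(EVENT_PREFIX);
	n = strcspn(line, " ");		/* the name ends at the first space */
	if (n == 0 || n >= outlen)
		return -1;

	memcpy(out, line, n);
	out[n] = '\0';
	return 0;
}
```

A consumer would call event_source() on each line and route on the name, 
so ordinary printk output such as "Initializing CPU#0" is simply skipped.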

I.e. we could use this opportunity of the MCE code unification to bring 
the code to the next level - and not prolong the broken concepts of the 
past.

I'd be glad to help out with any portion of this, it should be easy to 
solve and it will clearly improve the code. For .29 we could just do a raw 
printk based approach with no decoding just yet, and layer smart decoding 
and structured logging on top of it for .30.

Hm?

	Ingo

View attachment "config-Sat_Dec_27_12_47_38_CET_2008.bad" of type "text/plain" (60003 bytes)