linux-kernel - RE: MCE Bug?


Message-ID: <3908561D78D1C84285E8C5FCA982C28F32A9E177@ORSMSX114.amr.corp.intel.com>
Date:	Wed, 17 Jun 2015 23:53:53 +0000
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Borislav Petkov <bp@...e.de>, "Wang, Rui Y" <rui.y.wang@...el.com>
CC:	"Chen, Gong" <gong.chen@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: MCE Bug?

> if you want to give those changes a run, I've uploaded them here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git#tip-ras

Latest experiments show that checking keventd_up() before calling schedule_work()
sometimes helps ... but mostly only when I fake some early logs from low-numbered CPUs.  I added
some traces to the real case of a left-over fatal error and got this splat:

[    0.331551] smpboot: CPU0: Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz (fam: 06, model: 3f, stepping: 04)
[    0.342117] Performance Events: PEBS fmt2+, 16-deep LBR, Haswell events, full-width counters, Intel PMU driver.
[    0.353471] ... version:                3
[    0.357948] ... bit width:              48
[    0.362523] ... generic registers:      4
[    0.367000] ... value mask:             0000ffffffffffff
[    0.372935] ... max period:             0000ffffffffffff
[    0.378870] ... fixed-purpose events:   3
[    0.383347] ... event mask:             000000070000000f
[    0.392357] x86: Booting SMP configuration:
[    0.397031] .... node  #0, CPUs:          #1
[    0.423373] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[    0.432705]    #2   #3   #4   #5   #6   #7   #8   #9  #10  #11  #12  #13  #14  #15  #16  #17
[    0.706878] .... node  #1, CPUs:    #18  #19  #20  #21  #22  #23  #24  #25  #26  #27  #28  #29  #30  #31  #32  #33  #34  #35
[    1.094625] .... node  #2, CPUs:    #36
[    1.112958] mcelog: cpu 36 bank 8 status be00000000010090
[    1.119201] mcelog() stashed at entry=0
[    1.203602] mce: [Hardware Error]: Machine check events logged
[    1.220313]   #37
[    1.220412] BUG: unable to handle kernel 
[    1.226954]   #38
[    1.229107] NULL pointer dereference at 0000000000000008
[    1.235052] IP: [<ffffffff810980a1>] process_one_work+0x31/0x420
[    1.236829]   #39PGD 0 
[    1.244558] Oops: 0000 [#1] SMP 
[    1.248189] Modules linked in:
[    1.251617] CPU: 36 PID: 263 Comm: kworker/36:0 Not tainted 4.1.0-rc8 #9
[    1.259100]   #40
[    1.259100] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0065.R01.1505011640 05/01/2015
[    1.272832]   #41
[    1.272833] task: ffff88181c1f4470 ti: ffff88181bd24000 task.ti: ffff88181bd24000
[    1.283350] RIP: 0010:[<ffffffff810980a1>] [    1.286433]   #42
 [<ffffffff810980a1>] process_one_work+0x31/0x420
[    1.294976] RSP: 0000:ffff88181bd27e08  EFLAGS: 00010046

I.e. we die on the first attempt to log ... but that attempt comes a long way into bringing up all the CPUs.
CPU#36 is the first one on socket 2 (counting 0, 1, 2, 3).
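
For reference, the bank 8 status word in the log above (be00000000010090) can be picked apart with a few lines of Python. This is only a sketch: it assumes the architectural IA32_MCi_STATUS flag positions documented in the Intel SDM (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57), and the helper name decode_mci_status is made up for illustration, not something from the kernel.

```python
# Assumed architectural flag bit positions in IA32_MCi_STATUS (per the Intel SDM).
FLAGS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
         "MISCV": 59, "ADDRV": 58, "PCC": 57}


def decode_mci_status(status):
    """Hypothetical helper: split a 64-bit MCi_STATUS value into its parts."""
    set_flags = [name for name, bit in FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": set_flags,
        "mca_code": status & 0xFFFF,            # bits 15:0, MCA error code
        "model_code": (status >> 16) & 0xFFFF,  # bits 31:16, model-specific code
    }


decoded = decode_mci_status(0xBE00000000010090)
print(decoded["flags"], hex(decoded["mca_code"]), hex(decoded["model_code"]))
```

With UC and PCC both set and OVER clear, the value is consistent with a single uncorrected, context-corrupting error, i.e. the left-over fatal error described above.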

-Tony