linux-kernel - Re: [UNTESTED PATCH] x86, mce: Avoid double entry of deferred errors into the genpool.

Message-ID: <20151123175932.GG5134@pd.tnic>
Date:	Mon, 23 Nov 2015 18:59:32 +0100
From:	Borislav Petkov <bp@...en8.de>
To:	"Luck, Tony" <tony.luck@...el.com>
Cc:	"Chen, Gong" <gong.chen@...ux.intel.com>,
	"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [UNTESTED PATCH] x86, mce: Avoid double entry of deferred errors
 into the genpool.

On Thu, Nov 19, 2015 at 09:39:20PM +0100, Borislav Petkov wrote:
> On Thu, Nov 19, 2015 at 07:33:58PM +0000, Luck, Tony wrote:
> > > Applied, thanks.
> > 
> > Did you test it (note the "UNTESTED" in the subject!).  My usual system for this is getting upgrades and being
> > flaky at the moment.
> 
> Bah, it builds, should be enough. Ship it. :-)
> 
> Lemme get a box...

Here are some results:

# grep . /sys/kernel/debug/apei/einj/*
/sys/kernel/debug/apei/einj/available_error_type:0x00000002     Processor Uncorrectable non-fatal
/sys/kernel/debug/apei/einj/available_error_type:0x00000008     Memory Correctable
/sys/kernel/debug/apei/einj/available_error_type:0x00000010     Memory Uncorrectable non-fatal
grep: /sys/kernel/debug/apei/einj/error_inject: Permission denied
/sys/kernel/debug/apei/einj/error_type:0x0
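
Side note for anyone reading along: the values in available_error_type
are single bits of the EINJ error-type mask. A minimal sketch of decoding
such a mask follows. The bit-to-name table is written from memory of the
kernel's EINJ documentation, so treat it as an assumption; the names
EINJ_ERROR_TYPES and decode_error_types are made up for illustration.

```python
# Assumed bit layout of the EINJ error-type mask (verify against the
# EINJ documentation of your kernel before relying on it).
EINJ_ERROR_TYPES = {
    0x001: "Processor Correctable",
    0x002: "Processor Uncorrectable non-fatal",
    0x004: "Processor Uncorrectable fatal",
    0x008: "Memory Correctable",
    0x010: "Memory Uncorrectable non-fatal",
    0x020: "Memory Uncorrectable fatal",
    0x040: "PCI Express Correctable",
    0x080: "PCI Express Uncorrectable non-fatal",
    0x100: "PCI Express Uncorrectable fatal",
    0x200: "Platform Correctable",
    0x400: "Platform Uncorrectable non-fatal",
    0x800: "Platform Uncorrectable fatal",
}


def decode_error_types(mask):
    """Return the names of all error types whose bit is set in mask."""
    return [name for bit, name in sorted(EINJ_ERROR_TYPES.items())
            if mask & bit]


# The three types this box advertises, OR-ed together.
supported = 0x2 | 0x8 | 0x10
for name in decode_error_types(supported):
    print(name)
```

Everything in the table that is not printed is a type this firmware does
not offer for injection.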

Looks like some old EINJ without all the features. Oh well, let's see
what'll happen anyway:

# echo 0x8 > error_type
# echo 1 > error_inject

[  840.461666] mce: [Hardware Error]: Machine check events logged
[  840.476221] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[  840.489214] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: 8c00004000010090
[  840.507685] EDAC sbridge MC0: TSC 0 
[  840.515223] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86 
[  840.532477] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448299322 SOCKET 0 APIC 0
[  840.551279] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[  840.563872] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8800004100800090
[  840.581970] EDAC sbridge MC0: TSC 0 
[  840.589513] EDAC sbridge MC0: ADDR 0 EDAC sbridge MC0: MISC 4908400040004200 
[  840.606267] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448299322 SOCKET 0 APIC 0
[  841.499090] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)

So yeah, the mce_notify_irq() message ("Machine check events logged") is
visible there, i.e., we did mce_log() here, which sets mce_need_notify.
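
As an aside on reading the log: the page/offset pair in the final EDAC
line appears to be just the Bank 5 ADDR split at page granularity. A
small sketch, assuming 4 KiB pages (PAGE_SHIFT and split_addr are
illustrative names, not anything from the driver):

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def split_addr(addr):
    """Split a physical address into (page frame number, offset in page)."""
    page = addr >> PAGE_SHIFT
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    return page, offset


page, offset = split_addr(0xBB68EC00)  # ADDR from the Bank 5 record
print("page:%#x offset:%#x" % (page, offset))
```

For ADDR bb68ec00 this gives page 0xbb68e and offset 0xc00, which matches
the "page:0xbb68e offset:0xc00" in the CE report.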

# echo 0x2 > error_type
# echo 1 > error_inject
bash: echo: write error: Invalid argument
[  885.272000] [Firmware Warn]: APEI: Invalid action table, unknown instruction type: 5

ACPI_EINJ_FLUSH_CACHELINE??

Yeah, we're missing some functionality.
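
For reference, a sketch of how instruction type 5 maps to a name. The
numbering below is my recollection of the EINJ instruction enum in the
ACPICA headers, so take it as an assumption; EinjInstruction and
describe are illustrative names only.

```python
from enum import IntEnum


class EinjInstruction(IntEnum):
    # Assumed to mirror the ACPI_EINJ_* instruction values.
    READ_REGISTER = 0
    READ_REGISTER_VALUE = 1
    WRITE_REGISTER = 2
    WRITE_REGISTER_VALUE = 3
    NOOP = 4
    FLUSH_CACHELINE = 5


def describe(code):
    """Map a numeric instruction type to its ACPI_EINJ_* name, if known."""
    try:
        return "ACPI_EINJ_" + EinjInstruction(code).name
    except ValueError:
        return "unknown instruction type %d" % code


print(describe(5))
```

If the numbering holds, the firmware's action table uses an instruction
the kernel's interpreter does not implement, hence the -EINVAL on write.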

# echo 0x10 > error_type
# echo 1 > error_inject

That went BOOM:

[ 1296.233435] Disabling lock debugging due to kernel taint
[ 1296.248010] mce: [Hardware Error]: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090
[ 1296.269245] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8136260f> {intel_idle+0xbf/0x130}
[ 1296.290735] mce: [Hardware Error]: TSC 37c1fb53beb ADDR bb68f400 MISC 20401a9a86 
[ 1296.309772] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1448299778 SOCKET 0 APIC c microcode 710
[ 1296.332058] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 1296.346094] EDAC sbridge MC0: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090
[ 1296.366517] EDAC sbridge MC0: TSC 37c1fb53beb 
[ 1296.375974] EDAC sbridge MC0: ADDR bb68f400 EDAC sbridge MC0: MISC 20401a9a86 
[ 1296.394493] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448299778 SOCKET 0 APIC c
[ 1296.416153] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68f offset:0x400 grain:32 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
...

Judging by the CPU numbers, it looks like node 0 got that error in the shared bank:

.... node  #0, CPUs:          #1   #2   #3   #4   #5   #6   #7
.... node  #0, CPUs:    #32  #33  #34  #35  #36  #37  #38  #39

and it finishes with:

[ 1299.907994] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 1299.926783] Kernel panic - not syncing: Fatal machine check
[ 1299.959632] Kernel Offset: disabled
[ 1299.984254] Rebooting in 100 seconds..
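
The "Processor context corrupt" verdict can be read straight off the
Bank 5 status word. A minimal decoder sketch, assuming the architectural
IA32_MCi_STATUS layout from the Intel SDM (FLAGS and decode_status are
illustrative names; model-specific fields are not interpreted):

```python
# Architectural flag bits in the upper part of IA32_MCi_STATUS.
FLAGS = {
    "VAL": 63,    # record is valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_status(status):
    """Decode the flag bits and error codes of an MCi_STATUS value."""
    out = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    out["mca_code"] = status & 0xFFFF             # bits 15:0
    out["model_code"] = (status >> 16) & 0xFFFF   # bits 31:16
    return out


for status in (0x8C00004000010090, 0xBE00000000010090):
    d = decode_status(status)
    flags = [name for name in FLAGS if d[name]]
    print("%016x: %s err_code:%04x:%04x"
          % (status, " ".join(flags), d["model_code"], d["mca_code"]))
```

The corrected error from the first injection has neither UC nor PCC set,
while be00000000010090 has both, which is why this one ends in a panic.
Both carry the same 0001:0090 code pair that EDAC prints as err_code.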

Now the same with dont_log_ce set on all CPUs:

$ for i in $(seq 0 63); do echo 1 >  /sys/devices/system/machinecheck/machinecheck$i/dont_log_ce; cat /sys/devices/system/machinecheck/machinecheck$i/dont_log_ce; done | uniq
1

# echo 0x8 > error_type
# echo 1 > error_inject

[  318.263797] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[  318.277029] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: 8c00004000010090
[  318.295631] EDAC sbridge MC0: TSC 0 
[  318.303143] EDAC sbridge MC0: ADDR bb68f000 EDAC sbridge MC0: MISC 2040262686 
[  318.320473] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448300397 SOCKET 0 APIC 0
[  318.809112] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68f offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)

This looks OK: we're missing the mce_notify_irq() line "mce: [Hardware
Error]: Machine check events logged", which is as expected, but the EDAC
lines are still there because we sent the error on the notify chain.
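
If someone wants to repeat the comparison on another box, the check
boils down to something like the sketch below. It only pattern-matches
the two message strings seen in the logs above; summarize and the
constant names are made up for illustration, and the EDAC string will
differ with another driver or memory controller.

```python
NOTIFY_MSG = "mce: [Hardware Error]: Machine check events logged"
EDAC_MSG = "EDAC sbridge MC0: HANDLING MCE MEMORY ERROR"


def summarize(dmesg_lines):
    """Report whether the notify message and EDAC reports show up."""
    lines = list(dmesg_lines)
    return {
        "notify_logged": any(NOTIFY_MSG in line for line in lines),
        "edac_reports": sum(EDAC_MSG in line for line in lines),
    }


# Excerpts from the two correctable-error runs in this mail.
default_run = [
    "[  840.461666] mce: [Hardware Error]: Machine check events logged",
    "[  840.476221] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR",
    "[  840.551279] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR",
]
dont_log_ce_run = [
    "[  318.263797] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR",
]

print(summarize(default_run))
print(summarize(dont_log_ce_run))
```

With dont_log_ce set, the expectation is notify_logged False and at
least one EDAC report.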

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.