[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20230425121829.61755-1-xueshuai@linux.alibaba.com>
Date: Tue, 25 Apr 2023 20:18:29 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: bp@...en8.de, tony.luck@...el.com
Cc: tglx@...utronix.de, mingo@...hat.com, dave.hansen@...ux.intel.com,
x86@...nel.org, hpa@...or.com, baolin.wang@...ux.alibaba.com,
linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: [PATCH] x86/mce/amd: init mce severity to handle deferred memory failure
When a deferred UE error is detected, e.g by background patrol scruber, it
will be handled in APIC interrupt handler amd_deferred_error_interrupt().
The handler will collect MCA banks, init mce struct and process it by
nofitying the registered MCE decode chain.
The uc_decode_notifier, one of MCE decode chain, will process memory
failure but only limit to MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
However, APIC interrupt handler does not init mce severity and the
uninitialized severity is 0 (MCE_NO_SEVERITY).
To handle the deferred memory failure case, init mce severity when logging
MCA banks.
Signed-off-by: Shuai Xue <xueshuai@...ux.alibaba.com>
---
Steps to reproduce:
step 1: inject a patrol scrub error by ras-tools
#einj_mem_uc patrol
step 2: check dmesg, no memory failure log
#dmesg -c
[51295.686806] mce: [Hardware Error]: Machine check events logged
[51295.693566] mce->status: 0x942031000400011b
[51295.698248] mce->misc: 0x00000000
[51295.701952] mce->severity: 0x00000000 # Manually added printk
[51295.726640] [Hardware Error]: Deferred error, no action required.
[51295.733448] [Hardware Error]: CPU:65 (19:11:1) MC21_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0x942031000400011b
[51295.733452] [Hardware Error]: Error Addr: 0x0000000006350a00
[51295.733453] [Hardware Error]: PPIN: 0x02b69e294c148024
[51295.733453] [Hardware Error]: IPID: 0x0000109600250f00, Syndrome: 0x9a4a00000b800000
[51295.733455] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[51295.733463] mce: umc_normaddr_to_sysaddr: Invalid DramBaseAddress range: 0x0.
[51295.733471] EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#0channel#2 (csrow:0 channel:2 page:0x0 offset:0x0 grain:64)
[51295.733471] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
After this fix:
[ 514.966892] mce: [Hardware Error]: Machine check events logged
[ 514.966912] mce->status: 0x942031000400011b
[ 514.978093] mce->misc: 0x00000000
[ 514.981796] mce->severity: 0x00000001
[ 514.985885] <uc_decode_notifier> pre_handler: p->addr = 0x00000000e09e69e4, ip = ffffffff8104b955, flags = 0x282
[ 514.997253] <uc_decode_notifier> post_handler: p->addr = 0x00000000e09e69e4, flags = 0x282
[ 515.006501] Memory failure: 0x5dc2: recovery action for free buddy page: Recovered
[ 515.015188] [Hardware Error]: Deferred error, no action required.
[ 515.022006] [Hardware Error]: CPU:67 (19:11:1) MC21_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0x942031000400011b
[ 515.034440] [Hardware Error]: Error Addr: 0x0000000005dc2a00
[ 515.034442] [Hardware Error]: PPIN: 0x02b69e294c148024
[ 515.034443] [Hardware Error]: IPID: 0x0000109600650f00, Syndrome: 0x9a4a00000b800008
[ 515.034445] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 515.034453] umc_normaddr_to_sysaddr: Invalid DramBaseAddress range: 0x0.
[ 515.034458] EDAC MC1: 1 UE Cannot decode normalized address on mc#1csrow#0channel#6 (csrow:0 channel:6 page:0x0 offset:0x0 grain:64)
[ 515.034461] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Note, the memory_failure handles wrong physical address because
umc_normaddr_to_sysaddr fails. I don't figure out why it fails.
---
arch/x86/kernel/cpu/mce/amd.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 23c5072fbbb7..b5e1a27b0881 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -734,6 +734,7 @@ static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
m.misc = misc;
m.bank = bank;
m.tsc = rdtsc();
+ m.severity = mce_severity(&m, NULL, NULL, false);
if (m.status & MCI_STATUS_ADDRV) {
m.addr = addr;
--
2.20.1.12.g72788fdb
Powered by blists - more mailing lists