Message-ID: <1407678135.9689.4.camel@debian>
Date: Sun, 10 Aug 2014 21:42:15 +0800
From: Chen Yucong <slaoub@...il.com>
To: Tony Luck <tony.luck@...il.com>
Cc: "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: kill the current thread if MCG_STATUS_RIPV is not set
Hi Tony Luck,
According to the x86 ASDM vol.3A 15.9.3.2, a Recoverable-not-continuable
SRAR Error (RIPV=0, EIPV=x) covers the following two cases:
- IA32_MCG_STATUS.RIPV=0, IA32_MCG_STATUS.EIPV=0, or
- IA32_MCG_STATUS.RIPV=0, IA32_MCG_STATUS.EIPV=1.
For the first case, the MCE handler panics the kernel directly, according
to this entry in severities[]:
    /* Neither return not error IP -- no chance to recover -> PANIC */
    MCESEV(
        PANIC, "Neither restart nor error IP",
        MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, 0)
    ),
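To make the matching rule concrete, here is a minimal user-space sketch of
how a mask/value pair such as MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, 0)
can be tested against IA32_MCG_STATUS. The struct mcg_rule type and the
mcg_rule_matches() helper are hypothetical names used only for illustration;
they are not the kernel's own definitions. The bit positions follow the
architectural layout of IA32_MCG_STATUS (RIPV is bit 0, EIPV is bit 1).

```c
#include <stdint.h>

/* IA32_MCG_STATUS flag bits (architectural layout). */
#define MCG_STATUS_RIPV (1ULL << 0) /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1) /* error IP valid */

/* Hypothetical stand-in for the mask/value part of a severities[] entry. */
struct mcg_rule {
	uint64_t mask;   /* which MCG_STATUS bits the rule looks at */
	uint64_t result; /* value those bits must have for a match */
};

/* Returns 1 when the masked status bits equal the expected value. */
int mcg_rule_matches(const struct mcg_rule *rule, uint64_t mcgstatus)
{
	return (mcgstatus & rule->mask) == rule->result;
}

/* "Neither restart nor error IP": both RIPV and EIPV must be clear. */
const struct mcg_rule neither_ip_rule = {
	.mask = MCG_STATUS_RIPV | MCG_STATUS_EIPV,
	.result = 0,
};
```

With this rule, a status with RIPV=0 and EIPV=0 matches (the panic case),
while RIPV=0 and EIPV=1 does not match and falls through to later entries.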
For the second case, the MCE handler should kill the current thread
directly, as the ASDM vol.3A 15.9.3.2 states:
    The current executing thread cannot be continued. System software must
    terminate the interrupted stream of execution and provide a new stream
    of execution on return from the machine check handler for the affected
    logical processor.
In fact, however, the MCE handler does not kill the current thread.
Instead it defers to further handling (it invokes memory_failure() via
TIF_MCE_NOTIFY).
I am confused by the gap between the documentation and the source code.
Perhaps a small fix is needed.
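The difference comes down to the order of two checks. The following
user-space sketch models that order only; it is not kernel code. The
enum names and the decide_before()/decide_after() helpers are hypothetical,
and it assumes kill_it is simply set whenever MCG_STATUS_RIPV is clear.

```c
/* Hypothetical outcomes of the final decision in the handler. */
enum action { ACT_NONE, ACT_NOTIFY, ACT_KILL };

/* Hypothetical severity levels; only the action-required one matters here. */
enum severity { SEV_OTHER, SEV_AR };

/* Existing order: the AR check comes first, kill_it is only a fallback. */
enum action decide_before(int kill_it, enum severity worst)
{
	if (worst == SEV_AR)
		return ACT_NOTIFY; /* models set_thread_flag(TIF_MCE_NOTIFY) */
	else if (kill_it)
		return ACT_KILL;   /* models force_sig(SIGBUS, current) */
	return ACT_NONE;
}

/* Proposed order: kill_it (RIPV clear) takes precedence over AR handling. */
enum action decide_after(int kill_it, enum severity worst)
{
	if (kill_it)
		return ACT_KILL;
	else if (worst == SEV_AR)
		return ACT_NOTIFY;
	return ACT_NONE;
}
```

The two functions disagree only when kill_it is set and the worst severity
is the action-required one: the existing order schedules the notification,
the proposed order kills the thread.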
thx!
cyc
Signed-off-by: Chen Yucong <slaoub@...il.com>
---
arch/x86/kernel/cpu/mcheck/mce.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index bd9ccda..3394494 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1055,9 +1055,12 @@ void do_machine_check(struct pt_regs *regs, long error_code)
/*
* When no restart IP might need to kill or panic.
- * Assume the worst for now, but if we find the
- * severity is MCE_AR_SEVERITY we have other options.
+ * This indicates that the error is detected at the instruction
+ * pointer saved on the stack for this machine check exception
+ * and restarting execution with the interrupted context is not
+ * possible.(ASDM vol.3A 15.9.3.2)
*/
+
if (!(m.mcgstatus & MCG_STATUS_RIPV))
kill_it = 1;
@@ -1154,12 +1157,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
if (cfg->tolerant < 3) {
if (no_way_out)
mce_panic("Fatal machine check on current CPU", &m, msg);
- if (worst == MCE_AR_SEVERITY) {
+
+ if (kill_it) {
+ force_sig(SIGBUS, current);
+ } else if (worst == MCE_AR_SEVERITY) {
/* schedule action before return to userland */
mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
set_thread_flag(TIF_MCE_NOTIFY);
- } else if (kill_it) {
- force_sig(SIGBUS, current);
}
}
--
1.7.10.4