Message-ID: <3908561D78D1C84285E8C5FCA982C28F32A5D502@ORSMSX114.amr.corp.intel.com>
Date: Thu, 9 Apr 2015 18:22:02 +0000
From: "Luck, Tony" <tony.luck@...el.com>
To: Borislav Petkov <bp@...en8.de>,
Naoya Horiguchi <n-horiguchi@...jp.nec.com>
CC: Ingo Molnar <mingo@...nel.org>,
Prarit Bhargava <prarit@...hat.com>,
Vivek Goyal <vgoyal@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Junichi Nomura <j-nomura@...jp.nec.com>,
Kiyoshi Ueda <k-ueda@...jp.nec.com>
Subject: RE: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
> Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> account for that, no?
>
> And if those are offlined, they're very very unlikely to trigger an MCE
> as they're idle and not executing code.
Let's step back a few feet and look at the big picture. There are three main classes of machine check
that we might see while trying to run kdump - and remember that all machine checks are currently
broadcast, so all cpus, whether online or offline, will see them:
1) Fatal
We have to crash and lose the dump. A new machine check handler will make it a bit easier
to see what happened, because we won't have any synchronization failed messages from the offline
cpus.
2) Execution path recoverable (SRAR in SDM parlance).
Also going to be fatal (kdump is all running in ring 0, and we can't recover from errors in ring 0). Cleaner
messages as above. Potentially in the future we might be able to make the kdump machine check handler
actually recover by just skipping a page - if the location of the error was in the old kernel image.
3) Non-execution path recoverable (SRAO in SDM parlance).
We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
of execution of the current context.
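
As a rough illustration of the three classes above, here is a minimal user-space sketch in C of how a
handler might sort a bank's MCi_STATUS value into them. The bit positions follow the SDM layout and
assume the cpu advertises software error recovery (the S and AR bits are only meaningful then). The
names classify_mce(), kdump_can_continue() and enum mce_class are hypothetical, made up for this
example; this is not the handler from the patch. Letting kdump continue for corrected or unsignaled
errors is also an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCi_STATUS bits, per the SDM machine-check architecture chapter. */
#define STATUS_VAL (1ULL << 63) /* register contains valid error info */
#define STATUS_UC  (1ULL << 61) /* uncorrected error */
#define STATUS_PCC (1ULL << 57) /* processor context corrupt */
#define STATUS_S   (1ULL << 56) /* signaled via machine check exception */
#define STATUS_AR  (1ULL << 55) /* action required */

/* Hypothetical classification names, for illustration only. */
enum mce_class { MCE_NONE, MCE_FATAL, MCE_SRAR, MCE_SRAO, MCE_OTHER };

static enum mce_class classify_mce(uint64_t status)
{
	if (!(status & STATUS_VAL))
		return MCE_NONE;	/* nothing logged in this bank */
	if (!(status & STATUS_UC))
		return MCE_OTHER;	/* corrected error */
	if (status & STATUS_PCC)
		return MCE_FATAL;	/* class 1: context corrupt */
	if (!(status & STATUS_S))
		return MCE_OTHER;	/* uncorrected, no action (UCNA) */
	/* Uncorrected, signaled, context intact: AR picks class 2 or 3. */
	return (status & STATUS_AR) ? MCE_SRAR : MCE_SRAO;
}

/* Assumption: only class 3 and the non-signaled cases let kdump carry on. */
static bool kdump_can_continue(enum mce_class c)
{
	switch (c) {
	case MCE_FATAL:
	case MCE_SRAR:
		return false;	/* ring 0 cannot recover from these today */
	default:
		return true;	/* SRAO: choose to take no action */
	}
}
```

A real handler would walk every bank and combine the results, taking the worst class seen across all
of them before deciding whether to keep going.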
-Tony