Message-ID: <20150409190550.GJ25434@pd.tnic>
Date: Thu, 9 Apr 2015 21:05:51 +0200
From: Borislav Petkov <bp@...en8.de>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
Ingo Molnar <mingo@...nel.org>,
Prarit Bhargava <prarit@...hat.com>,
Vivek Goyal <vgoyal@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Junichi Nomura <j-nomura@...jp.nec.com>,
Kiyoshi Ueda <k-ueda@...jp.nec.com>
Subject: Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump

On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > account for that, no?
> >
> > And if those are offlined, they're very very unlikely to trigger an MCE
> > as they're idle and not executing code.
>
> Let's step back a few feet and look at the big picture. There are three main classes of machine check
> that we might see while trying to run kdump - and remember that all machine checks are currently
> broadcast, so all cpus, whether online or offline, will see them.
>
> 1) Fatal
> We have to crash - lose the dump. Having a new machine check handler will make things a bit easier
> to see what happened because we won't have any synchronization failed messages from the offline
> cpus.
But this should not be a problem if the kdump path keeps cpu_online_mask
up to date. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
similar. Those should clear the stopped CPUs from cpu_online_mask, and then
mce_start() will work fine on the crashing CPU.

IMHO, of course.
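
To make the argument concrete, here is a rough userspace model of it, not
kernel code. It assumes the rendezvous in mce_start() completes once the
number of CPUs that have checked in matches num_online_cpus(). The names
cpumask_model_t, model_num_online_cpus(), model_crash_shutdown() and
model_rendezvous_done() are made up for illustration, and the mask is a plain
64-bit integer rather than a real struct cpumask.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for cpu_online_mask: bit n set means CPU n is online. */
typedef uint64_t cpumask_model_t;

/* Count the set bits; models what num_online_cpus() would report. */
static unsigned int model_num_online_cpus(cpumask_model_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Hypothetical crash-path step: every CPU except the crashing one is
 * dropped from the online mask.
 */
static cpumask_model_t model_crash_shutdown(cpumask_model_t mask,
					    unsigned int crashing_cpu)
{
	assert(crashing_cpu < 64);
	return mask & ((cpumask_model_t)1 << crashing_cpu);
}

/*
 * Assumed rendezvous rule: done once the CPUs that checked in cover
 * all CPUs still marked online.
 */
static int model_rendezvous_done(unsigned int arrived, cpumask_model_t mask)
{
	return arrived >= model_num_online_cpus(mask);
}
```

With four CPUs marked online (mask 0xF) and only the crashing CPU checking
in, the model never completes the rendezvous. Once the mask has been reduced
to the crashing CPU alone, a single arrival is enough.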
> 2) Execution path recoverable (SRAR in SDM parlance).
> Also going to be fatal (kdump is all running in ring 0, and we can't recover from errors in ring 0). Cleaner
> messages as above. Potentially in the future we might be able to make the kdump machine check handler
> actually recover by just skipping a page - if the location of the error was in the old kernel image.
>
> 3) Non-execution path recoverable (SRAO in SDM)
> We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
> so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
> of execution of the current context.
Those could simply be made to go to dmesg during kdump, i.e. decouple
any MCE consumers. And we do that now anyway, i.e. on a box without mcelog
or some other RAS daemon running.

So we could reuse the normal handler - we just need to do some tweaking
first... AFAICT, of course. I believe that in that endeavor, the devil will
be in the details.
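
As a sketch of how the three classes discussed above would map to actions
while kdump is running, something like the following could serve as a
starting point. It is a standalone model only: the enums and
kdump_mce_action() are hypothetical names, not existing kernel interfaces,
and the policy simply restates the thread (Fatal and SRAR end the dump, SRAO
is logged and kdump carries on).

```c
#include <assert.h>

/* The three machine check classes from the discussion (model only). */
enum mce_class_model {
	MCE_MODEL_FATAL,	/* 1) Fatal */
	MCE_MODEL_SRAR,		/* 2) Execution path recoverable */
	MCE_MODEL_SRAO,		/* 3) Non-execution path recoverable */
};

/* What the kdump kernel would do about it (model only). */
enum kdump_action_model {
	KDUMP_MODEL_PANIC,		/* give up, the dump is lost */
	KDUMP_MODEL_LOG_AND_CONTINUE,	/* print to dmesg, keep dumping */
};

/*
 * Hypothetical policy: SRAR is treated as fatal because kdump runs
 * entirely in ring 0; SRAO is "action optional", so no action is taken
 * beyond logging.
 */
static enum kdump_action_model kdump_mce_action(enum mce_class_model c)
{
	assert(c == MCE_MODEL_FATAL || c == MCE_MODEL_SRAR ||
	       c == MCE_MODEL_SRAO);

	switch (c) {
	case MCE_MODEL_SRAO:
		return KDUMP_MODEL_LOG_AND_CONTINUE;
	case MCE_MODEL_FATAL:
	case MCE_MODEL_SRAR:
	default:
		return KDUMP_MODEL_PANIC;
	}
}
```

The SRAR case is the one that might change later, if the handler learns to
skip a bad page that lies in the old kernel image.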
Thanks.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.