Message-ID: <20150410004933.GA21978@hori1.linux.bs1.fc.nec.co.jp>
Date: Fri, 10 Apr 2015 00:49:33 +0000
From: Naoya Horiguchi <n-horiguchi@...jp.nec.com>
To: Borislav Petkov <bp@...en8.de>
CC: "Luck, Tony" <tony.luck@...el.com>, Ingo Molnar <mingo@...nel.org>,
"Prarit Bhargava" <prarit@...hat.com>,
Vivek Goyal <vgoyal@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Junichi Nomura <j-nomura@...jp.nec.com>,
Kiyoshi Ueda <k-ueda@...jp.nec.com>
Subject: Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump

On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > account for that, no?
> > >
> > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > as they're idle and not executing code.
> >
> > Let's step back a few feet and look at the big picture. There are three main classes of machine check
> > that we might see while trying to run kdump - and remember that all machine checks are currently
> > broadcast, so all cpus whether online or offline will see them.
> >
> > 1) Fatal
> > We have to crash - lose the dump. Having a new machine check handler will make things a bit easier
> > to see what happened because we won't have any synchronization failed messages from the offline
> > cpus.
>
> But this should not be a problem if the kdump path keeps cpu_online_mask
> up to date. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> so. Those should clear cpu_online_mask and then mce_start() will work
> fine on the crashing CPU.
>
> IMHO, of course.

Sorry, I misread you. If cpu_online_mask is cleared in the shootdown path
(not done yet), raising tolerance should work without the timeout message.
So I think you are right.
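
To illustrate why an up-to-date mask matters, here is a rough userspace
sketch, not kernel code. It assumes the handler waits for one check-in per
CPU counted as online, which is how I read the num_online_cpus() use in
mce_start(). Every toy_* name is made up for this example, and the mask is
a plain integer rather than a real cpumask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for cpu_online_mask: bit n set means CPU n is online. */
typedef uint64_t toy_cpumask_t;

/* Count set bits, i.e. a toy num_online_cpus(). */
static int toy_num_online(toy_cpumask_t mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)  /* clear the lowest set bit each pass */
        n++;
    return n;
}

/* Hypothetical: what the shootdown callback would do for a stopped CPU. */
static toy_cpumask_t toy_shootdown_cpu(toy_cpumask_t mask, int cpu)
{
    return mask & ~((toy_cpumask_t)1 << cpu);
}

/*
 * Toy rendezvous check: the handler expects one check-in per online CPU.
 * Returns false when fewer CPUs checked in than expected, which stands in
 * for the synchronization timeout.
 */
static bool toy_rendezvous_ok(toy_cpumask_t online, int checked_in)
{
    return checked_in >= toy_num_online(online);
}
```

With four CPUs marked online and only the crashing CPU checking in,
toy_rendezvous_ok() fails. After toy_shootdown_cpu() is applied to the
other three, the expected count drops to one and the same check passes.
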
> > 2) Execution path recoverable (SRAR in SDM parlance).
> > Also going to be fatal (kdump is all running in ring0, and we can't recover from errors in ring 0). Cleaner
> > messages as above. Potentially in the future we might be able to make the kdump machine check handler
> > actually recover by just skipping a page - if the location of the error was in the old kernel image.
> >
> > 3) Non-execution path recoverable (SRAO in SDM)
> > We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
> > so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
> > of execution of the current context.
>
> Those could be simply made to go to dmesg during kdump, i.e. decouple
> any MCE consumers. And we do that now anyway, i.e. box without mcelog or
> some other ras daemon running.
>
> So we could reuse the normal handler - we just need to do some tweaking
> first... AFAICT, of course. I believe in that endeavor, the devil will
> be in the detail.

OK, I'll try this approach, updating cpu_online_mask in the shootdown path.
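
For reference, the three classes Tony lists and their outcome during kdump
could be summarized roughly as below. This is only a sketch of the policy
as described in this thread, not the actual handler. The enum and function
names are hypothetical, and the possible future recovery for SRAR (skipping
a page in the old kernel image) is deliberately not modeled.

```c
#include <assert.h>

/* The three machine check classes discussed above (names are made up). */
enum toy_mce_class {
    TOY_MCE_FATAL,  /* 1) fatal: the dump is lost */
    TOY_MCE_SRAR,   /* 2) action required: kdump runs in ring 0, so also fatal */
    TOY_MCE_SRAO    /* 3) action optional: choose to take no action */
};

enum toy_kdump_action {
    TOY_KDUMP_PANIC,
    TOY_KDUMP_LOG_AND_CONTINUE
};

/* Map a class to what the kdump kernel would do under this policy. */
static enum toy_kdump_action toy_kdump_policy(enum toy_mce_class c)
{
    switch (c) {
    case TOY_MCE_SRAO:
        /* Error does not affect the current context: log to dmesg, go on. */
        return TOY_KDUMP_LOG_AND_CONTINUE;
    case TOY_MCE_SRAR:
    case TOY_MCE_FATAL:
    default:
        return TOY_KDUMP_PANIC;
    }
}
```
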
Thanks,
Naoya Horiguchi