linux-kernel - Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump

Message-ID: <20150410004933.GA21978@hori1.linux.bs1.fc.nec.co.jp>
Date:	Fri, 10 Apr 2015 00:49:33 +0000
From:	Naoya Horiguchi <n-horiguchi@...jp.nec.com>
To:	Borislav Petkov <bp@...en8.de>
CC:	"Luck, Tony" <tony.luck@...el.com>, Ingo Molnar <mingo@...nel.org>,
	"Prarit Bhargava" <prarit@...hat.com>,
	Vivek Goyal <vgoyal@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Junichi Nomura <j-nomura@...jp.nec.com>,
	Kiyoshi Ueda <k-ueda@...jp.nec.com>
Subject: Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump

On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > account for that, no?
> > >
> > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > as they're idle and not executing code.
> > 
> > Let's step back a few feet and look at the big picture.  There are three main classes of machine check
> > that we might see while trying to run kdump - and remember that all machine checks are currently
> > broadcast, so all cpus whether online or offline will see them.
> > 
> > 1) Fatal
> > We have to crash - lose the dump.  Having a new machine check handler will make things a bit easier
> > to see what happened because we won't have any synchronization failed messages from the offline
> > cpus.
> 
> But this should not be a problem if the kdump path keeps cpu_online_mask
> up to date. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> so. Those should clear cpu_online_mask and then mce_start() will work
> fine on the crashing CPU.
> 
> IMHO, of course.

Sorry, I misread you. If we clear cpu_online_mask in the shootdown path
(not done yet), raising tolerance should work without the timeout message.
So I think you are right.
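
To make the reasoning concrete, here is a rough userspace model of it. This
is not kernel code: every name below is made up for illustration, and it
assumes that only the crashing CPU checks in to the rendezvous while the
shot-down CPUs never do. The rendezvous succeeds only when the number of
CPUs that check in reaches the weight of the online mask, so it times out
unless the shootdown clears the stopped CPUs from that mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an online mask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpumask_model_t;

/* Count the CPUs set in the mask (stand-in for num_online_cpus()). */
static unsigned int model_weight(cpumask_model_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Model of the shootdown step: the stopped CPU leaves the online mask. */
static cpumask_model_t model_shootdown(cpumask_model_t online, unsigned int cpu)
{
	assert(cpu < 64);
	return online & ~((cpumask_model_t)1 << cpu);
}

/*
 * Model of the rendezvous: it completes only if at least as many CPUs
 * check in as the online mask says exist. Returning false stands for
 * the timeout case.
 */
static bool model_rendezvous(cpumask_model_t online, unsigned int checked_in)
{
	return checked_in >= model_weight(online);
}
```

With four CPUs marked online and only the crashing CPU checking in,
model_rendezvous(0xF, 1) fails; after model_shootdown() has removed CPUs
1 to 3 from the mask, the same call succeeds.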

> > 2) Execution path recoverable (SRAR in SDM parlance).
> > Also going to be fatal (kdump is all running in ring0, and we can't recover from errors in ring 0). Cleaner
> > messages as above. Potentially in the future we might be able to make the kdump machine check handler
> > actually recover by just skipping a page - if the location of the error was in the old kernel image.
> > 
> > 3) Non-execution path recoverable (SRAO in SDM)
> > We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
> > so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
> > of execution of the current context.
> 
> Those could be simply made to go to dmesg during kdump, i.e. decouple
> any MCE consumers. And we do that now anyway, i.e. box without mcelog or
> some other ras daemon running.
> 
> So we could reuse the normal handler - we just need to do some tweaking
> first... AFAICT, of course. I believe in that endeavor, the devil will
> be in the detail.

OK, I'll try this approach, together with updating cpu_online_mask.
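
For reference, the three classes Tony listed and what each means for kdump
can be summarised in a small sketch. Again this is only an illustration
under my reading of the discussion above: the enum and function names are
hypothetical and do not correspond to anything in the kernel's severity
code.

```c
/* Hypothetical labels for the three classes discussed in this thread. */
enum mce_class_model {
	MCE_MODEL_FATAL,	/* 1) fatal */
	MCE_MODEL_SRAR,		/* 2) execution path recoverable */
	MCE_MODEL_SRAO,		/* 3) non-execution path recoverable */
};

enum kdump_action_model {
	KDUMP_MODEL_ABORT,	/* crash, the dump is lost */
	KDUMP_MODEL_CONTINUE,	/* log the event and keep dumping */
};

/*
 * Map a class to the outcome described above: only SRAO ("action
 * optional") lets kdump continue; SRAR is treated as fatal because
 * kdump runs entirely in ring 0.
 */
static enum kdump_action_model kdump_action_for(enum mce_class_model c)
{
	switch (c) {
	case MCE_MODEL_SRAO:
		return KDUMP_MODEL_CONTINUE;
	case MCE_MODEL_FATAL:
	case MCE_MODEL_SRAR:
	default:
		return KDUMP_MODEL_ABORT;
	}
}
```

The possible future improvement for SRAR (skipping a bad page that lies in
the old kernel image) is deliberately left out of this sketch.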

Thanks,
Naoya Horiguchi