linux-kernel - Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump


Message-ID: <20150428084159.GH15033@dhcp-16-116.nay.redhat.com>
Date:	Tue, 28 Apr 2015 16:41:59 +0800
From:	Baoquan He <bhe@...hat.com>
To:	Naoya Horiguchi <n-horiguchi@...jp.nec.com>
Cc:	Borislav Petkov <bp@...en8.de>, "Luck, Tony" <tony.luck@...el.com>,
	Ingo Molnar <mingo@...nel.org>,
	Prarit Bhargava <prarit@...hat.com>,
	Vivek Goyal <vgoyal@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Junichi Nomura <j-nomura@...jp.nec.com>,
	Kiyoshi Ueda <k-ueda@...jp.nec.com>
Subject: Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump

On 04/10/15 at 12:49am, Naoya Horiguchi wrote:
> On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > > account for that, no?
> > > >
> > > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > > as they're idle and not executing code.
> > > 
> > > Let's step back a few feet and look at the big picture.  There are three main classes of machine check
> > > that we might see while trying to run kdump - and remember that all machine checks are currently
> > > broadcast, so all cpus whether online or offline will see them.
> > > 
> > > 1) Fatal
> > > We have to crash - lose the dump.  Having a new machine check handler will make things a bit easier
> > > to see what happened because we won't have any synchronization failed messages from the offline
> > > cpus.
> > 
> > But this should not be a problem if the kdump path keeps cpu_online_mask
> > up to date. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> > so. Those should clear cpu_online_mask and then mce_start() will work
> > fine on the crashing CPU.
> > 
> > IMHO, of course.
> 
> Sorry, I misread you. If cpu_online_mask is cleared in shootdown (not done
> yet), raising tolerance should work without the timeout message.
> So I think you are right.

Hi Naoya,

Thanks for the great effort you have put into this issue.

I am trying to catch up and have read the mails in this thread.
Please also add me to the CC list when you post a new version. I would like
to review it.

Thanks
Baoquan

> 
> > > 2) Execution path recoverable (SRAR in SDM parlance).
> > > Also going to be fatal (kdump is all running in ring0, and we can't recover from errors in ring 0). Cleaner
> > > messages as above. Potentially in the future we might be able to make the kdump machine check handler
> > > actually recover by just skipping a page - if the location of the error was in the old kernel image.
> > > 
> > > 3) Non-execution path recoverable (SRAO in SDM)
> > > We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
> > > so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
> > > of execution of the current context.
> > 
> > Those could be simply made to go to dmesg during kdump, i.e. decouple
> > any MCE consumers. And we do that now anyway, i.e. on a box without mcelog
> > or some other ras daemon running.
> > 
> > So we could reuse the normal handler - we just need to do some tweaking
> > first... AFAICT, of course. I believe in that endeavor, the devil will
> > be in the detail.
> 
> OK, I'll try this approach with updating cpu_online_mask.
> 
> Thanks,
> Naoya Horiguchi
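
For readers following the thread, the three classes Tony lists and the
cpu_online_mask point Borislav raises can be sketched roughly as below. This
is only an illustration, not code from the patch or from the kernel: every
name in it (mce_class, kdump_mce_action, kdump_mce_decide,
expected_rendezvous_cpus) is hypothetical, and the online mask is modelled
as a plain unsigned long rather than a real cpumask.

```c
#include <stdbool.h>

/* Hypothetical names throughout; nothing here is taken from kernel sources. */

/* The three classes of machine check described in the thread. */
enum mce_class {
	MCE_CLASS_FATAL,	/* 1) fatal */
	MCE_CLASS_SRAR,		/* 2) execution path recoverable */
	MCE_CLASS_SRAO,		/* 3) non-execution path recoverable */
};

enum kdump_mce_action {
	KDUMP_MCE_PANIC,		/* crash, the dump is lost */
	KDUMP_MCE_LOG_AND_CONTINUE,	/* report to dmesg, keep dumping */
};

/*
 * During kdump everything runs in ring 0, so SRAR is treated as fatal,
 * just like class 1. Only SRAO ("action optional") lets the dump go on.
 */
static enum kdump_mce_action kdump_mce_decide(enum mce_class class)
{
	switch (class) {
	case MCE_CLASS_SRAO:
		return KDUMP_MCE_LOG_AND_CONTINUE;
	case MCE_CLASS_FATAL:
	case MCE_CLASS_SRAR:
	default:
		return KDUMP_MCE_PANIC;
	}
}

/*
 * Toy model of the rendezvous: the handler waits for as many CPUs as
 * are set in the online mask. If shootdown clears the stopped CPUs
 * from the mask, the crashing CPU only has to wait for itself.
 */
static unsigned int expected_rendezvous_cpus(unsigned long online_mask)
{
	unsigned int n = 0;

	while (online_mask) {
		online_mask &= online_mask - 1;	/* clear lowest set bit */
		n++;
	}
	return n;
}
```

With a mask of 0x1 (only the crashing CPU left online) the toy rendezvous
expects a single CPU, which is the situation in which no synchronization
timeout message would be expected.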
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/