linux-kernel - Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump

Message-ID: <20150410040726.GB3623@hori1.linux.bs1.fc.nec.co.jp>
Date:	Fri, 10 Apr 2015 04:07:26 +0000
From:	Naoya Horiguchi <n-horiguchi@...jp.nec.com>
To:	Borislav Petkov <bp@...en8.de>
CC:	"Luck, Tony" <tony.luck@...el.com>, Ingo Molnar <mingo@...nel.org>,
	"Prarit Bhargava" <prarit@...hat.com>,
	Vivek Goyal <vgoyal@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Junichi Nomura <j-nomura@...jp.nec.com>,
	Kiyoshi Ueda <k-ueda@...jp.nec.com>
Subject: Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump

On Fri, Apr 10, 2015 at 12:49:33AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > > account for that, no?
> > > >
> > > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > > as they're idle and not executing code.
> > > 
> > > Let's step back a few feet and look at the big picture.  There are three main classes of machine check
> > > that we might see while trying to run kdump - and remember that all machine checks are currently
> > > broadcast, so all cpus whether online or offline will see them
> > > 
> > > 1) Fatal
> > > We have to crash - lose the dump.  Having a new machine check handler will make things a bit easier
> > > to see what happened because we won't have any synchronization failed messages from the offline
> > > cpus.
> > 
> > But this should not be a problem if kdump path keeps cpu_online_mask
> > uptodate. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> > so. Those should clear cpu_online_mask and then mce_start() will work
> > fine on the crashing CPU.
> > 
> > IMHO, of course.
> 
> Sorry, I misread you. With cpu_online_mask cleared in shootdown (not done
> yet), raising tolerance should work without the timeout message.
> So I think you are right.

... wait, wouldn't changing cpu_online_mask confuse admins who try to
analyze the kdump, especially when the problem causing the panic is
CPU-related?

Similarly, changing the tolerant value loses the original value,
although this is unlikely to be a problem. But if we do change it, would
it be better to set an upper bit to mark the override and leave the
lowest 2 bits untouched, so that the original value is preserved?
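
To make the upper-bit idea concrete, here is a minimal userspace sketch.
It is not taken from the patch: the macro and function names, the choice
of bit 8 as the flag, and the assumption that the kdump path wants the
most permissive level (3) are all hypothetical, chosen only for
illustration.

```c
#include <stdio.h>

/* Hypothetical names and bit positions; none of this is from the patch. */
#define TOLERANT_VALUE_MASK  0x3u       /* lowest 2 bits: original tolerant level (0..3) */
#define TOLERANT_KDUMP_FLAG  (1u << 8)  /* upper bit: kdump override is active */
#define TOLERANT_KDUMP_LEVEL 3u         /* assumed level to use while the flag is set */

/* Mark the override without touching the stored original level. */
static inline unsigned int tolerant_set_kdump(unsigned int t)
{
	return t | TOLERANT_KDUMP_FLAG;
}

/* Drop the override; the original level is still in the low bits. */
static inline unsigned int tolerant_clear_kdump(unsigned int t)
{
	return t & ~TOLERANT_KDUMP_FLAG;
}

/* What an admin reading the dump would want: the pre-kdump setting. */
static inline unsigned int tolerant_original(unsigned int t)
{
	return t & TOLERANT_VALUE_MASK;
}

/* What the handler would act on: overridden level if flagged, else original. */
static inline unsigned int tolerant_effective(unsigned int t)
{
	return (t & TOLERANT_KDUMP_FLAG) ? TOLERANT_KDUMP_LEVEL
					 : (t & TOLERANT_VALUE_MASK);
}
```

With this layout, a value of 1 becomes 0x101 after tolerant_set_kdump():
the handler would see level 3, while the dump still shows that the
admin had configured 1.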