linux-kernel - Re: [PATCH v2 1/2] x86: mce: kexec: turn off MCE in kexec

Date:	Mon, 2 Mar 2015 02:31:19 +0000
From:	Naoya Horiguchi <n-horiguchi@...jp.nec.com>
To:	"Luck, Tony" <tony.luck@...el.com>
CC:	Borislav Petkov <bp@...en8.de>,
	Prarit Bhargava <prarit@...hat.com>,
	"Vivek Goyal" <vgoyal@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Junichi Nomura <j-nomura@...jp.nec.com>,
	Kiyoshi Ueda <k-ueda@...jp.nec.com>
Subject: Re: [PATCH v2 1/2] x86: mce: kexec: turn off MCE in kexec

On Fri, Feb 27, 2015 at 06:27:16PM +0000, Luck, Tony wrote:
> > When CR4.MCE=0b and an MCE happens, it will shutdown the system, at
> > least on Intel, according to Tony
> 
> I checked with the architects ... and I was right. If you clear CR4.MCE you'll still
> see the machine check - and you'll pull the big system reset lever.

Thank you for the confirmation.

> If you think the other cpus can survive the reset - then the right thing to do is to
> have any offline cpus that show up in the machine check handler just clear MCG_STATUS
> and return:
> 
> do_machine_check()
> {
> 	/* offline cpus may show up for the party - but don't need to do anything here - send them back home */
> 	if (!cpu_online(smp_processor_id())) {
> 		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> 		return;
> 	}

It seems that the kdump shootdown doesn't clear the stopped CPUs from the online
cpumask, so this cpu_online() check doesn't help with this (kdump-specific)
problem. But I think checking the number of online CPUs for MCE synchronization
is generally correct in other contexts (such as an MCE on a system where CPUs
have been hot-removed?), so it is worth doing in a separate patch.
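
To illustrate the point, here is a toy userspace model (not kernel code) of why
a cpu_online() test does not filter out CPUs parked by the shootdown. The names
cpu_model, stopped_mask, model_kdump_shootdown() and skip_by_stopped_check() are
hypothetical and exist only for this sketch; the assumption is simply that the
shootdown parks every CPU except the crashing one and leaves the online mask
untouched, as described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not kernel code: one bit per CPU (cpu < 64) in each mask. */
struct cpu_model {
	uint64_t online_mask;	/* stands in for the online cpumask */
	uint64_t stopped_mask;	/* hypothetical: CPUs parked by the shootdown */
};

static bool model_cpu_online(const struct cpu_model *m, unsigned int cpu)
{
	return (m->online_mask >> cpu) & 1;
}

/*
 * Assumed behaviour: the shootdown parks every online CPU except the
 * crashing one, and does NOT modify online_mask.
 */
static void model_kdump_shootdown(struct cpu_model *m, unsigned int crashing_cpu)
{
	m->stopped_mask = m->online_mask & ~(UINT64_C(1) << crashing_cpu);
}

/* Check in the style of the quoted snippet: skip CPUs that are not online. */
static bool skip_by_online_check(const struct cpu_model *m, unsigned int cpu)
{
	return !model_cpu_online(m, cpu);
}

/* Hypothetical kdump-aware check: skip CPUs the shootdown has parked. */
static bool skip_by_stopped_check(const struct cpu_model *m, unsigned int cpu)
{
	return (m->stopped_mask >> cpu) & 1;
}
```

With four online CPUs and CPU 0 crashing, skip_by_online_check() returns false
for CPU 2 after the shootdown, because CPU 2 is still marked online, while
skip_by_stopped_check() returns true for it.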

> If we are crashing because of a machine check - I wonder how useful it is to run kdump().  There are a very
> small set of ways that you can induce a machine check from program action - normally the problem is that
> something bad happened in the h/w ... a kdump will just fill your disk and waste your time looking at what
> the s/w was doing when the machine check happened.

I don't think every MCE makes the server inoperable. Uncorrected errors
(including SRAO and SRAR) are one good example.
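
As a rough sketch of that distinction, the severity of an error can be read
from a few bits of a bank's MCi_STATUS value. The bit positions below follow
the Intel SDM layout as I understand it; the enum and classify_mci_status() are
hypothetical names for this example only, and the sketch deliberately ignores
the EN and OVER bits and whether the CPU supports software error recovery at
all (S and AR are only meaningful when it does).

```c
#include <stdint.h>

/* MCi_STATUS bits used by this sketch */
#define MCI_STATUS_VAL	(1ULL << 63)	/* register contains a valid error */
#define MCI_STATUS_UC	(1ULL << 61)	/* error was not corrected */
#define MCI_STATUS_PCC	(1ULL << 57)	/* processor context corrupt */
#define MCI_STATUS_S	(1ULL << 56)	/* signaled via machine check */
#define MCI_STATUS_AR	(1ULL << 55)	/* software action required */

enum mce_class {
	MCE_NONE,	/* nothing logged in this bank */
	MCE_CORRECTED,	/* corrected by hardware */
	MCE_UCNA,	/* uncorrected, no action required */
	MCE_SRAO,	/* software recoverable, action optional */
	MCE_SRAR,	/* software recoverable, action required */
	MCE_FATAL	/* context corrupt, not recoverable */
};

/* Simplified classification; real handlers consider more state than this. */
static enum mce_class classify_mci_status(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return MCE_NONE;
	if (!(status & MCI_STATUS_UC))
		return MCE_CORRECTED;
	if (status & MCI_STATUS_PCC)
		return MCE_FATAL;
	if (!(status & MCI_STATUS_S))
		return MCE_UCNA;
	return (status & MCI_STATUS_AR) ? MCE_SRAR : MCE_SRAO;
}
```

Only the MCE_FATAL case necessarily takes the machine down; the SRAO and SRAR
cases are the ones where the kernel can still try to recover.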

And please note that the target of this patch is an MCE that arrives when the
kernel is already running kdump code (so the crash happened *not* because of the
MCE). In that case, we can expect kdump to work fine if the MCE hits one of the
"kdump shootdown" CPUs, which are just running a cpu_relax() loop, because the
2nd kernel's CPU isn't affected by the MCE (even if the CPU failure is a fatal
one).

If a fatal MCE happens on the CPU running the kdump code, there's no reason to
try harder to get a dump, as you pointed out. In that case, all we can do is
print a message like "kdump failed due to MCE" and reset the system.
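
The two cases above can be summarized as a small decision function. This is
only a sketch of the intended policy, not the patch itself: the enum, the
function name and its parameters are hypothetical, and the non-fatal case on
the kdump CPU is my assumption since it is not discussed above.

```c
#include <stdbool.h>

enum kdump_mce_action {
	KDUMP_MCE_HANDLE_NORMALLY,	/* not in kdump: regular MCE handling */
	KDUMP_MCE_IGNORE,		/* parked CPU: clear MCG_STATUS, return */
	KDUMP_MCE_CONTINUE_DUMP,	/* assumption: non-fatal, keep dumping */
	KDUMP_MCE_ABORT_DUMP		/* "kdump failed due to MCE", then reset */
};

/*
 * in_kdump:         the crash path has already started
 * cpu_is_shot_down: the MCE arrived on a CPU parked by the shootdown
 * fatal:            the error is not recoverable for the CPU that took it
 */
static enum kdump_mce_action kdump_mce_policy(bool in_kdump,
					       bool cpu_is_shot_down,
					       bool fatal)
{
	if (!in_kdump)
		return KDUMP_MCE_HANDLE_NORMALLY;

	/* A parked CPU does no work for the 2nd kernel, even if it is dead. */
	if (cpu_is_shot_down)
		return KDUMP_MCE_IGNORE;

	/* The CPU actually running kdump code took the error. */
	return fatal ? KDUMP_MCE_ABORT_DUMP : KDUMP_MCE_CONTINUE_DUMP;
}
```

For example, kdump_mce_policy(true, true, true) yields KDUMP_MCE_IGNORE, while
kdump_mce_policy(true, false, true) yields KDUMP_MCE_ABORT_DUMP.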

Thanks,
Naoya Horiguchi