Message-ID: <3908561D78D1C84285E8C5FCA982C28F3293FFBA@ORSMSX114.amr.corp.intel.com>
Date: Tue, 18 Nov 2014 18:30:55 +0000
From: "Luck, Tony" <tony.luck@...el.com>
To: Andy Lutomirski <luto@...capital.net>
CC: Borislav Petkov <bp@...en8.de>, Andi Kleen <andi@...stfloor.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
X86 ML <x86@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
Oleg Nesterov <oleg@...hat.com>
Subject: RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from
userspace
>> The lost cpu is *really* lost. Warm reset doesn't fix the machine, I usually
>> have to do a full power cycle.
> How is it even possible that I did that with a few lines of asm?
Probably not directly your fault - some cascade of errors may have occurred.
> Could this be a hardware bug? Is there some condition that causes #MC
> delivery to wedge hard enough that even INIT/RESET stops working? Or
> possibly some CPU got stuck in SMM -- I have no idea what warm reset
> does these days.
I'm not even sure what kind of reset the remote management i/f I used
actually applied.
> Here's the patch to improve the timeout messages, but given the degree
> of wedgedness, I can guess what it'll say:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/paranoid&id=e5cbd9d141bde651ecb20f0b65ad13bcef2468d0
Heh - I'd already put in some hacky printk()s to do something similar. Mine
aren't upstream quality, but they do print the value of
mce_callin/mce_executing as appropriate. But I got some confusing results -
reporter complained that only 142 of 144 had shown up. So two threads are
missing, which may mean one core went into h/w shutdown. I need to dig further
to see whether the missing duo really are from the same core.
-Tony