Message-ID: <53A45627.6090306@oracle.com>
Date: Fri, 20 Jun 2014 11:41:27 -0400
From: Boris Ostrovsky <boris.ostrovsky@...cle.com>
To: Borislav Petkov <bp@...en8.de>
CC: tony.luck@...el.com, linux-kernel@...r.kernel.org,
linux-edac@...r.kernel.org, mattieu.souchaud@...e.fr
Subject: Re: [PATCH] x86/mce: Don't unregister CPU hotplug notifier in error
path
On 06/20/2014 11:23 AM, Borislav Petkov wrote:
> On Fri, Jun 20, 2014 at 10:28:13AM -0400, Boris Ostrovsky wrote:
>> Commit 9c15a24b038f4d8da93a2bc2554731f8953a7c17 (x86/mce: Improve
>> mcheck_init_device() error handling) unregisters (or never registers)
>> MCE's hotplug notifier if an error is encountered.
> Well, mcheck_init_device() did encounter errors before that commit too,
> can you please go into detail on how exactly you're triggering this?
> Which error are you talking about exactly?
You can simulate this on baremetal by having, for example,
misc_register() fail (just add 'err = -EIO' after the call). Or you can
return an error right upon entry to mcheck_init_device() (I haven't
tested that though).
Then, after you have booted, do a couple of

    echo 0 > /sys/devices/system/cpu/cpu1/online
    echo 1 > /sys/devices/system/cpu/cpu1/online

Then sit still for about 10 minutes. I don't think any activity is
necessary.
At that point the system is dead. If you are lucky you may see messages
about soft lockups or RCU stalls, but often there is nothing.
> Lemme guess: some xen special handling which baremetal doesn't need.
Only in the sense that on Xen misc_register() often fails. But any
failure on baremetal will result in the same behavior.
>
>> Since unplugging a CPU would normally result in the notifier deleting
>> the MCE timer, we are now left with the timer running if a CPU is removed
>> on a system where mcheck_init_device() had failed.
>>
>> If we later hotplug this CPU back we add this timer again in
>> mcheck_cpu_init(). Eventually the two timers start interfering with each
>> other, causing soft lockups or system hangs.
>>
>> We should leave the notifier always on and, in fact, set it up early
>> during the boot.
> We do leave it always on - we only unregister it if we've encountered an
> error.
Right. And I think we shouldn't, because doing so leaves undeleted timers
behind on any CPU that is later taken offline.
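
To make the sequence concrete, here is a minimal user-space sketch of it
in C. It is a toy model under stated assumptions, not kernel code: struct
model_timer, struct model_cpu, model_cpu_offline(), model_cpu_online() and
model_hotplug_cycles() are hypothetical names made up for illustration, and
the notifier_registered flag stands in for whether MCE's hotplug notifier
is still in place.

```c
#include <stdbool.h>

/* Hypothetical model of a per-CPU timer; not a kernel structure. */
struct model_timer {
	bool pending;      /* timer is currently queued */
	int  double_adds;  /* times it was queued while already pending */
};

struct model_cpu {
	bool online;
	struct model_timer timer;
};

/* Queue the timer; record the case where it was already queued. */
static void model_add_timer(struct model_timer *t)
{
	if (t->pending)
		t->double_adds++;
	t->pending = true;
}

static void model_del_timer(struct model_timer *t)
{
	t->pending = false;
}

/*
 * CPU goes offline. The timer is deleted only if the notifier is
 * still registered, which models the notifier's job on unplug.
 */
static void model_cpu_offline(struct model_cpu *c, bool notifier_registered)
{
	c->online = false;
	if (notifier_registered)
		model_del_timer(&c->timer);
}

/* CPU comes back: the timer is added again unconditionally. */
static void model_cpu_online(struct model_cpu *c)
{
	c->online = true;
	model_add_timer(&c->timer);
}

/*
 * Run a number of offline/online cycles on one CPU whose timer was
 * armed at boot, and return how often a pending timer was re-added.
 */
int model_hotplug_cycles(bool notifier_registered, int cycles)
{
	struct model_cpu cpu = { .online = true, .timer = { .pending = true } };
	int i;

	for (i = 0; i < cycles; i++) {
		model_cpu_offline(&cpu, notifier_registered);
		model_cpu_online(&cpu);
	}
	return cpu.timer.double_adds;
}
```

With notifier_registered set, model_hotplug_cycles() never sees an add on
a pending timer. With it cleared, every offline/online cycle produces one,
which is the situation that the real timers end up in.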
-boris