linux-kernel - Re: x86/mce: machine check warning during poweroff

Message-ID: <1326856624.5291.20.camel@sbsiddha-mobl2>
Date:	Tue, 17 Jan 2012 19:17:04 -0800
From:	Suresh Siddha <suresh.b.siddha@...el.com>
To:	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Ming Lei <tom.leiming@...il.com>,
	Djalal Harouni <tixxdz@...ndz.org>,
	Borislav Petkov <borislav.petkov@....com>,
	Tony Luck <tony.luck@...el.com>,
	Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>,
	Ingo Molnar <mingo@...e.hu>, Andi Kleen <ak@...ux.intel.com>,
	linux-kernel@...r.kernel.org, Greg Kroah-Hartman <gregkh@...e.de>,
	Kay Sievers <kay.sievers@...y.org>,
	gouders@...bocholt.fh-gelsenkirchen.de,
	Marcos Souza <marcos.mage@...il.com>,
	Linux PM mailing list <linux-pm@...r.kernel.org>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	prasad@...ux.vnet.ibm.com, justinmattock@...il.com,
	Jeff Chua <jeff.chua.linux@...il.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Mel Gorman <mgorman@...e.de>,
	Gilad Ben-Yossef <gilad@...yossef.com>
Subject: Re: x86/mce: machine check warning during poweroff

On Tue, 2012-01-17 at 15:22 +0530, Srivatsa S. Bhat wrote:
> Thanks for the patch, but unfortunately it doesn't fix the problem!
> Exactly the same stack traces are seen during a CPU Hotplug stress test.
> (I didn't even have to stress it - it is so fragile that just a script
> to offline all cpus except the boot cpu was good enough to reproduce the
> problem easily.)

Hmm, that's weird. With the patch, sched_ilb_notifier() should have
cleared the cpu going offline from nohz.idle_cpus_mask. And this
should have happened after that cpu was removed from the active mask. So
no one else should add that cpu back to nohz.idle_cpus_mask, and this
should prevent the issue from happening.
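
To make the intended ordering concrete, here is a minimal user-space
model of it. This is only a sketch, not the kernel code: struct model,
model_enter_nohz_idle() and model_cpu_down() are hypothetical names,
plain bitmasks stand in for the cpumasks, and it assumes that the path
adding a cpu to the idle mask refuses cpus that are not in the active
mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 8 /* hypothetical size, chosen only for this model */

/* User-space stand-ins for the kernel cpumasks: one bit per cpu. */
struct model {
	uint32_t active_mask;    /* models the active mask */
	uint32_t idle_cpus_mask; /* models nohz.idle_cpus_mask */
};

static bool cpu_is_set(uint32_t mask, unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	return (mask >> cpu) & 1u;
}

/*
 * Assumption of this sketch: a cpu entering tickless idle may join the
 * idle mask only while it is still in the active mask.
 */
static void model_enter_nohz_idle(struct model *m, unsigned int cpu)
{
	if (!cpu_is_set(m->active_mask, cpu))
		return;
	m->idle_cpus_mask |= 1u << cpu;
}

/*
 * Offline path in the order described above: the cpu leaves the active
 * mask first, and only then is its bit cleared from the idle mask
 * (the step the notifier is expected to perform).
 */
static void model_cpu_down(struct model *m, unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	m->active_mask &= ~(1u << cpu);
	m->idle_cpus_mask &= ~(1u << cpu);
}
```

In this model, once model_cpu_down() has run for a cpu, a later
model_enter_nohz_idle() for the same cpu leaves the idle mask
unchanged, which is the property the patch relies on.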

I could reproduce the problem easily without the patch, but when I
applied the patch I couldn't recreate the issue. Srivatsa, can you
please re-check that the kernel you tested indeed has the fix?

Re-reviewing the code/patch doesn't give me a hint either.

> I have a few questions regarding the synchronization with CPU Hotplug.
> What guarantees that the code which selects and IPIs the new ilb is totally
> race-free with respect to CPU hotplug and we will never IPI an offline CPU?

So, nohz_balancer_kick() gets called only with interrupts disabled.
During that time (from selecting the ilb_cpu to sending the IPI), no cpu
can go offline, as the offline happens from the stop-machine process
context with interrupts disabled.

The only thing we need to make sure of is that the offlined cpu is not
part of nohz.idle_cpus_mask, and for post-3.2 code the posted patch
ensures that.
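
Why a stale bit matters can be shown with a small user-space model.
Again this is only a sketch under stated assumptions: first_cpu() and
model_kick() are hypothetical helpers, bitmasks stand in for the
cpumasks, and the kick is assumed to target the lowest-numbered cpu set
in the idle mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 8       /* hypothetical size, chosen only for this model */
#define NO_CPU  NR_CPUS /* sentinel: no cpu is set in the mask */

/* Return the lowest-numbered cpu set in mask, or NO_CPU if none is set. */
static unsigned int first_cpu(uint32_t mask)
{
	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return NO_CPU;
}

/*
 * Model of the kick: pick the target from the idle mask and report what
 * an IPI to it would hit.
 * Returns 1 if the target is online, 0 if it is offline (the failure
 * case), and -1 if the idle mask is empty and nothing is kicked.
 */
static int model_kick(uint32_t idle_cpus_mask, uint32_t online_mask,
		      unsigned int *target)
{
	assert(target != NULL);
	*target = first_cpu(idle_cpus_mask);
	if (*target == NO_CPU)
		return -1;
	return ((online_mask >> *target) & 1u) ? 1 : 0;
}
```

With cpu 1 offline but still set in the idle mask, model_kick() picks
cpu 1 and reports an offline target. With that bit cleared, it moves on
to the next idle cpu.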

For 3.2 and before, when a cpu exits tickless idle, it gets removed from
nohz.idle_cpus_mask (and also from nohz.load_balancer). And if
the cpu is not in the active mask (while going offline), subsequent
calls to select_nohz_load_balancer() ensure that the cpu going down
doesn't update the nohz structures. So I thought 3.2 shouldn't exhibit
this problem.

> (As demonstrated above, this issue is in 3.2-rc7
> as well.)

Hmm, I don't think we ran into this before 3.2. So, what am I missing
from the above? I will try to reproduce it on 3.2 too.

thanks,
suresh