linux-kernel - Re: [PATCH] tmp patch to fix hotplug issue in CMCI storm


Message-ID: <alpine.LFD.2.02.1206151122140.3086@ionos>
Date:	Fri, 15 Jun 2012 11:55:57 +0200 (CEST)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	Chen Gong <gong.chen@...ux.intel.com>
cc:	tony.luck@...el.com, borislav.petkov@....com, x86@...nel.org,
	peterz@...radead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] tmp patch to fix hotplug issue in CMCI storm

On Fri, 15 Jun 2012, Chen Gong wrote:
> On 2012/6/14 22:07, Thomas Gleixner wrote:
> > On Thu, 14 Jun 2012, Chen Gong wrote:
> > > this patch is based on tip tree and previous 5 patches.
> > 
> > You really don't need all this complexity to handle that. The main
> > thing is that you clear the storm state and adjust the storm counter
> > when the cpu goes offline (in case the state is ACTIVE).
> > 
> > When it comes online again then you can simply let it restart cmci. If
> > the storm on this cpu (or node) still exists then it will notice and
> > everything falls in place.
> 
> I have tested several different scenarios. If the storm on this cpu
> still exists, it triggers the CMCI and broadcasts it to the sibling CPUs,
> which means the counter *cmci_storm_on_cpus* will increase beyond
> the upper limit. E.g. on a 2-socket SandyBridge-EP system (one socket
> has 8 cores and 16 threads), inject one error on one socket and you can
> watch *cmci_storm_on_cpus* = 16 because of the CMCI broadcast. During
> this time, offline and online one CPU on this socket: first
> *cmci_storm_on_cpus* = 15 because of the offline and ACTIVE status, and then
> *cmci_storm_on_cpus* = 31 because CMCI is activated again by the
> online. That's why I have to disable CMCI during the whole online/offline
> cycle until the CMCI storm has subsided. Frankly, the logic is a little
> complex, so I wrote many comments to avoid forgetting it after some
> time :-)

This does not make any sense at all.

What you are saying is that even if CPU0 runs cmci_clear(), the CMCI
raised on CPU1 will cause the CMCI vector to be triggered on CPU0.

So how does the whole storm machinery work in the following case:

CPU0   	    	      CPU1
cmci incoming	      cmci incoming
     storm detected   	   no storm detected yet
     cmci_clear()
     switch to poll
		      
		      cmci raised

So according to your explanation that would cause the cmci vector to
be broadcast to CPU0 as well. Now that would cause the counter to
get a bogus increment, right?

So instead of hacking insane crap into the code, we simply have to do
the obvious Right Thing:

Index: linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -119,6 +119,9 @@ static bool cmci_storm_detect(void)
 	unsigned long ts = __this_cpu_read(cmci_time_stamp);
 	unsigned long now = jiffies;
 
+	if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
+		return true;
+
 	if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
 		cnt++;
 	} else {

That will prevent damage under all circumstances, cpu hotplug
included. 
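
To see why the early return is sufficient, here is a rough user-space
model of the detection path. It is only a sketch under stated assumptions:
it is not the kernel code, it ignores the CMCI_STORM_INTERVAL time window,
and the names storm_detect(), simulate(), NR_CPUS and STORM_THRESHOLD are
made up for illustration. The "guard" flag switches the check from the
patch above on and off.

```c
#include <stdbool.h>

#define NR_CPUS          2
#define STORM_THRESHOLD  5  /* hypothetical stand-in for CMCI_STORM_THRESHOLD */

enum storm_state { STORM_NONE, STORM_ACTIVE };

/* Plain arrays stand in for the per-cpu variables of the real code. */
static enum storm_state storm_state[NR_CPUS];
static unsigned int storm_cnt[NR_CPUS];
static int storm_on_cpus;   /* models the global cmci_storm_on_cpus counter */

/* Simplified detection: returns true when the cpu is in storm mode. */
static bool storm_detect(int cpu, bool guard)
{
	/* The check added by the patch: already in storm mode, nothing to do. */
	if (guard && storm_state[cpu] != STORM_NONE)
		return true;

	if (++storm_cnt[cpu] <= STORM_THRESHOLD)
		return false;

	storm_state[cpu] = STORM_ACTIVE;
	storm_on_cpus++;        /* bogus if the cpu was already ACTIVE */
	return true;
}

/*
 * Drive cpu 0 into storm mode, then deliver one more interrupt to it,
 * as a broadcast CMCI would. Returns the resulting global counter.
 */
int simulate(bool guard)
{
	int cpu, i;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		storm_state[cpu] = STORM_NONE;
		storm_cnt[cpu] = 0;
	}
	storm_on_cpus = 0;

	for (i = 0; i <= STORM_THRESHOLD; i++)
		storm_detect(0, guard);

	storm_detect(0, guard); /* stray interrupt while already in storm mode */
	return storm_on_cpus;
}
```

In this model simulate(false) ends with the counter at 2 although only
one cpu is in storm mode, while simulate(true) leaves it at 1. The same
reasoning covers a cpu that comes back online while the storm persists,
provided its state is handled as described in the quoted text above.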

But that's too simple and comprehensible I fear.

Thanks,

	tglx