lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070829212451.GC4810@frankl.hpl.hp.com>
Date:	Wed, 29 Aug 2007 14:24:51 -0700
From:	Stephane Eranian <eranian@....hp.com>
To:	Daniel Walker <dwalker@...sta.com>
Cc:	B.Steinbrink@....de, ak@...e.de, linux-kernel@...r.kernel.org,
	akpm@...ux-foundation.org, Stephane Eranian <eranian@....hp.com>
Subject: Re: nmi_watchdog=2 regression in 2.6.21

Daniel,

On Tue, Aug 28, 2007 at 01:13:44PM -0700, Daniel Walker wrote:
> On Tue, 2007-08-28 at 12:46 -0700, Stephane Eranian wrote:
> 
> > I think I found the problem. As I suspected, it seems there is an assymetry
> > between the 1st end 2nd counter (just like what they have on P6 core). Yet
> > for architectural perfmon v1, this restriction is supposed to be lifted.
> > 
> > Unfortunately, a quick look at the errata document at
> > http://download.intel.com/design/mobile/SPECUPDT/30922212.pdf
> > 
> > for Core Duo shows bugs A49 as 'nofix':
> >   Core Duo processor has a bug which renders the enable bit (22) of
> >   PERFEVTSEL1 inoperative. The processor behaves like former P6 cores,
> >   the enable bit of PERFEVTSEL0 controls the activation of both counters
> 
> Your patch switched the nmi from PERFEVTSEL0 to PERFEVTSEL1  (right?)..
> So 0 works, 1 does not , for me anyway ..
> 
Yes that is what the patch was doing and it was for a specific reason. On
a Core 2 Duo (which uses the 2nd generation of the architectural PMU), some
useful features (such as PEBS) requiree the use of counter 0 so we cannot
give it away to NMI.

Now on Core Duo, there is no PEBS anyway, so it is okay to use counter 0
for NMI. The problem is that the detection code in perfctr-watchdog.c
treats a Core Duo and a Core 2 Duo the same way as they both have the
X86_FEATURE_ARCH_PERFMON bit set.

I have attached a patch with handle the case of the Core Duo. Unfortunately,
I do not own one so I cannot test it. I would appreciate if you could
try re-applying my counter 0 -> 1 patch + this new one to see if you
have the problem with the NMI getting stuck.

The patch below is probably still needed to handle the case where you get
stuck.

Thanks.

> > That explains why you get the 'NMI stuck' message when using PERFEVTSEL1.
> > I suspect when using PERFEVTSEL0, then NMI watchdog is not stuck. So it
> > is possible that in case NMI is stuck the code does not cleanly shutdown the
> > NMI interrupt and you get some spurious NMI interrupt later in the boot at
> > you are stuck because you are holding a lock. We need to look at the error
> > path of check_nmi_watchdog(). I glance through it and could not find the place
> > where the APIC vector is cleared.
> 
> As far as I can tell the check_nmi_watchdog() doesn't take locks, and it
> can't safely share a lock with the NMI .. 
> 
> The patch below fixes the hang (not the stuck NMI) .. Not totally sure
> why, but the cpus are stuck in a loop waiting for the endflag which
> never comes .. This also plays with the nmi hz which might do
> something.. /proc/interrupt doesn't show any nmi's either..
> 
> Daniel
> 
> Signed-off-by: Daniel Walker <dwalker@...sta.com>
> 
> Index: linux-2.6.22/arch/i386/kernel/nmi.c
> ===================================================================
> --- linux-2.6.22.orig/arch/i386/kernel/nmi.c	2007-08-15 00:51:12.000000000 +0000
> +++ linux-2.6.22/arch/i386/kernel/nmi.c	2007-08-28 20:15:04.000000000 +0000
> @@ -82,7 +82,7 @@ static __init void nmi_cpu_busy(void *da
>  static int __init check_nmi_watchdog(void)
>  {
>  	unsigned int *prev_nmi_count;
> -	int cpu;
> +	int cpu, ret = 0;
>  
>  	if ((nmi_watchdog == NMI_NONE) || (nmi_watchdog == NMI_DEFAULT))
>  		return 0;
> @@ -125,18 +125,18 @@ static int __init check_nmi_watchdog(voi
>  	if (!atomic_read(&nmi_active)) {
>  		kfree(prev_nmi_count);
>  		atomic_set(&nmi_active, -1);
> -		return -1;
> -	}
> +		printk("nmi malfunctioning.\n");
> +		ret = -1;
> +	} else 
> +		printk("OK.\n");
>  	endflag = 1;
> -	printk("OK.\n");
> -
>  	/* now that we know it works we can reduce NMI frequency to
>  	   something more reasonable; makes a difference in some configs */
>  	if (nmi_watchdog == NMI_LOCAL_APIC)
>  		nmi_hz = lapic_adjust_nmi_hz(1);
>  
>  	kfree(prev_nmi_count);
> -	return 0;
> +	return ret;
>  }
>  /* This needs to happen later in boot so counters are working */
>  late_initcall(check_nmi_watchdog);
> 

-- 

-Stephane

View attachment "coreduo.diff" of type "text/plain" (1084 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ