Message-Id: <1188325835.2435.317.camel@dhcp193.mvista.com>
Date:	Tue, 28 Aug 2007 11:30:35 -0700
From:	Daniel Walker <dwalker@...sta.com>
To:	eranian@....hp.com
Cc:	Björn Steinbrink <B.Steinbrink@....de>,
	ak@...e.de, linux-kernel@...r.kernel.org, akpm@...ux-foundation.org
Subject: Re: nmi_watchdog=2 regression in 2.6.21

On Tue, 2007-08-28 at 10:05 -0700, Stephane Eranian wrote:
> Daniel,
> 
> On Tue, Aug 28, 2007 at 07:34:44AM -0700, Daniel Walker wrote:
> > On Tue, 2007-08-28 at 02:12 -0700, Stephane Eranian wrote:
> > > Daniel,
> > > 
> > > On Mon, Aug 27, 2007 at 04:07:54PM -0700, Daniel Walker wrote:
> > > > On Mon, 2007-08-27 at 15:55 -0700, Stephane Eranian wrote:
> > > > 
> > > > > Yet the model name looks strange. So we need to run one more test,
> > > > > as the fam/model is not enough. What we need to check is whether or
> > > > > not this processor implements architectural perfmon or not.
> > > > > 
> > > > > Could you please compile and run the attached program and send me 
> > > > > the output?
> > > > 
> > > > The output below is all the output ..
> > > > 
> > > > eax=0x7280201: version=1  num_cnt=2
> > > > 
> > > Then you have a Core Duo processor and the commit from Bjorn should
> > > fix the problem. If it does not, then there is something else wrong.
> > > Unfortunately, I do not have a Core Duo machine to try and reproduce.
> > 
> > There must be something else wrong, cause the problem persists .. As I
> > said in past emails to Bjorn, I tested his commit in git, as well as the
> > latest git all with the same issue (as well as bisecting git)..
> > 
> > If the hardware is buggy then we need some way to determine that..
> > 
> Could you instrument check_nmi_watchdog() to verify that you terminate
> this function? Normally there is a safety mechanism in there.
> 
> Another  possibility is that you get flooded with NMI interrupts and
> do not make forward progress.

Here is some output from my boot log:

md: raid1 personality registered for level 1
device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: dm-devel@...hat.com
ip_tables: (C) 2000-2006 Netfilter Core Team
TCP cubic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
Testing NMI watchdog ... CPU#0: NMI appears to be stuck (0->0)!
CPU#1: NMI appears to be stuck (0->0)!
CPU#2: NMI appears to be stuck (0->0)!
CPU#3: NMI appears to be stuck (0->0)!
Starting balanced_irq
<hangs>

So it does appear to make it out of the check_nmi_watchdog() function,
even though there is a problem with the watchdog. "Starting balanced_irq"
is in balanced_irq_init(). It usually hangs at this spot, but with
initcall_debug on it once hung a little further down.
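
The "(0->0)" in the log lines above reads as each CPU's NMI count before
and after the test delay. Here is a minimal userspace sketch of that kind of
check, not the kernel's actual code. The names count_stuck_cpus and
NMI_MIN_DELTA, and the threshold of 5, are assumptions for illustration:

```c
#include <stdio.h>

/* Assumed threshold: a CPU counts as "stuck" if it saw this many NMIs
 * or fewer during the test window. Hypothetical name and value. */
#define NMI_MIN_DELTA 5

/*
 * Compare per-CPU NMI counts sampled before and after a delay.
 * Prints one line per stuck CPU in the same format as the boot log
 * and returns how many CPUs looked stuck.
 */
static int count_stuck_cpus(const unsigned int *before,
                            const unsigned int *after, int ncpus)
{
    int stuck = 0;

    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (after[cpu] - before[cpu] <= NMI_MIN_DELTA) {
            printf("CPU#%d: NMI appears to be stuck (%u->%u)!\n",
                   cpu, before[cpu], after[cpu]);
            stuck++;
        }
    }
    return stuck;
}
```

With all four counters at 0 before and after, as in the log, this reports
all four CPUs as stuck and still returns, which matches the boot continuing
on to the next initcall.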

It seems like some other interrupt might be going off too early and
causing the system to hang.

As you can see from the log above, this system has four cores: two
physical CPUs, each with two cores.
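
For reference, the eax=0x7280201 value quoted earlier matches the layout of
CPUID leaf 0xA (architectural perfmon), where the low byte is the version
and the next byte is the number of general-purpose counters. Here is a small
sketch of that decoding. It is not the attached test program, and the struct
and function names are made up for illustration:

```c
#include <stdint.h>

/* Fields of CPUID.0AH:EAX. Hypothetical struct name. */
struct perfmon_info {
    unsigned int version;       /* bits  7:0  version ID */
    unsigned int num_counters;  /* bits 15:8  general-purpose counters */
    unsigned int counter_width; /* bits 23:16 counter bit width */
    unsigned int ebx_len;       /* bits 31:24 length of the EBX bit vector */
};

/* Split a raw EAX value from CPUID leaf 0xA into its four byte fields. */
static struct perfmon_info decode_cpuid_0a_eax(uint32_t eax)
{
    struct perfmon_info info;

    info.version       = eax & 0xff;
    info.num_counters  = (eax >> 8) & 0xff;
    info.counter_width = (eax >> 16) & 0xff;
    info.ebx_len       = (eax >> 24) & 0xff;
    return info;
}
```

Decoding 0x7280201 this way gives version 1 and 2 counters, which agrees
with the "version=1  num_cnt=2" output, plus a counter width of 40 bits.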

Daniel

