[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070901010053.GA29765@atjola.homenet>
Date: Sat, 1 Sep 2007 03:00:53 +0200
From: Björn Steinbrink <B.Steinbrink@....de>
To: Daniel Walker <dwalker@...sta.com>
Cc: eranian@....hp.com, ak@...e.de, linux-kernel@...r.kernel.org,
akpm@...ux-foundation.org
Subject: Re: nmi_watchdog=2 regression in 2.6.21
On 2007.08.31 17:24:46 -0700, Daniel Walker wrote:
> On Fri, 2007-08-31 at 20:06 +0200, Björn Steinbrink wrote:
>
>
> > > something to do with the nmi hertz adjustment that happens after
> > > check_nmi_watchdog() ..
> >
> > Hm hm, does the same thing (watchdog stuck after check) happen with
> > older kernels, ie. those before Stephane's changeset that made it use
> > PERFCTR1?
>
> I noticed the frequency gets turned down after check_nmi_watchdog() is
> called.. I think it's suppose to trigger once per second, but it's more
> like it updates randomly ..
It's once per second if the cpu is 100% busy, if it's just idling and
halted, the performance counters won't be increased.
> In older kernels it's very slow, but it's more consistent ..
With the same load on the box? Maybe some other changes caused the box
to behave differently (say, CFS), regarding eg. load distribution
amongst the cores.
>
> Here is some output ..
>
> morning-glory ~ # cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3
> 0: 103 0 0 0 IO-APIC-edge timer
> 1: 0 0 0 8 IO-APIC-edge i8042
> 4: 2320 0 0 1 IO-APIC-edge serial
> 8: 1 0 0 1 IO-APIC-edge rtc
> 12: 0 0 0 113 IO-APIC-edge i8042
> 14: 1143 0 0 10 IO-APIC-edge ide0
> 16: 227 0 0 1 IO-APIC-fasteoi uhci_hcd:usb2, eth0
> 18: 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1
> 19: 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb3
> 20: 0 0 0 1 IO-APIC-fasteoi acpi
> NMI: 150 168 124 121
> LOC: 6188 6189 6187 6184
> ERR: 0
> MIS: 0
> morning-glory ~ # cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3
> 0: 103 0 0 0 IO-APIC-edge timer
> 1: 0 0 0 8 IO-APIC-edge i8042
> 4: 2391 0 0 1 IO-APIC-edge serial
> 8: 1 0 0 1 IO-APIC-edge rtc
> 12: 0 0 0 113 IO-APIC-edge i8042
> 14: 1143 0 0 10 IO-APIC-edge ide0
> 16: 872 0 0 1 IO-APIC-fasteoi uhci_hcd:usb2, eth0
> 18: 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1
> 19: 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb3
> 20: 0 0 0 1 IO-APIC-fasteoi acpi
> NMI: 151 168 124 121
> LOC: 21443 21444 21442 21439
> ERR: 0
> MIS: 0
> dwalker2 ~ #
>
>
> If you look at the LOC values you'll notice a lot of time has passed,
> with only one NMI and on only one cpu ..
>
> It's possible this is something else completely tho ..
At least from the interrupt side, that box looks pretty idle, so that's
expected I'd say.
>
> > Maybe you could "activate" the Dprintk in write_watchdog_counter32() to
> > see which value gets written to the MSR? (I don't see any switch to
> > activate it, so maybe just s/Dprintk(/printk(KERN_WHATEVER / ?)
>
> Here's the only lines printed,
>
> setting INTEL_ARCH_PERFCTR0 to -0x0131385e
> setting INTEL_ARCH_PERFCTR0 to -0x0131385e
> setting INTEL_ARCH_PERFCTR0 to -0x0131385e
> setting INTEL_ARCH_PERFCTR0 to -0x0131385e
Ok, dumb I am.
The "interesting" call from p6_rearm passes NULL as desc, and thus the
printk is never called that way. But before we start flooding your logs,
could you just hog all cores with a simple cpu hog and check if that
causes the NMI counter to increase at about 1 Hz? If that doesn't work,
we can go back to that debug output.
Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists