linux-kernel - Re: frequent lockups in 3.18rc4

Message-ID: <CANcMJZA1NXUMLEveA2k5njFnWc1pCSwSeAOQH7VQCOR-QtiR8g@mail.gmail.com>
Date:	Wed, 3 Dec 2014 11:56:25 -0800
From:	John Stultz <john.stultz@...aro.org>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Dave Jones <davej@...hat.com>, Chris Mason <clm@...com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Dâniel Fraga <fragabr@...il.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: frequent lockups in 3.18rc4

On Wed, Dec 3, 2014 at 11:25 AM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
> On Wed, Dec 3, 2014 at 11:00 AM, Dave Jones <davej@...hat.com> wrote:
>>
>> So right after sending my last mail, I rebooted, and restarted the run
>> on the same kernel again.
>>
>> As I was writing this mail, this happened.
>>
>> [  524.420897] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c178:20182]
>>
>> and that's all that made it over the console. I couldn't log in via ssh,
>> and thought "ah-ha, so it IS bad".  I walked over to reboot it, and
>> found I could actually log in on the console. check out this dmesg..
>>
>> [  503.683055] Clocksource tsc unstable (delta = -95946009388 ns)
>> [  503.692038] Switched to clocksource hpet
>> [  524.420897] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c178:20182]
>
> Interesting. That whole NMI watchdog thing happens pretty much 22s
> after the "TSC unstable" message.
>
> Have you ever seen that TSC issue before? The watchdog relies on
> comparing get_timestamp() differences, so if the timestamp was
> incorrect...
>
> Maybe that whole "clocksource_watchdog()" is bogus. That delta is
> about 96 seconds, sounds very odd. I'm not seeing how the TSC could
> actually screw up that badly, so I'd almost be more likely to blame the
> "watchdog" clock.
>
> I don't know. This piece of code:
>
>         delta = clocksource_delta(wdnow, cs->wd_last, watchdog->mask);
>
> makes no sense to me. Shouldn't it be
>
>         delta = clocksource_delta(wdnow, watchdog->wd_last, watchdog->mask);

We store the wdnow value in cs->wd_last a few lines below, so I
don't think that's problematic.
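
To make the bookkeeping concrete, here is a minimal sketch of one watchdog
pass. It is a simplified model, not the kernel code: struct cs_model and
check_once() are hypothetical names, and both counters are assumed to tick at
the same rate so the cycles-to-nanoseconds conversion can be skipped. The
point is only that each clocksource keeps its own copy of the last watchdog
reading, which is why the delta is taken against cs->wd_last.

```c
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for struct clocksource. */
struct cs_model {
    uint64_t mask;     /* counter width mask */
    uint64_t cs_last;  /* this clocksource's own reading at its last check */
    uint64_t wd_last;  /* the watchdog's reading at that same check */
};

/* Wrap-safe difference of two counter readings within the given mask. */
static uint64_t masked_delta(uint64_t now, uint64_t last, uint64_t mask)
{
    return (now - last) & mask;
}

/*
 * One watchdog pass over one clocksource. Returns how many more cycles the
 * clocksource advanced than the watchdog did (same tick rate assumed).
 */
static int64_t check_once(struct cs_model *cs, const struct cs_model *wd,
                          uint64_t csnow, uint64_t wdnow)
{
    /* Watchdog interval is measured from the reading saved in cs. */
    uint64_t wd_delta = masked_delta(wdnow, cs->wd_last, wd->mask);
    uint64_t cs_delta = masked_delta(csnow, cs->cs_last, cs->mask);

    /* Save both readings for the next pass, after the deltas are taken. */
    cs->cs_last = csnow;
    cs->wd_last = wdnow;

    return (int64_t)(cs_delta - wd_delta);
}
```

Because wdnow is written back into cs->wd_last at the end of every pass, the
next pass measures the watchdog interval for that particular clocksource.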

I do recall seeing problematic watchdog behavior back in the day w/
PREEMPT_RT when a high priority task really starved the watchdog for a
long time. When we came back the hpet had wrapped, making the wd_delta
look quite small relative to the TSC delta, causing improper
disqualification of the TSC.
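
A rough sketch of that wrap effect, under assumptions that are not from this
thread: a 32-bit watchdog counter running at 14.31818 MHz (a common HPET
rate), which wraps roughly every 300 seconds. The helper names are
hypothetical and masked_delta() only models the masking idea.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values, assumed for this sketch. */
#define WD_MASK     0xffffffffULL  /* 32-bit watchdog counter */
#define WD_FREQ_HZ  14318180ULL    /* assumed HPET tick rate */

/* Wrap-safe difference of two counter readings within the given mask. */
static uint64_t masked_delta(uint64_t now, uint64_t last, uint64_t mask)
{
    return (now - last) & mask;
}

/* Convert counter ticks to whole seconds (truncating). */
static uint64_t ticks_to_sec(uint64_t ticks, uint64_t freq_hz)
{
    assert(freq_hz > 0);
    return ticks / freq_hz;
}

/* Watchdog counter value after elapsed_sec of real time has passed. */
static uint64_t wd_read_after(uint64_t start, uint64_t elapsed_sec)
{
    return (start + elapsed_sec * WD_FREQ_HZ) & WD_MASK;
}

/* Seconds the watchdog appears to have advanced over a real interval. */
static uint64_t apparent_wd_sec(uint64_t start, uint64_t elapsed_sec)
{
    uint64_t now = wd_read_after(start, elapsed_sec);

    return ticks_to_sec(masked_delta(now, start, WD_MASK), WD_FREQ_HZ);
}
```

With these numbers, apparent_wd_sec(0, 10) is 10, but a 400 second starvation
gives apparent_wd_sec(0, 400) == 100: one full wrap is lost, so the watchdog
interval looks much shorter than the TSC interval covering the same period.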

But in that case the watchdog would disqualify the TSC after the
stall, and here the stall is happening right afterwards. So I'm not
sure.

I'll look around for some other suspects though. The nohz ntp
improvements might be high on my list there, since it was a 3.17 item.
Will dig.

thanks
-john