Message-ID: <CALAqxLX7ad_B82A9=O30v0PVqq2LKiSF7wOkGOytC=SuBh7wcg@mail.gmail.com>
Date: Fri, 2 Jan 2015 16:27:19 -0800
From: John Stultz <john.stultz@...aro.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Dave Jones <davej@...emonkey.org.uk>,
Thomas Gleixner <tglx@...utronix.de>, Chris Mason <clm@...com>,
Mike Galbraith <umgwanakikbuti@...il.com>,
Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Dâniel Fraga <fragabr@...il.com>,
Sasha Levin <sasha.levin@...cle.com>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Suresh Siddha <sbsiddha@...il.com>,
Oleg Nesterov <oleg@...hat.com>,
Peter Anvin <hpa@...ux.intel.com>
Subject: Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
> On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones <davej@...emonkey.org.uk> wrote:
>> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>>
>> > One thing I think I'll try is to try and narrow down which
>> > syscalls are triggering those "Clocksource hpet had cycles off"
>> > messages. I'm still unclear on exactly what is doing
>> > the stomping on the hpet.
>>
>> First I ran trinity with "-g vm" which limits it to use just
>> a subset of syscalls, specifically VM related ones.
>> That triggered the messages. Further experiments revealed:
>
> So I can trigger the false positives with my original patch quite
> easily by just putting my box under some load. My numbers are nowhere
> near as bad as yours, but then, I didn't put it under as much load
> anyway. Just a regular "make -j64" of the kernel.
>
> I suspect your false positives are bigger partly because of the load,
> but mostly because you presumably have preemption enabled too. I don't
> do preemption in my normal kernels, and that limits the damage of the
> race a bit.
>
> I have a newer version of the patch that gets rid of the false
> positives with some ordering rules instead, and just for you I hacked
> it up to say where the problem happens too, but it's likely too late.
>
> The fact that the original racy patch seems to make a difference for
> you does say that yes, we seem to be zeroing in on the right area
> here, but I'm not seeing what's wrong. I was hoping for big jumps from
> your HPET, since your "TSC unstable" messages do kind of imply that
> such really big jumps can happen.
>
> I'm attaching my updated hacky patch, although I assume it's much too
> late for that machine. Don't look too closely at the backtrace
> generation part, that's just a quick hack, and only works with frame
> pointers enabled anyway.
>
> So I'm still a bit unhappy about not figuring out *what* is wrong. And
> I'd still like the dmidecode from that machine, just for posterity. In
> case we can figure out some pattern.
>
> So right now I can imagine several reasons:
>
> - actual hardware bug.
>
> This is *really* unlikely, though. It should hit everybody. The
> HPET is in the core intel chipset, we're not talking random unusual
> hardware by fly-by-night vendors here.
>
> - some SMM/BIOS "power management" feature.
>
> We've seen this before, where the SMM saves/restores the TSC on
> entry/exit in order to hide itself from the system. I could imagine
> similar code for the HPET counter. SMM writers use some bad drugs to
> dull their pain.
>
> And with the HPET counter, since it's not even per-CPU, the "save
> and restore HPET" will actually show up as "HPET went backwards" to
> the other non-SMM CPUs if it happens.
>
> - a bug in our own clocksource handling.
>
> I'm not seeing it. But maybe my patch hides it for some magical reason.
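(Worth spelling out why that save/restore case is so ugly: the
timekeeping code computes deltas with an unsigned masked subtraction,
so a counter that steps backwards even slightly reads as an almost
full-mask forward jump, not a small negative one. A minimal userspace
sketch of that arithmetic follows; the mask and sample values are
illustrative, and the helper is a simplified stand-in for the kernel's
masked-delta computation, not the actual code.)

/* Userspace sketch of masked-delta arithmetic, delta = (now - last) & mask.
 * Shows how a counter that steps *backwards* reads as a huge forward jump.
 * HPET_MASK and the sample values are illustrative.
 */
#include <stdio.h>
#include <stdint.h>

#define HPET_MASK 0xffffffffULL	/* HPET main counter is 32 bits wide */

static uint64_t masked_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;
}

int main(void)
{
	uint64_t last = 1000000;

	/* Normal case: counter moved forward by 5000 cycles. */
	printf("forward:  delta = %llu\n",
	       (unsigned long long)masked_delta(last + 5000, last, HPET_MASK));

	/* Save/restore case: counter appears to move *backwards* by
	 * 5000 cycles; the masked subtraction wraps to nearly 2^32. */
	printf("backward: delta = %llu\n",
	       (unsigned long long)masked_delta(last - 5000, last, HPET_MASK));

	return 0;
}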
So I sent out a first-step validation check to warn us if we end up
with idle periods that are larger than we expect.

It doesn't yet cap the timekeeping_get_ns() output (like your patch
effectively does), but it would be easy to do that in a following
patch.
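For the record, the capping I have in mind is just clamping the cycle
delta to the clocksource's precomputed maximum before the mult/shift
conversion. Something like the following userspace sketch, where the
mask, mult, shift and max_cycles values are made up for illustration,
and which simplifies away the xtime_nsec accumulation the real
timekeeping_get_ns() does:

/* Sketch of capping a cycle delta before the mult/shift conversion.
 * All numeric values here are illustrative, not real HPET parameters.
 */
#include <stdio.h>
#include <stdint.h>

struct fake_clocksource {
	uint64_t mask;		/* counter width mask */
	uint32_t mult;		/* cycles -> ns multiplier */
	uint32_t shift;		/* cycles -> ns shift */
	uint64_t max_cycles;	/* largest delta that can't overflow 64 bits */
};

static uint64_t capped_cyc2ns(const struct fake_clocksource *cs,
			      uint64_t now, uint64_t last)
{
	uint64_t delta = (now - last) & cs->mask;

	/* Cap instead of letting delta * mult overflow; a huge delta
	 * here means the counter misbehaved or we idled too long. */
	if (delta > cs->max_cycles) {
		fprintf(stderr, "capping delta %llu -> %llu\n",
			(unsigned long long)delta,
			(unsigned long long)cs->max_cycles);
		delta = cs->max_cycles;
	}
	return (delta * cs->mult) >> cs->shift;
}

int main(void)
{
	struct fake_clocksource cs = {
		.mask		= 0xffffffffULL,
		.mult		= 10 << 24,	/* ~10ns/cycle, shift 24 */
		.shift		= 24,
		.max_cycles	= 1ULL << 31,
	};

	printf("%llu ns\n",
	       (unsigned long long)capped_cyc2ns(&cs, 3000000000ULL, 1000));
	return 0;
}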
I did notice while testing this that the max_idle_ns (the max idle
time we report to the scheduler) for the hpet is only ~16 seconds, and
we'll overflow after just ~21 seconds. That second number maps closely
to the 22-second stalls seen in the NMI watchdog reports, which seems
interesting, but I also realize that qemu uses a 100MHz hpet, whereas
real hardware is likely to be a bit slower, so maybe that's just
chance.
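For anyone who wants to sanity-check that arithmetic, here's a quick
userspace calculation of the two horizons for a 100MHz 32-bit hpet,
in the style of the conservative max_cycles estimate the clocksource
core uses. The shift value (and thus mult) is an assumed plausible
pair, not necessarily what a given kernel computes, so the output only
lands in the same ballpark as the ~16/~21 second numbers above:

/* Back-of-envelope horizons for a 100MHz, 32-bit HPET.
 * mult/shift are assumed values; real kernels compute their own pair.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t freq = 100000000ULL;	/* 100MHz (qemu's hpet) */
	const uint64_t mask = 0xffffffffULL;	/* 32-bit main counter */
	const uint32_t shift = 28;		/* assumed */
	const uint64_t mult = (1000000000ULL << shift) / freq; /* ~10ns/cycle */

	/* Counter wrap: reading less often than this aliases. */
	printf("counter wrap: %.1f sec\n", (double)(mask + 1) / freq);

	/* Conservative cap so delta * mult can't overflow 64 bits:
	 * 2^(63 - (ilog2(mult) + 1)) cycles, clamped to the mask. */
	uint64_t max_cycles = 1ULL << (63 - (64 - __builtin_clzll(mult)));
	if (max_cycles > mask)
		max_cycles = mask;
	printf("mult overflow horizon: %.1f sec\n", (double)max_cycles / freq);

	/* max_idle_ns keeps a safety margin below that horizon. */
	uint64_t max_nsecs = (max_cycles * mult) >> shift;
	printf("max idle w/ 12.5%% margin: %.1f sec\n",
	       (double)(max_nsecs - (max_nsecs >> 3)) / 1e9);
	return 0;
}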
I'd be interested if folks seeing anything similar to what Dave hit
would give my patch a shot.
thanks
-john