Date:	Fri, 2 Jan 2015 16:27:19 -0800
From:	John Stultz <john.stultz@...aro.org>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Dave Jones <davej@...emonkey.org.uk>,
	Thomas Gleixner <tglx@...utronix.de>, Chris Mason <clm@...com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Dâniel Fraga <fragabr@...il.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Suresh Siddha <sbsiddha@...il.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Peter Anvin <hpa@...ux.intel.com>
Subject: Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
> On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones <davej@...emonkey.org.uk> wrote:
>> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>>
>>  > One thing I think I'll try is to try and narrow down which
>>  > syscalls are triggering those "Clocksource hpet had cycles off"
>>  > messages.  I'm still unclear on exactly what is doing
>>  > the stomping on the hpet.
>>
>> First I ran trinity with "-g vm" which limits it to use just
>> a subset of syscalls, specifically VM related ones.
>> That triggered the messages. Further experiments revealed:
>
> So I can trigger the false positives with my original patch quite
> easily by just putting my box under some load. My numbers are nowhere
> near as bad as yours, but then, I didn't put it under as much load
> anyway. Just a regular "make -j64" of the kernel.
>
> I suspect your false positives are bigger partly because of the load,
> but mostly because you presumably have preemption enabled too. I don't
> do preemption in my normal kernels, and that limits the damage of the
> race a bit.
>
> I have a newer version of the patch that gets rid of the false
> positives with some ordering rules instead, and just for you I hacked
> it up to say where the problem happens too, but it's likely too late.
>
> The fact that the original racy patch seems to make a difference for
> you does say that yes, we seem to be zeroing in on the right area
> here, but I'm not seeing what's wrong. I was hoping for big jumps from
> your HPET, since your "TSC unstable" messages do kind of imply that
> such really big jumps can happen.
>
> I'm attaching my updated hacky patch, although I assume it's much too
> late for that machine. Don't look too closely at the backtrace
> generation part, that's just a quick hack, and only works with frame
> pointers enabled anyway.
>
> So I'm still a bit unhappy about not figuring out *what* is wrong. And
> I'd still like the dmidecode from that machine, just for posterity. In
> case we can figure out some pattern.
>
> So right now I can imagine several reasons:
>
>  - actual hardware bug.
>
>    This is *really* unlikely, though. It should hit everybody. The
> HPET is in the core intel chipset, we're not talking random unusual
> hardware by fly-by-night vendors here.
>
>  - some SMM/BIOS "power management" feature.
>
>    We've seen this before, where the SMM saves/restores the TSC on
> entry/exit in order to hide itself from the system. I could imagine
> similar code for the HPET counter. SMM writers use some bad drugs to
> dull their pain.
>
>    And with the HPET counter, since it's not even per-CPU, the "save
> and restore HPET" will actually show up as "HPET went backwards" to
> the other non-SMM CPUs if it happens.
>
>  - a bug in our own clocksource handling.
>
>    I'm not seeing it. But maybe my patch hides it for some magical reason.

So I sent out a first step validation check to warn us if we end up
with idle periods that are larger than we expect.

It doesn't yet cap the timekeeping_get_ns() output (like your patch
effectively does), but it would be easy to do that in a following
patch.

I did notice while testing this that the max_idle_ns (the max idle
time we report to the scheduler) for the hpet is only ~16 seconds, and
we'll overflow after just ~21 seconds. This second number maps closely
to the 22-second stalls seen in the NMI watchdog reports, which seems
interesting, but I also realize that qemu uses a 100MHz hpet, whereas
real hardware is likely to be a bit slower, so maybe that's just
chance.

I'd be interested if folks who are seeing anything similar to what
Dave reported would give my patch a shot.

thanks
-john
--
