Message-ID: <CA+55aFyySYFJjV5689Ej+Rer_igR0R1S5=h0GwWpWpA6Tk1Okw@mail.gmail.com>
Date:	Fri, 26 Dec 2014 12:57:07 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Dave Jones <davej@...emonkey.org.uk>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>, Chris Mason <clm@...com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Dâniel Fraga <fragabr@...il.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Suresh Siddha <sbsiddha@...il.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Peter Anvin <hpa@...ux.intel.com>,
	John Stultz <john.stultz@...aro.org>
Subject: Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones <davej@...emonkey.org.uk> wrote:
> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>
>  > One thing I think I'll try is to try and narrow down which
>  > syscalls are triggering those "Clocksource hpet had cycles off"
>  > messages.  I'm still unclear on exactly what is doing
>  > the stomping on the hpet.
>
> First I ran trinity with "-g vm", which limits it to just
> a subset of syscalls, specifically VM-related ones.
> That triggered the messages. Further experiments revealed:

So I can trigger the false positives with my original patch quite
easily by just putting my box under some load. My numbers are nowhere
near as bad as yours, but then, I didn't put it under as much load
anyway. Just a regular "make -j64" of the kernel.

I suspect your false positives are bigger partly because of the load,
but mostly because you presumably have preemption enabled too. I don't
do preemption in my normal kernels, and that limits the damage of the
race a bit.
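
Roughly, the racy pattern looks like this (a minimal userspace
analogue, with clock_gettime() standing in for the TSC/HPET pair --
this is a sketch of the failure mode, not the actual kernel watchdog
code):

#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Read a clock as nanoseconds. */
static uint64_t read_ns(clockid_t id)
{
        struct timespec ts;
        clock_gettime(id, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
        /* Two clocks standing in for TSC and HPET. */
        uint64_t last_a = read_ns(CLOCK_MONOTONIC_RAW);
        uint64_t last_b = read_ns(CLOCK_MONOTONIC);

        for (;;) {
                uint64_t a = read_ns(CLOCK_MONOTONIC_RAW);
                /* Preemption *here* inflates delta_b against delta_a,
                 * and the check below fires with nothing wrong. */
                uint64_t b = read_ns(CLOCK_MONOTONIC);

                int64_t skew = (int64_t)((b - last_b) - (a - last_a));
                if (skew > 1000000 || skew < -1000000)  /* 1ms */
                        printf("apparent jump: %lld ns\n", (long long)skew);
                last_a = a;
                last_b = b;
        }
}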

I have a newer version of the patch that gets rid of the false
positives with some ordering rules instead, and just for you I hacked
it up to say where the problem happens too, but it's likely too late.
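
The ordering rules amount to something like this (a guess at the shape
of the fix, not the patch itself): bracket the watched clock with two
reads of the reference clock, and throw the sample away if the bracket
is too wide:

/* Drop-in replacement for the paired reads in the loop above. */
static void sample(uint64_t *ref, uint64_t *watched)
{
        uint64_t a1, b, a2;
        do {
                a1 = read_ns(CLOCK_MONOTONIC_RAW);
                b  = read_ns(CLOCK_MONOTONIC);
                a2 = read_ns(CLOCK_MONOTONIC_RAW);
                /* If we were preempted or delayed between the reads,
                 * a2 - a1 blows up and we just try again. */
        } while (a2 - a1 > 50000);      /* >50us: discard the sample */
        *ref = a1;
        *watched = b;
}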

The fact that the original racy patch seems to make a difference for
you does say that yes, we seem to be zeroing in on the right area
here, but I'm not seeing what's wrong. I was hoping for big jumps from
your HPET, since your "TSC unstable" messages do kind of imply that
such really big jumps can happen.

I'm attaching my updated hacky patch, although I assume it's much too
late for that machine. Don't look too closely at the backtrace
generation part; it's just a quick hack, and it only works with frame
pointers enabled anyway.

So I'm still a bit unhappy about not figuring out *what* is wrong. And
I'd still like the dmidecode from that machine, just for posterity, in
case we can figure out some pattern.

So right now I can imagine several reasons:

 - actual hardware bug.

   This is *really* unlikely, though. It should hit everybody. The
HPET is in the core Intel chipset; we're not talking random unusual
hardware from fly-by-night vendors here.

 - some SMM/BIOS "power management" feature.

   We've seen this before, where the SMM saves/restores the TSC on
entry/exit in order to hide itself from the system. I could imagine
similar code for the HPET counter. SMM writers use some bad drugs to
dull their pain.

   And with the HPET counter, since it's not even per-CPU, the "save
and restore HPET" will actually show up as "HPET went backwards" to
the other non-SMM CPUs if it happens (see the sketch after this list).

 - a bug in our own clocksource handling.

   I'm not seeing it. But maybe my patch hides it for some magical reason.

 - gremlins.
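
To see why save/restore of a counter that isn't per-CPU is visible to
everybody else, here's a toy two-thread version of that SMM scenario
(pure simulation, no real SMM or HPET anywhere near it):

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <stdatomic.h>
#include <unistd.h>

static _Atomic uint64_t counter;        /* the shared "HPET" */

/* The hardware, ticking away. */
static void *ticker(void *arg)
{
        for (;;)
                atomic_fetch_add(&counter, 1);
}

/* The "SMM" handler: save on entry, restore the stale value on exit. */
static void *smm(void *arg)
{
        for (;;) {
                uint64_t saved = atomic_load(&counter);
                usleep(1000);                   /* time spent in SMM */
                atomic_store(&counter, saved);  /* restore: time rewinds */
                usleep(100000);
        }
}

int main(void)
{
        pthread_t t1, t2;
        pthread_create(&t1, NULL, ticker, NULL);
        pthread_create(&t2, NULL, smm, NULL);

        /* The innocent non-SMM CPU, just reading the clock. */
        uint64_t last = 0;
        for (;;) {
                uint64_t now = atomic_load(&counter);
                if (now < last)
                        printf("counter went backwards by %llu\n",
                               (unsigned long long)(last - now));
                last = now;
        }
}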

So I dunno. I hope more people will look at this after the holidays,
even if your machine is gone. My test-program to do bad things to the
HPET shows *something*, and works on any machine.
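
For reference, the general shape of such a test program (a sketch, not
the actual one; it assumes CONFIG_HPET_MMAP so /dev/hpet can be
mmap'd, and reads the main counter at its spec-mandated offset 0xf0):

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

int main(void)
{
        int fd = open("/dev/hpet", O_RDONLY);
        if (fd < 0) {
                perror("open /dev/hpet");
                return 1;
        }

        /* Map the HPET register page read-only. */
        volatile uint8_t *base = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Main counter lives at offset 0xf0 per the HPET spec.  Note a
         * 32-bit counter will wrap and look like a backwards jump, so
         * take the output with a grain of salt on such hardware. */
        volatile uint64_t *counter = (volatile uint64_t *)(base + 0xf0);
        uint64_t last = *counter;
        for (;;) {
                uint64_t now = *counter;
                if (now < last)
                        printf("HPET went backwards: %llu -> %llu\n",
                               (unsigned long long)last,
                               (unsigned long long)now);
                last = now;
        }
}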

                    Linus

[Attachment: patch.diff (text/plain, 5169 bytes)]
