linux-kernel - Re: frequent lockups in 3.18rc4

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CA+55aFy8DCMmkPRz0kqNC80pn4VkeQr2Wz2fTRm=32oH2dhfRQ@mail.gmail.com>
Date:	Sun, 21 Dec 2014 13:22:21 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	Dave Jones <davej@...hat.com>, Chris Mason <clm@...com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Dâniel Fraga <fragabr@...il.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Suresh Siddha <sbsiddha@...il.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Peter Anvin <hpa@...ux.intel.com>
Subject: Re: frequent lockups in 3.18rc4

On Sat, Dec 20, 2014 at 1:16 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
> can't see anything wrong. The whole "update_wall_clock" and the shadow
> timekeeping state is confusing as hell, but seems fine. We'd have to
> avoid update_wall_clock for a *long* time for overflows to occur.
>
> And the overflow in 32 bits isn't that special, since the only thing
> that really matters is the overflow of "cycle_now - tkr->cycle_last"
> within the mask.
>
> So I'm not seeing anything even halfway suspicious.

.. of course, this reminds me of the "clocksource TSC unstable" issue.

Then *simple* solution may actually be that the HPET itself is
buggered. That would explain both the "clocksource TSC unstable"
messages _and_ the "time went backwards, so now we're re-arming the
scheduler tick 'forever' until time has gone forwards again".

And googling for this actually shows other people seeing similar
issues, including hangs after switching to another clocksource. See
for example

   http://stackoverflow.com/questions/13796944/system-hang-with-possible-relevance-to-clocksource-tsc-unstable

which switches to acpi_pm (not HPET) and then hangs afterwards.

Of course, it may be the switching itself that causes some issue.

Btw, there's another reason to think that it's the HPET, I just realized.

DaveJ posted all his odd TSC unstable things, and the delta was pretty
damn random. But it did have a range: it was in the 1-251 second
range.

With a 14.318MHz clock (which is, I think, the normal HPET frequency),
a 32-bit overflow happens in about 300 seconds.

So the range of 1-251 seconds  is not entirely random. It's all in
that "32-bit HPET range".

In contrast, wrt the TSC frequency, that kind of range makes no sense at all.

                          Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/