linux-kernel - Re: frequent lockups in 3.18rc4

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 22 Dec 2014 11:47:37 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Dave Jones <davej@...emonkey.org.uk>,
	Thomas Gleixner <tglx@...utronix.de>, Chris Mason <clm@...com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Dâniel Fraga <fragabr@...il.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Suresh Siddha <sbsiddha@...il.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Peter Anvin <hpa@...ux.intel.com>,
	John Stultz <john.stultz@...aro.org>
Subject: Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> This is *not* to say that this is the bug you're hitting. But it does show that
>
>  (a) a flaky HPET can do some seriously bad stuff
>  (b) the kernel is very fragile wrt time going backwards.
>
> and maybe we can use this test program to at least try to alleviate problem (b).

Ok, so after several false starts (ktime_get() is really really
fragile - called in scheduler things, and doing magic things at
bootup), here is something that seems to alleviate the problem for me.

I still get a lot of RCU  messages like "self-detected stall" etc, but
that's to be expected. When the clock does odd things, crap *will*
happen.

But what this does is:

 (a) make the error more visible as a clock error rather than various
random downstream users

     IOW, it prints things out when it looks like we're getting odd
clock read errors (arbitrary cut-off: we expect clock read-outs to be
withing 1/8th of the range of the expected clock value)

 (b) try to alleviate the horrible things that happen when the clock
error is big

     The patch tries to "correct" for the huge time jump by basically
undoing it. We'll still see time jumps (there really is no way to
avoid it), but we limit the range of them.

With the attached patch, my machine seems to survive me writing to the
HPET master counter register. It spews warnings, and it is noisy about
the odd clock reads:

    ...
    Clocksource hpet had cycles off by 642817751
    Cutting it too close for hpet in in update_wall_time (offset = 4034102337)
    INFO: rcu_sched self-detected stall on CPU { 0}  (t=281743 jiffies
g=4722 c=4721 q=14)
    ...

and there may still be situations where it does horrible horrible
things due to the (smaller) time leaps, but it does seem a lot more
robust.

NOTE! There's an (intentional) difference in how we handle the time
leaps at time read time vs write (wall-clock update).

At time read time, we just refuse to believe the big delta, and we set
the "cycle_error" value so that future time reads will be relative to
the error we just got. We also don't print anything out, because we're
possibly deep in the scheduler or in tracing, and we do not want to
spam the console about our fixup.

At time *write* time, we first report about any read-time errors, and
then we report (but believe in) overlarge clocksource delta errors as
we update the time.

This seems to be the best way to limit the damage.

Also note that the patch is entirely clock-agnostic. It's just that I
can trivially screw up my HPET, I didn't look at other clocks.

One note: my current limit of clocksource delta errors is based on the
range of the clock (1/8th of the range). I actually think that's
bogus, and it should instead be based on the expected frequency of the
clock (ie "we are guaranteed to update the wall clock at least once
every second, so if the clock source delta read is larger than one
second, we've done something wrong"). So this patch is meant very much
as an RFC, rather than anything else. It's pretty hacky. But it does
actually make a huge difference for me wrt the "mess up HPET time on
purpose". That used to crash my machine pretty hard, and pretty
reliably. With this patch, I've done it ten+ times, and while it spews
a lot of garbage, the machine stays up and _works_.

Making the sanity check tighter (ie the "one second" band rather than
"1/8th of the clock range") would probably just improve it further.

Thomas, what do you think? Hate it? Any better ideas?

And again: this is not trying to make the kernel clock not jump. There
is no way I can come up with even in theory to try to really *fix* a
fundamentally broken clock.

So this is not meant to be a real "fix" for anything, but is meant to
make sure that if the clock is unreliable, we pinpoint the clock
itself, and it mitigates the absolutely horrendously bad behavior we
currently with bad clocks. So think of this as debug and band-aid
rather than "this makes clocks magically reliable".

.. and we might still lock up under some circumstances. But at least
from my limited testing, it is infinitely much better, even if it
might not be perfect. Also note that my "testing" has been writing
zero to the HPET lock (so the HPET clock difference tends to be pretty
specific), while my next step is to see what happens when I write
random values (and a lot of them).

Since I expect that to cause more problems, I thought I'd send this
RFC out before I start killing my machine again ;)

                             Linus

View attachment "patch.diff" of type "text/plain" (3206 bytes)