linux-kernel - Re: frequent lockups in 3.18rc4

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+55aFzj_2tYJ1b=_7eKSz7UMuUL+qZX39r0xq5vOZkB4JmHMA@mail.gmail.com>
Date:	Sat, 20 Dec 2014 10:25:40 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	Dave Jones <davej@...hat.com>, Chris Mason <clm@...com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Dâniel Fraga <fragabr@...il.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Suresh Siddha <sbsiddha@...il.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Peter Anvin <hpa@...ux.intel.com>
Subject: Re: frequent lockups in 3.18rc4

On Fri, Dec 19, 2014 at 5:57 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> I'm claiming that the race happened *once*. And it then corrupted some
> data structure or similar sufficiently that CPU0 keeps looping.
>
> Perhaps something keeps re-adding itself to the head of the timerqueue
> due to the race.

So tick_sched_timer() does

        ktime_t now = ktime_get();
        ...
        hrtimer_forward(timer, now, tick_period);
        return HRTIMER_RESTART;

and then __run_hrtimer does

        enqueue_hrtimer(timer, base);

which just adds the timer back on the tiemr heap.

So all you need to get an infinite loop (as far as I can see) is that
hrtimer_forward() doesn't actually move the timer forward.

The most likely reason would seem to be this:

        delta = ktime_sub(now, hrtimer_get_expires(timer));

        if (delta.tv64 < 0)
                return 0;

and clearly it *should* return a positive number, since the timer has
expired, so the expiry time _should_ be smaller than "now". So it
should never trigger, and this bug is clearly impossible.

HOWEVER.

It turns out that while tick_sched_timer() does "ktime_get()" to get
the current time, the actual timer machinery does *not* do that at
all. The actual timer machinery does

        entry_time = now = hrtimer_update_base(cpu_base);

                base = cpu_base->clock_base + i;
                basenow = ktime_add(now, base->offset);

_once_ per hrtimer_clock_base. And then it iterates using that
"basenow" thing, and  compares it to the timer expiry.

So we have two different times. Now, let's think about what happens if
those clocks aren't quote in sync.

We know (since __run_hrtimer was called) that

        basenow.tv64 > hrtimer_get_softexpires_tv64(timer)

but here we have "basenow" - which is not that ktime_get(), and we
have "hrtimer_get_softexpires_tv64()" (which is not
hrtimer_get_expires() in general - we have all that "delta" range
handling, but for the scheduling tick it *should* be the same).

So I can see at least one lockup:

 - if "expires < basenow" hrtimer_interrupt() will run the timer
 - if "now < expires" hrtimer_forward() will not do anything, and will
just reschedule the timer with the same expiration

iow, all you need for a re-arming of the same timer is:

   now < expires < basenow

now, the two clocks (now and basenow) are not the same, but they do
have the same *base*. So they are related, but even the base time was
gotten under two different sequence locks, so even the base could have
been updated in between the hrtimer_update_base() time and the
ktime_get(). And even though they have the same base, they have
different offsets: basenow does that "base->offset" thing (and
ktime_get_update_offsets_now() does timekeeping_get_ns()

 - now = ktime_get() does

                base = tk->tkr.base_mono;
                nsecs = timekeeping_get_ns(&tk->tkr);

 - basenow = ktime_get_update_offsets_now() does

                base = tk->tkr.base_mono;
                nsecs = timekeeping_get_ns(&tk->tkr);
          .. and then ..
                ktime_add(.., base->offset);

and if I read the thing right, the ktime_add() should be a no-op,
because base->offset should be 0 for the normal monotonic clock.
Right?

So the two times (now and basenow) *should* be the same time, and the
whole "now < expires < basenow" situation can never happen. Right?

Potentially wrong.

Because that's where the whole "different sequence locks" comes in.
The one-time race could be something that updates the base in between
the (one-time) ktime_get_update_offsets_now() and the (then as a
result pseudo-infinitely repeating) ktime_get.

Hmm? If "base" ever goes backwards, or if "base" does *not* update
atomically with the HPET timer overflows, I can see that happening. Of
course, that would imply that ktime_get() is not monotonic. But we do
know that we've had odd time issues on that machine.

I think you already had DaveJ check clock monotonicity. But that was
with the TSC, wasn't it? I'm claiming maybe the HPET isn't monotonic,
and there is some HPET clocksource issue with overflow in 32 bits.

(I think the HPET *should* be 64-bit, and just the comparators for
interrupts may be 32-bit, but we use a "readl()" and only use the low
32-bits even if the upper 32 bits *might* be ok).

I keep harping on that HPET overflow, because we actually have the 6
"locked up" HPET traces, and then we have a seventh without that
lockup, and there definitely was a overflow in 32 bits:

  [torvalds@i7 linux]$ grep -3 read_hpet+0x16 ~/dj-1.txt | grep RAX
   RAX: 0000000061fece8a RBX: 0000000000510792 RCX: 0000000000000000
   RAX: 0000000079e588fc RBX: 0000000000511d6e RCX: 0000000000000000
   RAX: 0000000091ca7f65 RBX: 0000000000513346 RCX: 0000000000000000
   RAX: 00000000a9afbd0d RBX: 000000000051491e RCX: 0000000000000000
   RAX: 00000000cbd1340c RBX: 000000000051684a RCX: 0000000000000000
   RAX: 00000000fb9d303f RBX: 00000000005193fc RCX: 0000000000000000
   RAX: 000000002b67efe4 RBX: 000000000051c224 RCX: 0000000000000004

and I have just gotten hung up on that small detail.

How/where is the HPET overflow case handled? I don't know the code enough.

(Also, maybe I shouldn't be so hung up on this *one* long trace from
DaveJ. There's been a lot of crazy traces from that machine. We've had
some time-handling questions about it before, but *most* of the traces
have not been implicating the HPET like this one, so..)

                            Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/