Date:	Sat, 20 Dec 2014 02:00:56 +0100 (CET)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
cc:	Dave Jones <davej@...hat.com>, Chris Mason <clm@...com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Dâniel Fraga <fragabr@...il.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Suresh Siddha <sbsiddha@...il.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Peter Anvin <hpa@...ux.intel.com>
Subject: Re: frequent lockups in 3.18rc4

On Fri, 19 Dec 2014, Linus Torvalds wrote:
> On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
> > Now that all looks correct. So there is something else going on. After
> > staring some more at it, I think we are looking at it from the wrong
> > angle.
> >
> > The watchdog always detects CPU1 as stuck and we got completely
> > fixated on the csd_wait() in the stack trace on CPU1. Now we have
> > stack traces which show a different picture, i.e. CPU1 makes progress
> > after a gazillion of seconds.
> 
> .. but that doesn't explain why CPU0 ends up always being at that
> *exact* same instruction in the NMI backtrace.

Up to the point where it exposes different instructions and completely
different code paths. I'm not agreeing with your theory that, after
the RT throttler hit, the watchdog and everything else go completely
bonkers. And even before that happens we see a different backtrace on
CPU0:

[25149.982766] RIP: 0010:[<ffffffff810cf1db>]  [<ffffffff810cf1db>] invoke_rcu_core+0x2b/0x50

Though I have to admit that this "very same instruction" pattern
puzzled me for quite a while as well.

> While a fairly tight loop, together with "mmio read is very expensive
> and synchronizing" would explain it. An MMIO read can easily be as
> expensive as several thousand instructions.

The watchdog timer runs on a fully periodic schedule. It's
self-rearming via

	 hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
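
For reference, the callback is roughly the following (a simplified
sketch of the pattern in kernel/watchdog.c, not the exact 3.18 code):

	static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
	{
		/* ... kick the hard/soft lockup detectors ... */

		/*
		 * Push the expiry forward past 'now' in units of
		 * sample_period and ask the hrtimer core to restart
		 * the timer, i.e. strictly periodic self-rearming.
		 */
		hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));

		return HRTIMER_RESTART;
	}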

So if that aligns with the equally periodic tick interrupt on the
other CPU then you might get into that situation due to the fully
synchronizing and serializing nature of HPET reads.
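
With HPET as the clocksource, each ktime_get() boils down to an
uncached MMIO read of the free-running main counter, roughly the
sketch below (simplified from arch/x86/kernel/hpet.c):

	static cycle_t read_hpet(struct clocksource *cs)
	{
		/* Uncached, fully serializing MMIO read of the counter. */
		return (cycle_t)hpet_readl(HPET_COUNTER);
	}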

That can drift apart over time because the timer device (apic or that
new fangled tscdeadline timer) are not frequency corrected versus
timekeeping.

> > I think we really need to look at CPU1 itself.
> 
> Not so fast. Take another look at CPU0.
> 
> [24998.083577]  [<ffffffff810e0d3e>] ktime_get+0x3e/0xa0
> [24998.084450]  [<ffffffff810e9cd3>] tick_sched_timer+0x23/0x160
> [24998.085315]  [<ffffffff810daf96>] __run_hrtimer+0x76/0x1f0
> [24998.086173]  [<ffffffff810e9cb0>] ? tick_init_highres+0x20/0x20
> [24998.087025]  [<ffffffff810db2e7>] hrtimer_interrupt+0x107/0x260
> [24998.087877]  [<ffffffff81031a4b>] local_apic_timer_interrupt+0x3b/0x70
> [24998.088732]  [<ffffffff8179bca5>] smp_apic_timer_interrupt+0x45/0x60
> [24998.089583]  [<ffffffff8179a0df>] apic_timer_interrupt+0x6f/0x80
> [24998.090435]  <EOI>
> [24998.091279]  [<ffffffff810da66e>] ? __remove_hrtimer+0x4e/0xa0
> [24998.092118]  [<ffffffff812c7c7a>] ? ipcget+0x8a/0x1e0
> [24998.092951]  [<ffffffff812c7c6c>] ? ipcget+0x7c/0x1e0
> [24998.093779]  [<ffffffff812c8d6d>] SyS_msgget+0x4d/0x70
> 
> 
> Really. None of that changed. NONE. The likelihood that we hit the
> exact same instruction every time? Over a timeframe of more than a
> minute?
> 
> The only way I see that happening is (a) NMI is completely buggered,
> and the backtrace is just random crap that is always the same.  Or (b)
> it's really a fairly tight loop.
> 
> The fact that you had a hrtimer interrupt happen in the *middle* of
> __remove_hrtimer() is really another fairly strong hint. That smells
> like "__remove_hrtimer() has a race with hrtimer interrupts".

That __remove_hrtimer has a '?' in front of it. So it's not a reliable
trace entry.
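
The '?' means the address was found on the stack but could not be
verified against the frame pointer chain. Roughly, as a simplified
sketch of what the x86 stack dumper (arch/x86/kernel/dumpstack.c)
does when printing an entry:

	static void printk_stack_address(unsigned long address, int reliable)
	{
		/* Unverified entries get the "? " prefix. */
		printk(" [<%p>] %s%pB\n",
		       (void *)address, reliable ? "" : "? ", (void *)address);
	}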

There is NO hrtimer related operation in the msgget() syscall at all
and SyS_msgget() is the only reliable entry on that stack trace.

So that __remove_hrtimer operation happened before the msgget()
syscall and is just a stack artifact left over from
poll/select/nanosleep or whatever.

> And that race results in a basically endless loop (which perhaps ends
> when the HRtimer overflows, in what, a few minutes?)

hrtimers overflow in about 584 years.
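(Back of the envelope, assuming the full unsigned 64-bit nanosecond
range: 2^64 ns ~= 1.8e19 ns ~= 1.8e10 s ~= 584 years.)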
 
> I really don't think you should look at CPU1. Not when CPU0 has such
> an interesting pattern that you dismissed just because the HPET is
> making progress.

No. I did not dismiss it because HPET is making progress. I looked at
it from a different angle.

So let's assume there is that hrtimer_remove race (I'm certainly going
to stare at that tomorrow with a fully awake brain; it's past beer
o'clock here). How do you explain the following:

1) the watchdog always triggers on CPU1?

2) the race only ever happens on CPU0?

3) the hrtimer interrupt took too long message never appears?

   If that timer interrupt loops forever, then it will complain about
   that. And it definitely leaves that code, as the backtrace of CPU0
   hits user space later on.

4) the RT throttler hit?

   Admittedly, we don't know which CPU and which task that comes from,
   but that's very simple to figure out. Debug patch below.

5) that the system makes progress afterwards?

6) ....

If my assumption about an interrupt storm turns out to be true, then
it explains all of the above. I might be wrong as usual, but I still
think it's worth a look.

Thanks,

	tglx
---

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ee15f5a0d1c1..d9e4153d405b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -895,7 +895,8 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 		 */
 		if (likely(rt_b->rt_runtime)) {
 			rt_rq->rt_throttled = 1;
-			printk_deferred_once("sched: RT throttling activated\n");
+			printk_deferred_once("sched: RT throttling activated cpu %d task %s %d\n",
+					     smp_processor_id(), current->comm, current->pid);
 		} else {
 			/*
 			 * In case we did anyway, make it go away,
--
