Date:	Fri, 19 Dec 2014 19:12:49 -0500
From:	Chris Mason <clm@...com>
To:	Thomas Gleixner <tglx@...utronix.de>
CC:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Dave Jones <davej@...hat.com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Dâniel Fraga <fragabr@...il.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Suresh Siddha <sbsiddha@...il.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Peter Anvin <hpa@...ux.intel.com>
Subject: Re: frequent lockups in 3.18rc4



On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
> On Fri, 19 Dec 2014, Chris Mason wrote:
>>  On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
>>  > Here's another pattern. In your latest thing, every single time that
>>  > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0
>>  > doing this:
>>  >
>>  > [24998.060963] NMI backtrace for cpu 0
>>  > [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not tainted 3.18.0+ #108
>>  > [24998.064073] task: ffff8801bf3536b0 ti: ffff880197e0c000 task.ti: ffff880197e0c000
>>  > [24998.065137] RIP: 0010:[<ffffffff8103e006>]  [<ffffffff8103e006>] read_hpet+0x16/0x20
>>  > [24998.083577]  [<ffffffff810e0d3e>] ktime_get+0x3e/0xa0
>>  > [24998.084450]  [<ffffffff810e9cd3>] tick_sched_timer+0x23/0x160
>>  > [24998.085315]  [<ffffffff810daf96>] __run_hrtimer+0x76/0x1f0
>>  > [24998.086173]  [<ffffffff810e9cb0>] ? tick_init_highres+0x20/0x20
>>  > [24998.087025]  [<ffffffff810db2e7>] hrtimer_interrupt+0x107/0x260
>>  > [24998.087877]  [<ffffffff81031a4b>] local_apic_timer_interrupt+0x3b/0x70
>>  > [24998.088732]  [<ffffffff8179bca5>] smp_apic_timer_interrupt+0x45/0x60
>>  > [24998.089583]  [<ffffffff8179a0df>] apic_timer_interrupt+0x6f/0x80
>>  > [24998.090435]  <EOI>
>>  > [24998.091279]  [<ffffffff810da66e>] ? __remove_hrtimer+0x4e/0xa0
>>  > [24998.092118]  [<ffffffff812c7c7a>] ? ipcget+0x8a/0x1e0
>>  > [24998.092951]  [<ffffffff812c7c6c>] ? ipcget+0x7c/0x1e0
>>  > [24998.093779]  [<ffffffff812c8d6d>] SyS_msgget+0x4d/0x70
>>  >
>>  > and I think that's the smoking gun. The reason CPU0 isn't picking up
>>  > any IPI's is because it is in some endless loop around read_hpet().
>>  >
>>  > There is even time information in the register dump:
>>  >
>>  >  RAX: 0000000061fece8a RBX: 0000000000510792 RCX: 0000000000000000
>>  >  RAX: 0000000079e588fc RBX: 0000000000511d6e RCX: 0000000000000000
>>  >  RAX: 0000000091ca7f65 RBX: 0000000000513346 RCX: 0000000000000000
>>  >  RAX: 00000000a9afbd0d RBX: 000000000051491e RCX: 0000000000000000
>>  >  RAX: 00000000cbd1340c RBX: 000000000051684a RCX: 0000000000000000
>>  >  RAX: 00000000fb9d303f RBX: 00000000005193fc RCX: 0000000000000000
>>  >  RAX: 000000002b67efe4 RBX: 000000000051c224 RCX: 0000000000000004
>>  >
>>  > That RAX value is the value we just read from the HPET, and RBX seems
>>  > to be monotonically increasing too, so it's likely the sequence
>>  > counter in ktime_get().
>>  >
>>  > So it's not stuck *inside* read_hpet(), and it's almost certainly not
>>  > the loop over the sequence counter in ktime_get() either (it's not
>>  > increasing *that* quickly). But some basically infinite __run_hrtimer
>>  > thing or something?
>> 
>>  Really interesting.
>> 
>>  So, we're calling __run_hrtimer in a loop:
>> 
>>  	while ((node = timerqueue_getnext(&base->active))) {
>>  		...
>>  		__run_hrtimer(timer, &basenow);
>>  		...
>>  	}
>> 
>>  The easy question is how often does trinity call nanosleep?
>> 
>>  Looking at __run_hrtimer(), it drops the lock and runs the function,
>>  and then takes the lock again, maybe enqueueing us again right away.
>> 
>>  timer->state is supposed to protect us from other CPUs jumping in and
>>  doing something else with the timer, but it feels racy wrt
>>  remove_hrtimer().  Something like this, but I'm not sure how often
>>  __hrtimer_start_range_ns gets called:
>> 
>>  CPU 0						CPU 1
>>  __run_hrtimer()
>>      timer->state = HRTIMER_STATE_CALLBACK
>>      removed from list
>>      unlock cpu_base->lock
>>      restart = fn(timer)
>>      						__hrtimer_start_range_ns()
>>  						base = lock_hrtimer_base()
>>  						ret = remove_hrtimer()
>>  						    finds timer->state = HRTIMER_STATE_CALLBACK
>>  						    does nothing
>>  						new_base = switch_hrtimer_base()
>>  						    now we're on a different base, different lock
>>      lock(cpu_base->lock)
>>      enqueue the timer
>>      						enqueue the timer
> 
> But at the very end this would be detected by the runtime check of the
> hrtimer interrupt, which does not trigger. And it would trigger at
> some point, as ALL CPUs, including CPU0 in that trace dump, make
> progress.

I'll admit that at some point we should be hitting one of the WARN or 
BUG_ON checks, but it's possible to thread that needle and corrupt the 
timer list without hitting a warning (CPU 1 in my example has to enqueue 
last).  Once the rbtree is hosed, it can go on forever.  Probably not the 
bug we're looking for, but still suspect in general.
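
For illustration only, here's a toy userspace replay of that interleaving
(nothing below is kernel code; struct toy_timer and the helpers are made
up).  It just walks the two columns of the diagram above in order and
prints where the timer ends up:

/*
 * Toy replay of the interleaving above.  NOT the kernel code: the struct
 * and helpers are invented, and the two CPUs' steps are run sequentially
 * in the order shown in the diagram.
 */
#include <stdio.h>

enum toy_state { TOY_INACTIVE, TOY_ENQUEUED, TOY_CALLBACK };

struct toy_timer {
	enum toy_state state;
	int base;	/* which cpu_base the timer was last queued on */
	int links;	/* how many list entries point at this timer */
};

/* Stands in for remove_hrtimer()'s early-out on a running callback. */
static void toy_remove(struct toy_timer *t)
{
	if (t->state == TOY_CALLBACK)
		return;		/* sees CALLBACK, does nothing */
	t->links--;
	t->state = TOY_INACTIVE;
}

static void toy_enqueue(struct toy_timer *t, int base)
{
	t->links++;
	t->base = base;
	t->state = TOY_ENQUEUED;
}

int main(void)
{
	struct toy_timer timer = { TOY_ENQUEUED, 0, 1 };

	/* CPU 0: __run_hrtimer() */
	timer.state = TOY_CALLBACK;	/* timer->state = HRTIMER_STATE_CALLBACK */
	timer.links--;			/* removed from list */
	/* unlock cpu_base->lock; fn(timer) runs and asks for a restart */

	/* CPU 1: __hrtimer_start_range_ns() on the same timer */
	toy_remove(&timer);		/* finds CALLBACK, does nothing */
	int new_base = 1;		/* switch_hrtimer_base(): new base, new lock */

	/* CPU 0: retakes its lock, enqueues because fn() said restart */
	toy_enqueue(&timer, 0);

	/* CPU 1: enqueues last, on the other base */
	toy_enqueue(&timer, new_base);

	printf("timer linked %d time(s), last base %d\n",
	       timer.links, timer.base);
	return 0;
}

In the real struct hrtimer there's only one rb_node, so getting linked
twice like that is the kind of hosed rbtree I mean above.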

-chris



