[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <3817.1397172208@turing-police.cc.vt.edu>
Date: Thu, 10 Apr 2014 19:23:28 -0400
From: Valdis Kletnieks <Valdis.Kletnieks@...edu>
To: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>
Cc: linux-kernel@...r.kernel.org
Subject: Possible issue with commit 6f008e72cd - locking/mutex: Fix debug checks.
So recent linux-next lock up on my laptop due to a third-party module.
Bisection finds this as the source of the issue:
6f008e72cd111a119b5d8de8c5438d892aae99eb is the first bad commit
commit 6f008e72cd111a119b5d8de8c5438d892aae99eb
Author: Peter Zijlstra <peterz@...radead.org>
Date: Wed Mar 12 13:24:42 2014 +0100
locking/mutex: Fix debug checks
OK, so commit:
1d8fe7dc8078 ("locking/mutexes: Unlock the mutex without the wait_lock")
generates this boot warning when CONFIG_DEBUG_MUTEXES=y:
WARNING: CPU: 0 PID: 139 at /usr/src/linux-2.6/kernel/locking/mutex-debug.
And that makes sense, because as soon as we release the lock a
new owner can come in...
One would think that !__mutex_slowpath_needs_to_unlock()
implementations suffer the same, but for DEBUG we fall back to
mutex-null.h which has an unconditional 1 for that.
The mutex debug code requires the mutex to be unlocked after
doing the debug checks, otherwise it can find inconsistent
state.
I was finally able to get netconsole to cough up what happened:
Apr 10 18:52:23 turing-police [ 768.277732] INFO: trying to register non-static key.
Apr 10 18:52:23 turing-police [ 768.277781] the code is fine but needs lockdep annotation.
Apr 10 18:52:23 turing-police [ 768.277794] turning off the locking correctness validator.
Apr 10 18:52:23 turing-police [ 768.277819] CPU: 0 PID: 1237 Comm: Xorg Tainted: G O 3.14.0-next-20140409-dirty #202
Apr 10 18:52:23 turing-police [ 768.277925] Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A11 03/12/2013
Apr 10 18:52:23 turing-police [ 768.277938] 0000000000000000
Apr 10 18:52:23 turing-police ffff8800c65c7c60
Apr 10 18:52:23 turing-police ffffffff81668e5f
Apr 10 18:52:23 turing-police 0000000000000000
Apr 10 18:52:23 turing-police
Apr 10 18:52:23 turing-police [ 768.277989] ffff8800c65c7cd0
Apr 10 18:52:23 turing-police ffffffff8107b9c0
Apr 10 18:52:23 turing-police ffffffff822098d0
Apr 10 18:52:23 turing-police ffff8800c74cd8f8
Apr 10 18:52:23 turing-police
Apr 10 18:52:23 turing-police [ 768.278091] ffff8800c74cd0d0
Apr 10 18:52:23 turing-police 0000000000000019
Apr 10 18:52:23 turing-police 0000000000000000
Apr 10 18:52:23 turing-police ffffffff81c2d980
Apr 10 18:52:23 turing-police
Apr 10 18:52:23 turing-police [ 768.278119] Call Trace:
Apr 10 18:52:23 turing-police [ 768.278137] [<ffffffff81668e5f>] dump_stack+0x4f/0xa2
Apr 10 18:52:23 turing-police [ 768.278153] [<ffffffff8107b9c0>] register_lock_class+0x10d/0x2bf
Apr 10 18:52:23 turing-police [ 768.278166] [<ffffffff8107d26b>] __lock_acquire+0x127/0xdd7
Apr 10 18:52:23 turing-police [ 768.278182] [<ffffffff816732aa>] ? _raw_spin_unlock+0x5b/0x67
Apr 10 18:52:23 turing-police [ 768.278196] [<ffffffff8107e25e>] lock_acquire+0xc9/0x156
Apr 10 18:52:23 turing-police [ 768.278208] [<ffffffff81079e4e>] ? down+0x11/0x55
Apr 10 18:52:23 turing-police [ 768.278305] [<ffffffff8167319e>] _raw_spin_lock_irqsave+0x44/0x55
Apr 10 18:52:23 turing-police [ 768.278317] [<ffffffff81079e4e>] ? down+0x11/0x55
Apr 10 18:52:23 turing-police [ 768.278328] [<ffffffff81079e4e>] down+0x11/0x55
A down() of an un-annotated lock (which I'll have to annotate, I guess).
But then the kernel goes off into the weeds with complaints like:
Apr 10 18:53:28 turing-police [ 833.465272] INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 10 18:53:28 turing-police {
Apr 10 18:53:28 turing-police }
Apr 10 18:53:28 turing-police (detected by 3, t=65004 jiffies, g=4552, c=4551, q=287)
Apr 10 18:53:28 turing-police [ 833.465354] INFO: Stall ended before state dump start
Apr 10 18:55:34 turing-police [ 958.896736] INFO: task systemd:1 blocked for more than 120 seconds.
from all *over* the place - systemd in a poll() syscall, ext4, snd_hda, and
a few other places. All the symptoms of busted locking leaving something
with an unintended held lock. After a bit, the box doesn't even answer
ping anymore....
I can't be sure that it's the down() call - but nothing else actually gets
*logged*. But the commit obviously changes the actual semantics of
*something*. Was this intentional?
Content of type "application/pgp-signature" skipped
Powered by blists - more mailing lists