lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:	Thu, 10 Apr 2014 19:23:28 -0400
From:	Valdis Kletnieks <Valdis.Kletnieks@...edu>
To:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>
Cc:	linux-kernel@...r.kernel.org
Subject: Possible issue with commit 6f008e72cd - locking/mutex: Fix debug checks.

So recent linux-next lock up on my laptop due to a third-party module.
Bisection finds this as the source of the issue:

6f008e72cd111a119b5d8de8c5438d892aae99eb is the first bad commit
commit 6f008e72cd111a119b5d8de8c5438d892aae99eb
Author: Peter Zijlstra <peterz@...radead.org>
Date:   Wed Mar 12 13:24:42 2014 +0100

    locking/mutex: Fix debug checks

    OK, so commit:

      1d8fe7dc8078 ("locking/mutexes: Unlock the mutex without the wait_lock")

    generates this boot warning when CONFIG_DEBUG_MUTEXES=y:

      WARNING: CPU: 0 PID: 139 at /usr/src/linux-2.6/kernel/locking/mutex-debug.

    And that makes sense, because as soon as we release the lock a
    new owner can come in...

    One would think that !__mutex_slowpath_needs_to_unlock()
    implementations suffer the same, but for DEBUG we fall back to
    mutex-null.h which has an unconditional 1 for that.

    The mutex debug code requires the mutex to be unlocked after
    doing the debug checks, otherwise it can find inconsistent
    state.

I was finally able to get netconsole to cough up what happened:

Apr 10 18:52:23 turing-police [  768.277732] INFO: trying to register non-static key.
Apr 10 18:52:23 turing-police [  768.277781] the code is fine but needs lockdep annotation.
Apr 10 18:52:23 turing-police [  768.277794] turning off the locking correctness validator.
Apr 10 18:52:23 turing-police [  768.277819] CPU: 0 PID: 1237 Comm: Xorg Tainted: G           O  3.14.0-next-20140409-dirty #202
Apr 10 18:52:23 turing-police [  768.277925] Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A11 03/12/2013
Apr 10 18:52:23 turing-police [  768.277938]  0000000000000000
Apr 10 18:52:23 turing-police  ffff8800c65c7c60
Apr 10 18:52:23 turing-police  ffffffff81668e5f
Apr 10 18:52:23 turing-police  0000000000000000
Apr 10 18:52:23 turing-police
Apr 10 18:52:23 turing-police [  768.277989]  ffff8800c65c7cd0
Apr 10 18:52:23 turing-police  ffffffff8107b9c0
Apr 10 18:52:23 turing-police  ffffffff822098d0
Apr 10 18:52:23 turing-police  ffff8800c74cd8f8
Apr 10 18:52:23 turing-police
Apr 10 18:52:23 turing-police [  768.278091]  ffff8800c74cd0d0
Apr 10 18:52:23 turing-police  0000000000000019
Apr 10 18:52:23 turing-police  0000000000000000
Apr 10 18:52:23 turing-police  ffffffff81c2d980
Apr 10 18:52:23 turing-police
Apr 10 18:52:23 turing-police [  768.278119] Call Trace:
Apr 10 18:52:23 turing-police [  768.278137]  [<ffffffff81668e5f>] dump_stack+0x4f/0xa2
Apr 10 18:52:23 turing-police [  768.278153]  [<ffffffff8107b9c0>] register_lock_class+0x10d/0x2bf
Apr 10 18:52:23 turing-police [  768.278166]  [<ffffffff8107d26b>] __lock_acquire+0x127/0xdd7
Apr 10 18:52:23 turing-police [  768.278182]  [<ffffffff816732aa>] ? _raw_spin_unlock+0x5b/0x67
Apr 10 18:52:23 turing-police [  768.278196]  [<ffffffff8107e25e>] lock_acquire+0xc9/0x156
Apr 10 18:52:23 turing-police [  768.278208]  [<ffffffff81079e4e>] ? down+0x11/0x55
Apr 10 18:52:23 turing-police [  768.278305]  [<ffffffff8167319e>] _raw_spin_lock_irqsave+0x44/0x55
Apr 10 18:52:23 turing-police [  768.278317]  [<ffffffff81079e4e>] ? down+0x11/0x55
Apr 10 18:52:23 turing-police [  768.278328]  [<ffffffff81079e4e>] down+0x11/0x55

A down() of an un-annotated lock (which I'll have to annotate, I guess).

But then the kernel goes off into the weeds with complaints like:

Apr 10 18:53:28 turing-police [  833.465272] INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 10 18:53:28 turing-police  {
Apr 10 18:53:28 turing-police }
Apr 10 18:53:28 turing-police (detected by 3, t=65004 jiffies, g=4552, c=4551, q=287)
Apr 10 18:53:28 turing-police [  833.465354] INFO: Stall ended before state dump start
Apr 10 18:55:34 turing-police [  958.896736] INFO: task systemd:1 blocked for more than 120 seconds.

from all *over* the place - systemd in a poll() syscall, ext4, snd_hda, and
a few other places.  All the symptoms of busted locking leaving something
with an unintended held lock. After a bit, the box doesn't even answer
ping anymore....

I can't be sure that it's the down() call - but nothing else actually gets
*logged*.  But the commit obviously changes the actual semantics of
*something*.  Was this intentional?




Content of type "application/pgp-signature" skipped

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ