Message-ID: <20150915214244.GS4029@linux.vnet.ibm.com>
Date: Tue, 15 Sep 2015 14:42:44 -0700
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: Tim Spriggs <tspriggs@...le.com>
Cc: peterz@...radead.org, davej@...emonkey.org.uk,
sasha.levin@...cle.com, linux-kernel@...r.kernel.org
Subject: Re: unpinning an unpinned lock
On Mon, Sep 14, 2015 at 04:22:53PM -0700, Tim Spriggs wrote:
> Hello!
Adding Sasha and LKML on CC.
> I noticed some traffic at http://www.gossamer-threads.com/lists/linux/kernel/2229593 that you three generated. I’ve run into a similar issue and I’m wondering if I can contribute anything with the info I have. Initially I was seeing deadlocks (giant load averages and processes that stop responding), and I was able to get stack traces from the stuck apps. They (cron/sshd/…) looked like this:
>
> [71458.731530] Call Trace:
> [71458.734130] [<ffffffff815614ce>] schedule+0x3e/0x90
> [71458.739250] [<ffffffff81561578>] schedule_preempt_disabled+0x18/0x30
> [71458.745857] [<ffffffff81562ee5>] __mutex_lock_slowpath+0x95/0x130
> [71458.752193] [<ffffffff81562f9b>] mutex_lock+0x1b/0x30
> [71458.757490] [<ffffffff810f2bc1>] audit_receive+0x21/0xb0
> [71458.763049] [<ffffffff814cbf05>] netlink_unicast+0x185/0x210
> [71458.768954] [<ffffffff814cccca>] netlink_sendmsg+0x32a/0x3b0
> [71458.774861] [<ffffffff8148695d>] sock_sendmsg+0x4d/0x60
> [71458.780336] [<ffffffff8148861f>] SYSC_sendto+0xef/0x130
> [71458.785806] [<ffffffff8117e679>] ? __handle_mm_fault+0xf9/0x190
> [71458.791970] [<ffffffff8101234b>] ? syscall_trace_enter_phase1+0x13b/0x170
> [71458.799006] [<ffffffff8148866e>] SyS_sendto+0xe/0x10
> [71458.804214] [<ffffffff81565517>] system_call_fastpath+0x12/0x6a
>
> I enabled lock debugging in the kernel and ran it across a bunch of nodes and waited for call traces. After the weekend, I found 3 distinct traces on 16 nodes.
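
("Lock debugging" here presumably means lockdep; something like the following .config fragment would enable the checks involved, though the exact options enabled are not stated and this is only a guess:)

CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_LOCKDEP=y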
>
> All nodes report from: kernel/locking/lockdep.c:3497 lock_unpin_lock+0xed/0xf0()
Dave Jones was seeing lock_unpin_lock(), IIRC. Me, I was seeing
infrequent silent hangs.
Thanx, Paul
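
For anyone following along: lockdep's lock pinning keeps a per-held-lock pin count, and "unpinning an unpinned lock" is the complaint emitted when an unpin is attempted while that count is already zero. Below is a minimal user-space sketch of that idea, not the kernel code itself; the names (toy_held_lock, toy_pin, toy_unpin) are invented purely for illustration:

#include <stdio.h>

/* Toy model of the pin/unpin sanity check; illustration only. */
struct toy_held_lock {
	unsigned int pin_count;
};

static void toy_pin(struct toy_held_lock *hl)
{
	hl->pin_count++;			/* mark the lock as pinned */
}

static void toy_unpin(struct toy_held_lock *hl)
{
	if (!hl->pin_count) {			/* unbalanced unpin */
		fprintf(stderr, "WARNING: unpinning an unpinned lock\n");
		return;
	}
	hl->pin_count--;
}

int main(void)
{
	struct toy_held_lock hl = { .pin_count = 0 };

	toy_pin(&hl);
	toy_unpin(&hl);		/* balanced, no warning */
	toy_unpin(&hl);		/* unbalanced, prints the warning */
	return 0;
}

In the traces below the unbalanced unpin is reported from pick_next_task_fair() under __schedule(), so the pin bookkeeping appears to get out of balance somewhere on that scheduling path.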
> All nodes state: unpinning an unpinned lock
>
> All nodes have these modules linked in: ahci(E) dca(E) dm_log(E) dm_mirror(E) dm_mod(E) dm_region_hash(E) hwmon(E) i2c_algo_bit(E) i2c_core(E) i2c_i801(E) iTCO_vendor_support(E) iTCO_wdt(E) igb(E) ioatdma(E) ipmi_msghandler(E) ipmi_si(E) ipmi_ssif(E) ixgbe(E) libahci(E) lpc_ich(E) mdio(E) mfd_core(E) microcode(E) overlay(E) pps_core(E) ptp(E) sb_edac(E) sd_mod(E) sg(E) vxlan(E) wmi(E) xhci_hcd(E) xhci_pci(E)
>
> There was one variation on Workqueue:
> 15 Workqueue: events key_garbage_collector
> 1 Workqueue: kdmflush dm_wq_work [dm_mod]
>
> I’m hoping that the extra info will help you in debugging/tracking this error down.
>
> Cheers,
> -Tim
>
> —
>
> [<ffffffff816b2a4e>] dump_stack+0x48/0x5a
> [<ffffffff81053545>] warn_slowpath_common+0x95/0xe0
> [<ffffffff81053646>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> [<ffffffff810962f1>] pick_next_task_fair+0x51/0x220
> [<ffffffff816b2c9c>] __schedule+0x12c/0x990
> [<ffffffff810a9eab>] ? __lock_acquire+0x3ab/0x560
> [<ffffffff816b367e>] schedule+0x3e/0x80
> [<ffffffff816b74d6>] schedule_timeout+0x1c6/0x220
> [<ffffffff816b4711>] ? wait_for_completion+0x91/0x110
> [<ffffffff810a8920>] ? mark_held_locks+0x70/0x90
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff810a8bef>] ? trace_hardirqs_on_caller+0xff/0x1c0
> [<ffffffff816b4719>] wait_for_completion+0x99/0x110
> [<ffffffff810875e0>] ? try_to_wake_up+0x2e0/0x2e0
> [<ffffffff810bed00>] ? __call_rcu.clone.0+0x1a0/0x1a0
> [<ffffffff810bb760>] wait_rcu_gp+0x50/0x60
> [<ffffffff810bb770>] ? wait_rcu_gp+0x60/0x60
> [<ffffffff810c02f7>] synchronize_sched+0x47/0xb0
> [<ffffffff813412dd>] key_garbage_collector+0x19d/0x3e0
> [<ffffffff8106f95f>] process_one_work+0x1df/0x570
> [<ffffffff8106f8e0>] ? process_one_work+0x160/0x570
> [<ffffffff8106fe81>] ? worker_thread+0x191/0x420
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff8106fe0d>] worker_thread+0x11d/0x420
> [<ffffffff810a8cbd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff81076a5e>] kthread+0xde/0x100
> [<ffffffff81080e6e>] ? schedule_tail+0x1e/0xd0
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
> [<ffffffff816b914f>] ret_from_fork+0x3f/0x70
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
>
> (… and a diff -u variation on this)
>
> @@ -5,6 +5,7 @@
> [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> [<ffffffff810962f1>] pick_next_task_fair+0x51/0x220
> [<ffffffff816b2c9c>] __schedule+0x12c/0x990
> + [<ffffffff810aa247>] ? __lock_release+0xc7/0x390
> [<ffffffff810a9eab>] ? __lock_acquire+0x3ab/0x560
> [<ffffffff816b367e>] schedule+0x3e/0x80
> [<ffffffff816b74d6>] schedule_timeout+0x1c6/0x220
>
> —
>
> [<ffffffff816b2a4e>] dump_stack+0x48/0x5a
> [<ffffffff81053545>] warn_slowpath_common+0x95/0xe0
> [<ffffffff81053646>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> [<ffffffff816b2f21>] __schedule+0x3b1/0x990
> [<ffffffff810a8920>] ? mark_held_locks+0x70/0x90
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff816b367e>] schedule+0x3e/0x80
> [<ffffffffa0269a45>] get_active_stripe+0x295/0x430 [raid456]
> [<ffffffff8109d1a0>] ? woken_wake_function+0x20/0x20
> [<ffffffffa0271e5e>] make_request+0x1ee/0x670 [raid456]
> [<ffffffff81547446>] ? md_make_request+0x1d6/0x480
> [<ffffffff8109d1a0>] ? woken_wake_function+0x20/0x20
> [<ffffffff81370aea>] ? generic_make_request_checks+0x15a/0x3e0
> [<ffffffff81547446>] md_make_request+0x1d6/0x480
> [<ffffffff815472c3>] ? md_make_request+0x53/0x480
> [<ffffffff81163345>] ? mempool_alloc_slab+0x15/0x20
> [<ffffffff81163499>] ? mempool_alloc+0x59/0x160
> [<ffffffff81370e21>] generic_make_request+0xb1/0xe0
> [<ffffffffa000301e>] __map_bio+0xbe/0x130 [dm_mod]
> [<ffffffffa0003352>] __clone_and_map_data_bio+0x112/0x130 [dm_mod]
> [<ffffffffa00033dc>] __split_and_process_non_flush+0x6c/0xb0 [dm_mod]
> [<ffffffffa0004c47>] __split_and_process_bio+0x207/0x2c0 [dm_mod]
> [<ffffffffa0004a65>] ? __split_and_process_bio+0x25/0x2c0 [dm_mod]
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffffa0004e8f>] dm_wq_work+0xbf/0xe0 [dm_mod]
> [<ffffffff8106f95f>] process_one_work+0x1df/0x570
> [<ffffffff8106f8e0>] ? process_one_work+0x160/0x570
> [<ffffffff8106fe81>] ? worker_thread+0x191/0x420
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff8106fe0d>] worker_thread+0x11d/0x420
> [<ffffffff810a8cbd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff81076a5e>] kthread+0xde/0x100
> [<ffffffff81080e6e>] ? schedule_tail+0x1e/0xd0
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
> [<ffffffff816b914f>] ret_from_fork+0x3f/0x70
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
>
> —
>
> [<ffffffff816b2a4e>] dump_stack+0x48/0x5a
> [<ffffffff81053545>] warn_slowpath_common+0x95/0xe0
> [<ffffffff81053646>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff810a49dd>] lock_unpin_lock+0xed/0xf0
> [<ffffffff810962f1>] pick_next_task_fair+0x51/0x220
> [<ffffffff816b2c9c>] __schedule+0x12c/0x990
> [<ffffffff810aa247>] ? __lock_release+0xc7/0x390
> [<ffffffff810a9eab>] ? __lock_acquire+0x3ab/0x560
> [<ffffffff816b367e>] schedule+0x3e/0x80
> [<ffffffff816b74d6>] schedule_timeout+0x1c6/0x220
> [<ffffffff816b4711>] ? wait_for_completion+0x91/0x110
> [<ffffffff810a8920>] ? mark_held_locks+0x70/0x90
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff810a8bef>] ? trace_hardirqs_on_caller+0xff/0x1c0
> [<ffffffff816b4719>] wait_for_completion+0x99/0x110
> [<ffffffff810875e0>] ? try_to_wake_up+0x2e0/0x2e0
> [<ffffffff810bed00>] ? __call_rcu.clone.0+0x1a0/0x1a0
> [<ffffffff810bb760>] wait_rcu_gp+0x50/0x60
> [<ffffffff810bb770>] ? wait_rcu_gp+0x60/0x60
> [<ffffffff810c02f7>] synchronize_sched+0x47/0xb0
> [<ffffffff813412dd>] key_garbage_collector+0x19d/0x3e0
> [<ffffffff8106f95f>] process_one_work+0x1df/0x570
> [<ffffffff8106f8e0>] ? process_one_work+0x160/0x570
> [<ffffffff8106fe81>] ? worker_thread+0x191/0x420
> [<ffffffff816b8910>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff8106fef8>] worker_thread+0x208/0x420
> [<ffffffff810a8cbd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff8106fcf0>] ? process_one_work+0x570/0x570
> [<ffffffff81076a5e>] kthread+0xde/0x100
> [<ffffffff81080e6e>] ? schedule_tail+0x1e/0xd0
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
> [<ffffffff816b914f>] ret_from_fork+0x3f/0x70
> [<ffffffff81076980>] ? __init_kthread_worker+0x70/0x70
>