netdev - Re: lockdep trace from rc2.


Date:	Mon, 25 Feb 2008 11:46:24 +0100
From:	Johannes Berg <johannes@...solutions.net>
To:	Dave Jones <davej@...emonkey.org.uk>
Cc:	netdev@...r.kernel.org
Subject: Re: lockdep trace from rc2.

On Sun, 2008-02-24 at 21:22 -0500, Dave Jones wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=431038 has some more info,
> but the trace is below...
> I'll get an rc3 kernel built and ask the user to retest, but in case this
> isn't a known problem, I'm forwarding this here.

I can't fix it but I can explain it.

> Feb 24 17:53:21 cirithungol kernel: ip/10650 is trying to acquire lock:
> Feb 24 17:53:21 cirithungol kernel:  (events){--..}, at: [<c0436f9a>] flush_workqueue+0x0/0x85
> Feb 24 17:53:21 cirithungol kernel: 
> Feb 24 17:53:21 cirithungol kernel: but task is already holding lock:
> Feb 24 17:53:21 cirithungol kernel:  (rtnl_mutex){--..}, at: [<c05cea31>] rtnetlink_rcv+0x12/0x26
> Feb 24 17:53:21 cirithungol kernel: 
> Feb 24 17:53:21 cirithungol kernel: which lock already depends on the new lock.

What's happening here is that the linkwatch_work runs on the generic
schedule_work() workqueue.

> Feb 24 17:53:21 cirithungol kernel: -> #1 ((linkwatch_work).work){--..}:

The function that is called is linkwatch_event(), which acquires the
RTNL as you can see here:

> Feb 24 17:53:21 cirithungol kernel: -> #2 (rtnl_mutex){--..}:
> Feb 24 17:53:21 cirithungol kernel:        [<c04458f7>] __lock_acquire+0xa7c/0xbf4
> Feb 24 17:53:21 cirithungol kernel:        [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel:        [<c04415dc>] tick_program_event+0x31/0x55
> Feb 24 17:53:21 cirithungol kernel:        [<c0445ad9>] lock_acquire+0x6a/0x90
> Feb 24 17:53:21 cirithungol kernel:        [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel:        [<c0638d21>] mutex_lock_nested+0xdb/0x271
> Feb 24 17:53:21 cirithungol kernel:        [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel:last message repeated 2 times
> Feb 24 17:53:21 cirithungol kernel:        [<c05cf755>] linkwatch_event+0x8/0x22

The problem with that is that tulip_down() calls flush_scheduled_work()
while holding the RTNL:

> Feb 24 17:53:21 cirithungol kernel:        [<c0436f9a>] flush_workqueue+0x0/0x85
> Feb 24 17:53:21 cirithungol kernel:        [<c043702c>] flush_scheduled_work+0xd/0xf
> Feb 24 17:53:21 cirithungol kernel:        [<f8f4380a>] tulip_down+0x20/0x1a3 [tulip]
[...]
> Feb 24 17:53:21 cirithungol kernel:        [<c05cea3d>] rtnetlink_rcv+0x1e/0x26

(rtnetlink_rcv will acquire the RTNL)

The deadlock that can now happen is this: linkwatch_work is scheduled on
the workqueue but not running yet. During tulip_down(),
flush_scheduled_work() is called, which waits for everything that is
scheduled to complete. Among those things could be linkwatch_event(),
which will start running and try to acquire the RTNL. Because the RTNL is
already held, linkwatch_event() will block on it, while at the same time
we're waiting for linkwatch_event() to finish while holding the RTNL.
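
To make the cycle concrete, here is a rough userspace sketch of the kind
of check that produces such a report. It is not kernel code and not how
lockdep is actually implemented; the enum values and the
add_dependency()/reachable() helpers are made up for illustration and
only borrow the lock names from the trace above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes, named after the ones in the trace. */
enum lock_class { LC_EVENTS, LC_LINKWATCH_WORK, LC_RTNL_MUTEX, LC_COUNT };

/* dep[a][b] is true once b has been acquired while a was held. */
static bool dep[LC_COUNT][LC_COUNT];

/* Depth-first search: can we get from `from` to `to` along recorded edges? */
static bool reachable(enum lock_class from, enum lock_class to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < LC_COUNT; next++) {
        if (dep[from][next] && !seen[next] &&
            reachable((enum lock_class)next, to, seen))
            return true;
    }
    return false;
}

/*
 * Record "new_lock is acquired while held is held".
 * Returns false, and records nothing, if new_lock already depends on
 * held, i.e. if the new edge would close a cycle.
 */
static bool add_dependency(enum lock_class held, enum lock_class new_lock)
{
    bool seen[LC_COUNT] = { false };

    assert(held < LC_COUNT && new_lock < LC_COUNT);
    if (reachable(new_lock, held, seen))
        return false;
    dep[held][new_lock] = true;
    return true;
}
```

Feeding it the three steps from the trace, in order (events runs
linkwatch_work, linkwatch_work takes rtnl_mutex, then flush is called
with rtnl_mutex held), the first two calls are accepted and the third is
rejected, which corresponds to "which lock already depends on the new
lock".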

The fix here would most likely be to not use flush_scheduled_work() but
rather cancel_work_sync().

This should be a correct change afaict, unless tulip has more work
structs than the media work.

@@ tulip_down
-	flush_scheduled_work();
+	cancel_work_sync(&tp->media_work);
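
The reason this helps can be sketched in plain userspace C. This is only
a toy model, not the workqueue API: struct work_item and both helper
functions are invented for illustration, and the assumption that the
media work does not itself take the RTNL is mine, not something I have
checked in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one item on the shared workqueue. */
struct work_item {
    const char *name;
    bool pending;     /* queued but not finished yet */
    bool takes_rtnl;  /* its handler calls rtnl_lock() */
};

/*
 * Flushing waits for every pending item, so it can deadlock as soon as
 * any one of them needs a lock the caller is holding.
 */
static bool flush_can_deadlock(const struct work_item *queue, size_t n,
                               bool caller_holds_rtnl)
{
    for (size_t i = 0; i < n; i++) {
        if (queue[i].pending && queue[i].takes_rtnl && caller_holds_rtnl)
            return true;
    }
    return false;
}

/*
 * Cancelling one item synchronously only ever waits for that item, so
 * only its own locking matters (conservatively assumed to be running).
 */
static bool cancel_sync_can_deadlock(const struct work_item *target,
                                     bool caller_holds_rtnl)
{
    assert(target != NULL);
    return target->takes_rtnl && caller_holds_rtnl;
}
```

With a queue holding a media work item that does not take the RTNL and
a linkwatch item that does, the flush variant reports a possible
deadlock when the caller holds the RTNL, while cancelling just the
media work does not.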

johannes
