Message-Id: <1203936384.13162.77.camel@johannes.berg>
Date: Mon, 25 Feb 2008 11:46:24 +0100
From: Johannes Berg <johannes@...solutions.net>
To: Dave Jones <davej@...emonkey.org.uk>
Cc: netdev@...r.kernel.org
Subject: Re: lockdep trace from rc2.
On Sun, 2008-02-24 at 21:22 -0500, Dave Jones wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=431038 has some more info,
> but the trace is below...
> I'll get an rc3 kernel built and ask the user to retest, but in case this
> isn't a known problem, I'm forwarding this here.
I can't fix it but I can explain it.
> Feb 24 17:53:21 cirithungol kernel: ip/10650 is trying to acquire lock:
> Feb 24 17:53:21 cirithungol kernel: (events){--..}, at: [<c0436f9a>] flush_workqueue+0x0/0x85
> Feb 24 17:53:21 cirithungol kernel:
> Feb 24 17:53:21 cirithungol kernel: but task is already holding lock:
> Feb 24 17:53:21 cirithungol kernel: (rtnl_mutex){--..}, at: [<c05cea31>] rtnetlink_rcv+0x12/0x26
> Feb 24 17:53:21 cirithungol kernel:
> Feb 24 17:53:21 cirithungol kernel: which lock already depends on the new lock.
What's happening here is that linkwatch_work runs on the generic
schedule_work() workqueue (the "events" workqueue):
> Feb 24 17:53:21 cirithungol kernel: -> #1 ((linkwatch_work).work){--..}:
The function that is called is linkwatch_event(), which acquires the
RTNL as you can see here:
> Feb 24 17:53:21 cirithungol kernel: -> #2 (rtnl_mutex){--..}:
> Feb 24 17:53:21 cirithungol kernel: [<c04458f7>] __lock_acquire+0xa7c/0xbf4
> Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel: [<c04415dc>] tick_program_event+0x31/0x55
> Feb 24 17:53:21 cirithungol kernel: [<c0445ad9>] lock_acquire+0x6a/0x90
> Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel: [<c0638d21>] mutex_lock_nested+0xdb/0x271
> Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel:last message repeated 2 times
> Feb 24 17:53:21 cirithungol kernel: [<c05cf755>] linkwatch_event+0x8/0x22
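For reference, the linkwatch side looks roughly like this (a paraphrased
sketch of net/core/link_watch.c, not a verbatim copy of the file):

#include <linux/workqueue.h>
#include <linux/rtnetlink.h>

/* Sketch: the handler is queued with schedule_delayed_work(), so it runs
 * on the shared "events" workqueue, and it takes the RTNL itself. */
static void linkwatch_event(struct work_struct *dummy)
{
	rtnl_lock();
	/* ... process the queued link state changes ... */
	rtnl_unlock();
}

static DECLARE_DELAYED_WORK(linkwatch_work, linkwatch_event);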
The problem with that is that tulip_down() calls flush_scheduled_work()
while holding the RTNL:
> Feb 24 17:53:21 cirithungol kernel: [<c0436f9a>] flush_workqueue+0x0/0x85
> Feb 24 17:53:21 cirithungol kernel: [<c043702c>] flush_scheduled_work+0xd/0xf
> Feb 24 17:53:21 cirithungol kernel: [<f8f4380a>] tulip_down+0x20/0x1a3 [tulip]
[...]
> Feb 24 17:53:21 cirithungol kernel: [<c05cea3d>] rtnetlink_rcv+0x1e/0x26
(rtnetlink_rcv will acquire the RTNL)
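The tulip side, again simplified rather than quoted verbatim, looks
roughly like this; tulip_down() is reached from the device's close/stop
path, which always runs with the RTNL held:

/* Simplified sketch of tulip_down() in the tulip driver */
static void tulip_down(struct net_device *dev)
{
	flush_scheduled_work();		/* waits for *every* pending item on
					 * the shared workqueue, including a
					 * queued linkwatch_event() */

	/* ... disable interrupts, stop DMA, free descriptor rings ... */
}

So one side holds the RTNL and then waits for the events workqueue,
while the other side runs on the events workqueue and then takes the
RTNL.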
The deadlock that can now happen is that linkwatch_work is scheduled on
the workqueue but not running yet. During tulip_down(),
flush_scheduled_work() is called, which waits for everything that is
scheduled to complete. Among those things could be linkwatch_event(),
which starts running and tries to acquire the RTNL. Since the RTNL is
already held by the task inside tulip_down(), linkwatch_event() blocks
on it, while tulip_down() in turn is waiting for linkwatch_event() to
finish while still holding the RTNL: neither side can make progress.
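In other words it's the classic AB-BA pattern, with "wait for the events
workqueue" playing the part of the second lock. A minimal, hypothetical
illustration of the shape of the bug (made-up names, not tulip code):

#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(big_lock);		/* plays the role of the RTNL */

static void my_work_fn(struct work_struct *work)
{
	mutex_lock(&big_lock);		/* like linkwatch_event() */
	/* ... */
	mutex_unlock(&big_lock);
}
static DECLARE_WORK(my_work, my_work_fn);

static void bad_teardown(void)
{
	mutex_lock(&big_lock);		/* like rtnetlink_rcv() */
	/*
	 * If my_work is pending here, flush_scheduled_work() waits for
	 * my_work_fn() to finish, but my_work_fn() is blocked on
	 * big_lock, which we hold: deadlock.
	 */
	flush_scheduled_work();
	mutex_unlock(&big_lock);
}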
The fix here would most likely be not to use flush_scheduled_work()
but rather cancel_work_sync() on the driver's own work struct.
This should be a correct change afaict, unless tulip has more work
structs than the media work.
@@ tulip_down
- flush_scheduled_work();
+ cancel_work_sync(&tp->media_work);
johannes