[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <10270.1283370390@death>
Date: Wed, 01 Sep 2010 12:46:30 -0700
From: Jay Vosburgh <fubar@...ibm.com>
To: Jarek Poplawski <jarkao2@...il.com>
cc: Jiri Bohac <jbohac@...e.cz>, bonding-devel@...ts.sourceforge.net,
markine@...gle.com, chavey@...gle.com, netdev@...r.kernel.org
Subject: Re: [RFC] bonding: fix workqueue re-arming races
Jarek Poplawski <jarkao2@...il.com> wrote:
>On Wed, Sep 01, 2010 at 09:11:06PM +0200, Jiri Bohac wrote:
>> On Wed, Sep 01, 2010 at 09:00:37PM +0200, Jarek Poplawski wrote:
>> > On Wed, Sep 01, 2010 at 05:37:30PM +0200, Jarek Poplawski wrote:
>> > > On Wed, Sep 01, 2010 at 05:18:56PM +0200, Jarek Poplawski wrote:
>> > > > On Wed, Sep 01, 2010 at 03:30:56PM +0200, Jiri Bohac wrote:
>> > > > > On Wed, Sep 01, 2010 at 12:23:56PM +0000, Jarek Poplawski wrote:
>> > > > > > On 2010-08-31 22:54, Jay Vosburgh wrote:
>> > > > > > > What prevents this from deadlocking such that cpu A is in
>> > > > > > > bond_close, holding RTNL and in cancel_delayed_work_sync, while cpu B is
>> > > > > > > in the above function, trying to acquire RTNL?
>> > > > > >
>> > > > > > I guess this one isn't cancelled in bond_close, so it should be safe.
>> > > > >
>> > > > > Nah, Jay was correct. Although this work item is not explicitly
>> > > > > cancelled with cancel_delayed_work_sync(), it is on the same
>> > > > > workqueue as work items that are being cancelled with
>> > > > > cancel_delayed_work_sync(), so this can still cause a deadlock.
>> > > > > Fixed in the new version of the patch by putting these on a
>> > > > > separate workqueue.
>> > > > >
>> > > >
>> > > > Maybe I miss something, but the same workqueue shouldn't matter here.
>> > >
>> > > Hmm... I missed your point completely and Jay was correct!
>> >
>> > Hmm#2... Alas, after getting back my sobriety, I've to say that Jay
>> > was wrong: the same workqueue shouldn't matter here. Similar things
>> > are done by other network code with the kernel-global workqueue, eg.
>> > in tg3_close(), rhine_close() etc.
>>
>> But these don't do rtnl_lock() inside the work item, do they?
>
>Exactly. Just like work items cancelled from bond_work_cancel_all()
>after your patch.
I see what Jarek is getting at here: the mii_commit, etc, work
items new to the patch aren't cancelled by bond_close, so bond_close (in
cancel_delayed_work_sync) shouldn't care if they're executing or not.
This still would leave the new work items (the "commit" ones
added in the patch) always free to run at some arbitrary time after
close, which makes me uneasy. I don't think the extra "wq_rtnl" makes
any difference, though.
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@...ibm.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists