Date: Tue, 10 Nov 2009 22:48:40 -0800 (PST)
From: David Miller <davem@...emloft.net>
To: shemminger@...tta.com
Cc: karaluh@...aluh.pl, ecashin@...aid.com, roel.kluin@...il.com,
harvey.harrison@...il.com, bzolnier@...il.com,
netdev@...r.kernel.org
Subject: Re: [PATCH 04/10] AOE: use rcu to find network device
From: Stephen Hemminger <shemminger@...tta.com>
Date: Tue, 10 Nov 2009 15:53:16 -0800
> On Tue, 10 Nov 2009 15:06:17 -0800
> Stephen Hemminger <shemminger@...tta.com> wrote:
>
>> On Tue, 10 Nov 2009 15:01:49 -0500
>> Ed Cashin <ecashin@...aid.com> wrote:
>>
>> > On Tue Nov 10 13:07:37 EST 2009, shemminger@...tta.com wrote:
>> > > This gets rid of another use of read_lock(&dev_base_lock) by using
>> > > RCU. Also, it only increments the reference count of the device actually
>> > > used rather than holding and releasing every device.
>> > >
>> > > Compile tested only.
>> >
>> > This function runs once a minute when the aoe driver is loaded,
>> > if you'd like to test it a bit more.
>> >
>> > It looks like there's no dev_put corresponding to the dev_hold
>> > after the changes.
>> >
>>
>> Hmm, looks like AOE actually is not ref counting the network device.
>> So my patch is incorrect.
>>
>> As it stands (before my patch), it is UNSAFE. It can decide to queue
>> packets to a device that is removed out from underneath it causing
>> reference to freed memory.
>>
>> Moving the rcu_read_lock up to aoecmd_cfg() would solve that but the
>> whole driver appears to be unsafe about device refcounting and handling
>> device removal properly.
>>
>> It needs to:
>>
>> 1. Get a device ref count when it remembers a device (i.e. addif)
>> 2. Install a notifier that looks for device removal events
>> 3. In notifier, remove interface, including flushing all pending
>> skb's for that device.
>>
>> This obviously is beyond the scope of the RCU stuff.
>
> Here is a patch to get you going, it does compile but it probably
> won't work because the code doesn't handle the case of the last
> device going away from a target. This is yet another pre-existing
> bug: if a timeout happens and ejectif() is called to remove a device,
> resend() will BUG in ifrotate() once all devices are gone.
I'm holding off on the RCU patch until these known refcount bugs are
fixed.
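
For reference, the three steps Stephen lists (take a ref in addif, install
a notifier, eject the interface and flush its skbs on removal) can be
modelled in plain userspace C. This is only a sketch under stated
assumptions, not the aoe driver's code: every mock_* and target_* name
below is hypothetical. In the kernel the counterparts would be
dev_hold()/dev_put() and a callback installed with
register_netdevice_notifier() that reacts to NETDEV_UNREGISTER. The last
helper also returns NULL when no interface is left, instead of hitting
the BUG described for ifrotate().

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for illustration only; none of this is kernel API. */
struct mock_netdev {
    const char *name;
    int refcnt;                 /* models the device reference count */
};

static void mock_dev_hold(struct mock_netdev *d) { d->refcnt++; }

static void mock_dev_put(struct mock_netdev *d)
{
    assert(d->refcnt > 0);      /* an unbalanced put is a bug */
    d->refcnt--;
}

#define MAX_IFS 4

struct mock_target {
    struct mock_netdev *ifs[MAX_IFS];
    int nifs;
    int pending[MAX_IFS];       /* frames queued per interface */
};

/* Step 1: take a reference whenever the target remembers a device. */
static int target_addif(struct mock_target *t, struct mock_netdev *d)
{
    if (t->nifs == MAX_IFS)
        return -1;
    mock_dev_hold(d);
    t->ifs[t->nifs] = d;
    t->pending[t->nifs] = 0;
    t->nifs++;
    return 0;
}

/*
 * Step 3: forget the device, discard its queued frames and drop the
 * reference taken in target_addif(). Returns the number of frames
 * flushed, or -1 if the device was not attached to this target.
 */
static int target_ejectif(struct mock_target *t, struct mock_netdev *d)
{
    for (int i = 0; i < t->nifs; i++) {
        if (t->ifs[i] != d)
            continue;
        int flushed = t->pending[i];
        mock_dev_put(d);
        /* Move the last slot into the hole, then clear the last slot. */
        t->nifs--;
        t->ifs[i] = t->ifs[t->nifs];
        t->pending[i] = t->pending[t->nifs];
        t->ifs[t->nifs] = NULL;
        t->pending[t->nifs] = 0;
        return flushed;
    }
    return -1;
}

/* Step 2: the notifier callback; only device removal is of interest. */
enum mock_event { MOCK_DEV_UP, MOCK_DEV_UNREGISTER };

static void target_netdev_event(struct mock_target *t, enum mock_event ev,
                                struct mock_netdev *d)
{
    if (ev == MOCK_DEV_UNREGISTER)
        target_ejectif(t, d);
}

/* Transmit path: report "no interface left" instead of crashing. */
static struct mock_netdev *target_pick_if(struct mock_target *t)
{
    return t->nifs > 0 ? t->ifs[0] : NULL;
}
```

The point of the model is the pairing: every hold in target_addif() has
exactly one matching put in target_ejectif(), and the transmit path has
to cope with an empty interface list after the last device is ejected.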