Date: Thu, 31 Dec 2009 20:06:18 -0700
From: "Berck E. Nash" <flyboy@...il.com>
To: Mike McCormack <mikem@...g3k.org>
CC: Jarek Poplawski <jarkao2@...il.com>,
Stephen Hemminger <shemminger@...tta.com>,
netdev@...r.kernel.org, dhazelton@...er.net, mbreuer@...jas.com
Subject: Re: [PATCH] sky2: Lock transmit queue while disabling device
Well, that didn't fix it. Oops attached, looks pretty much the same to me.
Mike McCormack wrote:
> Hi Jarek,
>
> This is based on my analysis of the oops at:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=14925
>
> Specifically:
>
>>>> [ 8673.345873] sky2 eth0: receiver hang detected
>>>> [ 8673.350368] sky2 eth0: disabling interface
>>>> [ 8673.354749] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
>>>> [ 8673.359748] IP: [<ffffffffa00373d3>] sky2_xmit_frame+0x321/0x5d8 [sky2]
>
> netif_device_detach() does not guarantee that all transmits have completed
> after it returns.
>
> CPU 1 stack will look like:
>
> dev_queue_xmit()
>   HARD_TX_LOCK() -> __netif_tx_lock()
>   ...
>   dev_hard_start_xmit()
>     ops->ndo_start_xmit() -> sky2_xmit_frame()
>       sky2_xmit_frame() pushing skb to hardware
>       use NULL tx_ring here
>
>
> CPU 2 stack will look like:
>
> sky2_restart()
>   rtnl_lock()
>   sky2_detach()
>     netif_device_detach()
>     sky2_down()
>       printk("sky2 eth0: disabling interface")
>       ...
>       sky2_free_buffers(sky2);
>         sky2->tx_ring = NULL;
>   ...
>
> Another way to solve the problem would be to take the transmit lock in
> netif_device_detach() to make sure that any in progress transmits have
> completed before returning.
>
> Note that most of these backtraces are using the nvidia binary-only
> module. This may change the timings and make the sky2 race more likely,
> or be involved in the "tx timeout" condition that triggers a sky2_restart().
>
> Will test with netif_tx_lock_bh and resubmit.
>
> thanks,
>
> Mike
>
>
>
>
> Jarek Poplawski wrote:
>> Mike McCormack wrote, On 12/31/2009 11:55 AM:
>>
>>> netif_device_detach() does not take the tx_lock, so it's
>>> possible that a call to sky2_xmit_frame is still in
>>> progress after netif_device_detach() is complete.
>>>
>>> Take netif_tx_lock() to make sure all transmits have
>>> stopped while we're disabling the device and that
>>> no other CPU is still transmitting a frame after
>>> we've disabled the device.
>>>
>>> Proposed fix for "sky2 panic under load" reported by Berck E. Nash.
>> Could you give some scenario of the oops/fix?
>> Btw, even if it worked, you should use the netif_tx_lock_bh
>> version, considering the contexts sky2_detach is called from, I guess.
>>
>> Jarek P.
>>
>>> Signed-off-by: Mike McCormack <mikem@...g3k.org>
>>> ---
>>> drivers/net/sky2.c | 2 ++
>>> 1 files changed, 2 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
>>> index faa4841..8ae8520 100644
>>> --- a/drivers/net/sky2.c
>>> +++ b/drivers/net/sky2.c
>>> @@ -3176,7 +3176,9 @@ static void sky2_reset(struct sky2_hw *hw)
>>>  static void sky2_detach(struct net_device *dev)
>>>  {
>>>  	if (netif_running(dev)) {
>>> +		netif_tx_lock(dev);
>>>  		netif_device_detach(dev);	/* stop txq */
>>> +		netif_tx_unlock(dev);
>>>  		sky2_down(dev);
>>>  	}
>>>  }
>>
>
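For anyone following the quoted analysis, the locking idea behind the patch can be sketched in plain userspace C. This is only a model, not sky2 code: struct fake_dev and the fake_* functions are hypothetical names, a pthread mutex stands in for the netdev tx lock, and a bool stands in for the stopped queue state. It shows why marking the queue stopped while holding the same lock the transmit path holds means no transmit can still be touching the ring once detach returns. It says nothing about whether this is the whole story for the oops reported above.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for driver state; NOT the real sky2 structures. */
struct fake_dev {
	pthread_mutex_t tx_lock;	/* models the netdev tx lock */
	bool tx_stopped;		/* models the stopped/detached queue */
	int *tx_ring;			/* freed and set to NULL on the way down */
	size_t ring_size;
	size_t next;
};

int fake_dev_init(struct fake_dev *d, size_t ring_size)
{
	d->tx_ring = calloc(ring_size, sizeof(*d->tx_ring));
	if (!d->tx_ring)
		return -1;
	d->ring_size = ring_size;
	d->next = 0;
	d->tx_stopped = false;
	return pthread_mutex_init(&d->tx_lock, NULL);
}

/* Transmit path: the lock is held for as long as the ring is touched. */
int fake_xmit(struct fake_dev *d, int frame)
{
	int ret = -1;	/* "busy": queue stopped, ring not touched */

	pthread_mutex_lock(&d->tx_lock);
	if (!d->tx_stopped) {
		d->tx_ring[d->next] = frame;
		d->next = (d->next + 1) % d->ring_size;
		ret = 0;
	}
	pthread_mutex_unlock(&d->tx_lock);
	return ret;
}

/*
 * Detach: taking the lock waits for any transmit already in progress;
 * after unlock, every later transmit sees tx_stopped and backs off.
 */
void fake_detach(struct fake_dev *d)
{
	pthread_mutex_lock(&d->tx_lock);
	d->tx_stopped = true;
	pthread_mutex_unlock(&d->tx_lock);
}

/* Teardown: only safe to call after fake_detach() has returned. */
void fake_down(struct fake_dev *d)
{
	free(d->tx_ring);
	d->tx_ring = NULL;
}
```

In this model, calling fake_detach() and then fake_down() makes a later fake_xmit() return -1 instead of dereferencing the NULL ring. Without the lock in fake_detach(), a transmit that had already passed the tx_stopped check could still be writing to the ring while fake_down() frees it.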
View attachment "sky2crash2.txt" of type "text/plain" (5380 bytes)