Message-ID: <db6848dc-cf1b-0989-570c-af5bdd1a7bd1@gmail.com>
Date: Mon, 29 Oct 2018 20:52:39 -0700
From: Eric Dumazet <eric.dumazet@...il.com>
To: Cong Wang <xiyou.wangcong@...il.com>,
Paweł Staszewski <pstaszewski@...are.pl>
Cc: Linux Kernel Network Developers <netdev@...r.kernel.org>,
Dimitris Michailidis <dmichail@...gle.com>
Subject: Re: Latest net-next kernel 4.19.0+
On 10/29/2018 07:53 PM, Eric Dumazet wrote:
>
>
> On 10/29/2018 07:27 PM, Cong Wang wrote:
>> Hi,
>>
>> On Mon, Oct 29, 2018 at 5:19 PM Paweł Staszewski <pstaszewski@...are.pl> wrote:
>>>
>>> Sorry not complete - followed by hw csum:
>>>
>>> [ 342.190831] vlan1490: hw csum failure
>>> [ 342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
>>> [ 342.190836] Call Trace:
>>> [ 342.190839] <IRQ>
>>> [ 342.190849] dump_stack+0x46/0x5b
>>> [ 342.190856] __skb_checksum_complete+0x9a/0xa0
>>> [ 342.190859] tcp_v4_rcv+0xef/0x960
>>> [ 342.190864] ip_local_deliver_finish+0x49/0xd0
>>> [ 342.190866] ip_local_deliver+0x5e/0xe0
>>> [ 342.190869] ? ip_sublist_rcv_finish+0x50/0x50
>>> [ 342.190870] ip_rcv+0x41/0xc0
>>> [ 342.190874] __netif_receive_skb_one_core+0x4b/0x70
>>> [ 342.190877] netif_receive_skb_internal+0x2f/0xd0
>>> [ 342.190879] napi_gro_receive+0xb7/0xe0
>>> [ 342.190884] mlx5e_handle_rx_cqe+0x7a/0xd0
>>> [ 342.190886] mlx5e_poll_rx_cq+0xc6/0x930
>>> [ 342.190888] mlx5e_napi_poll+0xab/0xc90
>>
>>
>> We got exactly the same backtrace in our data center. However,
>> it is not easy for us to reproduce it, do you have any clue to reproduce it?
>>
>> If you do, try to tcpdump the packets triggering this warning, it could
>> be useful for debugging.
>>
>> Also, we tried to apply commit d55bef5059dd057bd, the warning _still_
>> occurs. We tried to revert the offending commit 88078d98d1bb, it
>> disappears. So it is likely that commit 88078d98d1bb introduces
>> more troubles than the one fixed by d55bef5059dd057bd.
>>
>
> Or this could be that mlx5 driver is buggy when dealing with VLAN tags.
>
> It both uses vlan_tci (hardware vlan offload) in skb _and_ this piece of code in mlx5e_handle_csum()
>
>         if (network_depth > ETH_HLEN)
>                 /* CQE csum is calculated from the IP header and does
>                  * not cover VLAN headers (if present). This will add
>                  * the checksum manually.
>                  */
>                 skb->csum = csum_partial(skb->data + ETH_HLEN,
>                                          network_depth - ETH_HLEN,
>                                          skb->csum);
>
>
> That seems strange to me, because skb_vlan_untag() will not adjust skb->csum in this case.
>
The bug might be in the mlx5 handling of NETIF_F_RXFCS, by the way...
The code currently does:
        if (unlikely(netdev->features & NETIF_F_RXFCS))
                skb->csum = csum_add(skb->csum,
                                     (__force __wsum)mlx5e_get_fcs(skb));
But Dimitris told us that we need to take into account whether the FCS starts at an odd or an even offset.
So this should become:
        if (unlikely(netdev->features & NETIF_F_RXFCS))
                skb->csum = csum_block_add(skb->csum,
                                           (__force __wsum)mlx5e_get_fcs(skb),
                                           skb->len);
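
To illustrate why the parity of the offset matters, here is a minimal user-space sketch (not kernel code, and not taken from the driver). The helper names fold16(), sum16(), add16() and block_add16() are made up for this illustration. block_add16() only mimics the idea behind csum_block_add() on 16-bit folded sums: when a block starts at an odd byte offset, its bytes land in the opposite halves of the 16-bit words, so its partial sum has to be byte-swapped before it is added.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement sum of buf, read as big-endian 16-bit words.
 * A trailing odd byte is padded with a zero low byte.
 * Assumes len is small enough that the accumulator cannot overflow.
 */
static uint16_t sum16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)
		sum += (uint32_t)buf[i] << 8;
	return fold16(sum);
}

/* Plain ones' complement addition; ignores where the block sits. */
static uint16_t add16(uint16_t a, uint16_t b)
{
	return fold16((uint32_t)a + b);
}

/* Hypothetical stand-in for the csum_block_add() idea: if the block
 * starts at an odd offset, swap the bytes of its sum before adding.
 */
static uint16_t block_add16(uint16_t sum, uint16_t block, size_t offset)
{
	if (offset & 1)
		block = (uint16_t)((block << 8) | (block >> 8));
	return add16(sum, block);
}
```

With a 7-byte payload followed by a 4-byte trailer, block_add16(sum16(pkt, 7), sum16(pkt + 7, 4), 7) matches sum16(pkt, 11), while add16() on the same two partial sums does not. With an even split both helpers agree, which would explain why the plain csum_add() only misbehaves for some packet lengths.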