Date:   Sun, 18 Nov 2018 15:51:10 +0100
From:   Andre Tomt <andre@...t.net>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     netdev <netdev@...r.kernel.org>,
        Cong Wang <xiyou.wangcong@...il.com>
Subject: Re: hw csum failure + conntrack with more debugging information

On 18.11.2018 02:12, Eric Dumazet wrote:
> 
> 
> On Sat, Nov 17, 2018 at 3:18 PM Andre Tomt <andre@...t.net> wrote:
> 
>     I added Cong Wang's hw csum failure debug patch to my 4.19.2 tree and
>     got a splat with a bit more information.
> 
>      > [47273.905616] p0xe0: hw csum failure
>      > [47273.905642] dev features: 0x000860c000114bb3
>      > [47273.905663] skb len=44 data_len=0 gso_size=0 gso_type=0
>     ip_summed=2 csum=0, csum_complete_sw=0, csum_valid=0
>      > [47273.905706] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.0-1 #1
>      > [47273.905707] Hardware name: Supermicro Super
>     Server/X10SDV-4C-TLN2F, BIOS 2.0 06/13/2018
>      > [47273.905707] Call Trace:
>      > [47273.905710]  <IRQ>
>      > [47273.905717]  dump_stack+0x5c/0x80
>      > [47273.905721]  __skb_checksum_complete+0xaf/0xc0
>      > [47273.905731]  icmp_error+0x1c8/0x1f0 [nf_conntrack]
>      > [47273.905734]  ? skb_copy_bits+0x13d/0x220
>      > [47273.905740]  nf_conntrack_in+0xd8/0x390 [nf_conntrack]
>      > [47273.905743]  ? ___pskb_trim+0x192/0x330
>      > [47273.905746]  nf_hook_slow+0x43/0xc0
>      > [47273.905749]  ip_rcv+0x90/0xb0
>      > [47273.905752]  ? ip_rcv_finish_core.isra.0+0x310/0x310
>      > [47273.905754]  __netif_receive_skb_one_core+0x42/0x50
>      > [47273.905756]  netif_receive_skb_internal+0x24/0xb0
>      > [47273.905758]  napi_gro_frags+0x177/0x210
>      > [47273.905762]  mlx4_en_process_rx_cq+0x8df/0xb50 [mlx4_en]
>      > [47273.905773]  ? mlx4_eq_int+0x38f/0xcb0 [mlx4_core]
>      > [47273.905776]  mlx4_en_poll_rx_cq+0x55/0xf0 [mlx4_en]
>      > [47273.905778]  net_rx_action+0xe1/0x2c0
>      > [47273.905781]  __do_softirq+0xe7/0x2d3
>      > [47273.905784]  irq_exit+0x96/0xd0
>      > [47273.905786]  do_IRQ+0x85/0xd0
>      > [47273.905790]  common_interrupt+0xf/0xf
>      > [47273.905791]  </IRQ>
>      > [47273.905794] RIP: 0010:cpuidle_enter_state+0xb9/0x320
>      > [47273.905796] Code: e8 3c 15 bc ff 80 7c 24 0b 00 74 17 9c 58 0f
>     1f 44 00 00 f6 c4 02 0f 85 3b 02 00 00 31 ff e8 6e fa c0 ff fb 66 0f
>     1f 44 00 00 <48> b8 ff ff ff ff f3 01 00 00 48 2b 1c 24 ba ff ff ff
>     7f 48 39 c3
>      > [47273.905798] RSP: 0018:ffffb75601943ea8 EFLAGS: 00000246
>     ORIG_RAX: ffffffffffffffdb
>      > [47273.905801] RAX: ffff9d636fa60fc0 RBX: 00002afed059e821 RCX:
>     000000000000001f
>      > [47273.905802] RDX: 00002afed059e821 RSI: 000000003a2ea91a RDI:
>     0000000000000000
>      > [47273.905803] RBP: ffff9d636fa698c8 R08: 0000000000000002 R09:
>     0000000000020840
>      > [47273.905804] R10: 000e97ef158d1e39 R11: ffff9d636fa601e8 R12:
>     0000000000000001
>      > [47273.905805] R13: ffffffffab0ac698 R14: 0000000000000001 R15:
>     0000000000000000
>      > [47273.905808]  ? cpuidle_enter_state+0x94/0x320
>      > [47273.905812]  do_idle+0x1e4/0x220
>      > [47273.905815]  cpu_startup_entry+0x5f/0x70
>      > [47273.905818]  start_secondary+0x185/0x1a0
>      > [47273.905821]  secondary_startup_64+0xa4/0xb0
> 
>     All instances stripped of the identical stack traces:
>      > [13778.531040] dev features: 0x000860c000114bb3
>      > [13778.531056] skb len=40 data_len=0 gso_size=0 gso_type=0
>     ip_summed=2 csum=0, csum_complete_sw=0, csum_valid=0
>      > [13778.531176] dev features: 0x000860c000114bb3
>      > [13778.531204] skb len=40 data_len=0 gso_size=0 gso_type=0
>     ip_summed=2 csum=0, csum_complete_sw=0, csum_valid=0
>      > [13778.531256] dev features: 0x000860c000114bb3
>      > [13778.531285] skb len=40 data_len=0 gso_size=0 gso_type=0
>     ip_summed=2 csum=0, csum_complete_sw=0, csum_valid=0
>      > [47273.905642] dev features: 0x000860c000114bb3
>      > [47273.905663] skb len=44 data_len=0 gso_size=0 gso_type=0
>     ip_summed=2 csum=0, csum_complete_sw=0, csum_valid=0
> 
>     The setup has been simplified further by removing VLANs and 6to4
>     tunnels. It is now only conntrack and NAT (configured with nftables)
>     on bare Ethernet netdevs.
> 
>     Offloads, ring sizes, etc. are left at defaults;
>     net.ipv4.ip_early_demux is off, and net.core.default_qdisc is
>     fq_codel.
> 
>     Hardware is ConnectX-3 VPI 2xQSFP+ (firmware 2.42.5000) on a quad core
>     Xeon D-1521, passing traffic from port 1 to port 2 on the same card.
>     Last switch to touch the packets is an Arista DCS-7050QX-32 running EOS
>     4.20.2.1F
> 
>     This kernel build contains some other bits and pieces from net.git
>     (mostly things queued for stable) and a couple of backports from
>     net-next (Aaron Lu's pcp page recycling fix, Eric's BQL+mlx4
>     optimizations), but the stack traces are identical to before, so they
>     don't seem to be involved in this.
> 
>     Workload remains nearly exclusively TCP and UDP torrent junk traffic to
>     two machines behind it.
> 
> 
> 
> Please try this patch, we suspect mlx4 support for CHECKSUM_COMPLETE is 
> wrong.
> 
> (Only IPv4 handled, but I suspect a similar fix is needed for IPv6)

Testing it now. It can sometimes take a few days to hit here, so I will
probably have to leave it running for a while.
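
For anyone reading the splats above: every instance reports ip_summed=2,
which is CHECKSUM_COMPLETE in include/linux/skbuff.h, the mode the patch
is meant to address. Below is a minimal Python sketch for decoding such
debug lines. The helper names (parse_skb_debug, describe_ip_summed) are
hypothetical, and the line format is assumed to match the debug output
quoted in this thread; it is an illustration, not part of the debug
patch.

```python
import re

# Values of skb->ip_summed as defined in include/linux/skbuff.h.
IP_SUMMED_NAMES = {
    0: "CHECKSUM_NONE",
    1: "CHECKSUM_UNNECESSARY",
    2: "CHECKSUM_COMPLETE",
    3: "CHECKSUM_PARTIAL",
}

# Matches key=value pairs with decimal values, e.g. "len=44" or "csum=0,".
FIELD_RE = re.compile(r"(\w+)=(\d+)")


def parse_skb_debug(line):
    """Return the key=value integer fields of one debug line as a dict."""
    return {key: int(value) for key, value in FIELD_RE.findall(line)}


def describe_ip_summed(fields):
    """Map the numeric ip_summed field to its kernel constant name."""
    return IP_SUMMED_NAMES.get(fields["ip_summed"], "unknown")


# Sample line taken from the log quoted above (wrapped line rejoined).
sample = ("skb len=44 data_len=0 gso_size=0 gso_type=0 "
          "ip_summed=2 csum=0, csum_complete_sw=0, csum_valid=0")
fields = parse_skb_debug(sample)
print(describe_ip_summed(fields))  # prints "CHECKSUM_COMPLETE"
```

With CHECKSUM_COMPLETE the driver hands the stack a checksum computed by
the hardware over the packet, so a "hw csum failure" here means the
software verification disagreed with the value the NIC supplied.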
