Message-ID: <c05752ea-0e5f-03df-2a3a-a5110b86cbd9@gmail.com>
Date: Wed, 30 Jan 2019 09:00:48 -0800
From: Eric Dumazet <eric.dumazet@...il.com>
To: Ivan Babrou <ivan@...udflare.com>, netdev@...r.kernel.org
Cc: "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Ignat Korchagin <ignat@...udflare.com>,
Shawn Bohrer <sbohrer@...udflare.com>,
Jakub Sitnicki <jakub@...udflare.com>
Subject: Re: Crashes in skb clone/allocation in 4.19.18
On 01/30/2019 08:51 AM, Ivan Babrou wrote:
> Hey,
>
> We've upgraded some machines from 4.19.13 to 4.19.18 and some of them
> crashed with the following:
>
> [ 2313.192006] general protection fault: 0000 [#1] SMP PTI
> [ 2313.205924] CPU: 32 PID: 65437 Comm: nginx-fl Tainted: G
> O 4.19.18-cloudflare-2019.1.8 #2019.1.8
> [ 2313.224973] Hardware name: Quanta Computer Inc. QuantaPlex
> T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
> [ 2313.243400] RIP: 0010:kmem_cache_alloc_node+0x178/0x1f0
> [ 2313.257768] Code: 89 fa 4c 89 f6 e8 68 40 a1 00 4c 8b 55 00 58 4d
> 85 d2 75 d6 e9 6f ff ff ff 41 8b 59 20 48 8d 4a 01 4c 89 f8 49 8b 39
> 4c 01 fb <48> 33 1b 49 33 99 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0
> 0f 84
> [ 2313.295550] RSP: 0000:ffff94457f903b48 EFLAGS: 00010202
> [ 2313.310352] RAX: 08b82daf1f57da0e RBX: 08b82daf1f57da0e RCX: 00000000005ff72d
> [ 2313.327189] RDX: 00000000005ff72c RSI: 0000000000480220 RDI: 0000000000026e40
> [ 2313.344029] RBP: ffff94457f04d680 R08: ffff94457f926e40 R09: ffff94457f04d680
> [ 2313.360912] R10: 000004ce652a0026 R11: 0000000000000000 R12: 0000000000480220
> [ 2313.377857] R13: 00000000ffffffff R14: ffffffffb1ab3ab7 R15: 08b82daf1f57da0e
> [ 2313.394820] FS: 00007fdea755c780(0000) GS:ffff94457f900000(0000)
> knlGS:0000000000000000
> [ 2313.412887] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2313.428581] CR2: 000055acc3cf517b CR3: 000000201b1ea003 CR4: 00000000003606e0
> [ 2313.445753] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2313.462843] perf: interrupt took too long (8028 > 7291), lowering
> kernel.perf_event_max_sample_rate to 24000
> [ 2313.462867] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 2313.500216] Call Trace:
> [ 2313.512833] <IRQ>
> [ 2313.524748] __alloc_skb+0x57/0x1d0
> [ 2313.537934] __tcp_send_ack.part.48+0x2f/0x100
> [ 2313.551845] tcp_rcv_established+0x550/0x640
> [ 2313.565394] tcp_v4_do_rcv+0x12a/0x1e0
> [ 2313.578322] tcp_v4_rcv+0xadc/0xbd0
> [ 2313.590993] ip_local_deliver_finish+0x5d/0x1d0
> [ 2313.604727] ip_local_deliver+0x6b/0xe0
> [ 2313.617782] ? ip_sublist_rcv+0x200/0x200
> [ 2313.630415] perf: interrupt took too long (10040 > 10035), lowering
> kernel.perf_event_max_sample_rate to 19000
> [ 2313.630948] ip_rcv+0x52/0xd0
> [ 2313.662850] ? ip_rcv_core.isra.22+0x2b0/0x2b0
> [ 2313.662857] __netif_receive_skb_one_core+0x52/0x70
> [ 2313.690860] netif_receive_skb_internal+0x34/0xe0
> [ 2313.690883] efx_rx_deliver+0x11a/0x180 [sfc]
> [ 2313.717780] ? __efx_rx_packet+0x1ef/0x730 [sfc]
> [ 2313.717786] ? __queue_work+0x103/0x3e0
> [ 2313.743118] ? efx_poll+0x35e/0x460 [sfc]
> [ 2313.743125] ? net_rx_action+0x138/0x360
> [ 2313.767356] ? __do_softirq+0xd8/0x2d2
> [ 2313.767362] ? irq_exit+0xb4/0xc0
> [ 2313.790680] ? do_IRQ+0x85/0xd0
> [ 2313.790688] ? common_interrupt+0xf/0xf
> [ 2313.790694] </IRQ>
> [ 2313.823837] Modules linked in: tun xt_connlimit nf_conncount xt_bpf
> xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt
> algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6
> ip6table_mangle ip6table_security ip6table_raw xt_nat iptable_nat
> nf_nat_ipv4 nf_nat xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark
> iptable_mangle xt_owner xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6
> iptable_raw ip6table_filter ip6_tables nfnetlink_log xt_NFLOG
> xt_tcpudp xt_comment xt_conntrack nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 xt_mark xt_multiport xt_set iptable_filter bpfilter
> ip_set_hash_netport ip_set_hash_net ip_set_hash_ip ip_set nfnetlink
> 8021q garp mrp stp llc sb_edac x86_pkg_temp_thermal kvm_intel kvm
> irqbypass crc32_pclmul crc32c_intel pcbc aesni_intel aes_x86_64
> ipmi_ssif crypto_simd cryptd
> [ 2313.952153] sfc(O) glue_helper igb i2c_algo_bit ipmi_si mdio dca
> ipmi_devintf ipmi_msghandler efivarfs ip_tables x_tables
> [ 2313.952238] ---[ end trace 477d8e3081c605f6 ]---
>
> Some nodes also crashed in skb_clone, rather than __alloc_skb:
>
> [ 3810.686137] general protection fault: 0000 [#1] SMP PTI
> [ 3810.694579] CPU: 64 PID: 69338 Comm: nginx-fl Not tainted
> 4.19.18-cloudflare-2019.1.8 #2019.1.8
> [ 3810.706589] Hardware name: Quanta Cloud Technology Inc. QuantaPlex
> T42S-2U(LBG-4) ^S5SZ090028/T42S-2U MB (Lewisburg-4), BIOS 3A11.Q10
> 06/29/2018
> [ 3810.726475] RIP: 0010:kmem_cache_alloc+0x89/0x1c0
> [ 3810.734701] Code: 82 72 49 83 78 10 00 4d 8b 30 0f 84 0e 01 00 00
> 4d 85 f6 0f 84 05 01 00 00 41 8b 5f 20 48 8d 4a 01 4c 89 f0 49 8b 3f
> 4c 01 f3 <48> 33 1b 49 33 9f 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0
> 74 b2
> [ 3810.761088] RSP: 0000:ffff99723fe03730 EFLAGS: 00010282
> [ 3810.770132] RAX: f0382d8aebf1ae68 RBX: f0382d8aebf1ae68 RCX: 0000000001cb61cf
> [ 3810.781105] RDX: 0000000001cb61ce RSI: 0000000000480020 RDI: 0000000000027550
> [ 3810.792012] RBP: ffff99723f19d500 R08: ffff99723fe27550 R09: 00000000000005dc
> [ 3810.802820] R10: ffff9992227c0000 R11: 0000000000004000 R12: 0000000000480020
> [ 3810.813589] R13: ffffffff8dcb5f7d R14: f0382d8aebf1ae68 R15: ffff99723f19d500
> [ 3810.824382] FS: 00007f2a8863c780(0000) GS:ffff99723fe00000(0000)
> knlGS:0000000000000000
> [ 3810.836189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3810.845662] CR2: 000055820762eecd CR3: 00000019eb850003 CR4: 00000000007606e0
> [ 3810.856567] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 3810.867600] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 3810.878554] PKRU: 55555554
> [ 3810.884787] Call Trace:
> [ 3810.890601] <IRQ>
> [ 3810.896116] skb_clone+0x4d/0xb0
> [ 3810.902712] dev_queue_xmit_nit+0xd9/0x260
> [ 3810.910181] dev_hard_start_xmit+0x69/0x1f0
> [ 3810.917784] __dev_queue_xmit+0x6f7/0x8a0
> [ 3810.925172] ? eth_header+0x26/0xc0
> [ 3810.932053] ip_finish_output2+0x193/0x400
> [ 3810.939670] ? ip_finish_output+0x139/0x270
> [ 3810.947241] ip_output+0x6c/0xe0
> [ 3810.953844] ? ip_append_data.part.51+0xc0/0xc0
> [ 3810.961802] __tcp_transmit_skb+0x511/0xaa0
> [ 3810.969420] __tcp_retransmit_skb+0x19c/0x7c0
> [ 3810.977209] ? tcp_current_mss+0x57/0xa0
> [ 3810.984493] tcp_retransmit_skb+0x12/0x80
> [ 3810.991894] tcp_xmit_retransmit_queue.part.50+0x147/0x240
> [ 3811.000754] tcp_ack+0x9c4/0x11b0
> [ 3811.007416] tcp_rcv_established+0x190/0x640
> [ 3811.015065] ? tcp_v4_inbound_md5_hash+0x69/0x160
> [ 3811.023106] tcp_v4_do_rcv+0x12a/0x1e0
> [ 3811.030190] tcp_v4_rcv+0xadc/0xbd0
> [ 3811.037009] ip_local_deliver_finish+0x5d/0x1d0
> [ 3811.044859] ip_local_deliver+0x6b/0xe0
> [ 3811.051999] ? ip_sublist_rcv+0x200/0x200
> [ 3811.059325] ip_rcv+0x52/0xd0
> [ 3811.065595] ? ip_rcv_core.isra.22+0x2b0/0x2b0
> [ 3811.073361] __netif_receive_skb_one_core+0x52/0x70
> [ 3811.081621] netif_receive_skb_internal+0x34/0xe0
> [ 3811.089652] napi_gro_receive+0xba/0xe0
> [ 3811.096969] mlx5e_handle_rx_cqe+0x1eb/0x530 [mlx5_core]
> [ 3811.105545] ? skb_release_head_state+0x5c/0xb0
> [ 3811.113447] mlx5e_poll_rx_cq+0xc8/0x910 [mlx5_core]
> [ 3811.121652] mlx5e_napi_poll+0xb1/0xc60 [mlx5_core]
> [ 3811.129574] net_rx_action+0x138/0x360
> [ 3811.136266] __do_softirq+0xd8/0x2d2
> [ 3811.142679] irq_exit+0xb4/0xc0
> [ 3811.148578] do_IRQ+0x85/0xd0
> [ 3811.154254] common_interrupt+0xf/0xf
> [ 3811.160585] </IRQ>
> [ 3811.165319] RIP: 0033:0x5581e1551ca0
> [ 3811.171546] Code: e8 10 41 ff 24 ee 81 7c ca 04 ff ff fe ff 0f 83
> 87 1c 00 00 8b 03 0f b6 cc 0f b6 e8 83 c3 04 c1 e8 10 41 ff 24 ee 48
> 8b 2c c2 <48> 89 2c ca 8b 03 0f b6 cc 0f b6 e8 83 c3 04 c1 e8 10 41 ff
> 24 ee
> [ 3811.195925] RSP: 002b:00007ffdd615ebc0 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffffde
> [ 3811.206319] RAX: 0000000000000000 RBX: 00000000406c9058 RCX: 000000000000000b
> [ 3811.216321] RDX: 000000004099cdc8 RSI: fffffffb40c07eb0 RDI: 000000004183d738
> [ 3811.226277] RBP: fffffff444c8c5c0 R08: 000000004099cdc8 R09: 00000000425ce3d8
> [ 3811.236340] R10: 0000000044c8c5c0 R11: 000000004139cbb0 R12: 0000000000000000
> [ 3811.246349] R13: 00005581ead6a9e0 R14: 000000004166afe8 R15: 00000000406c90f8
> [ 3811.256320] Modules linked in: tun xt_connlimit nf_conncount xt_bpf
> xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt
> algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6
> ip6table_mangle ip6table_security ip6table_raw ip6table_filter
> ip6_tables xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TPROXY
> nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark iptable_mangle xt_owner
> xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6 iptable_raw
> nfnetlink_log xt_NFLOG xt_tcpudp xt_comment xt_conntrack nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 xt_mark xt_multiport xt_set
> iptable_filter bpfilter ip_set_hash_netport ip_set_hash_net
> ip_set_hash_ip ip_set nfnetlink 8021q garp mrp stp llc skx_edac
> x86_pkg_temp_thermal kvm_intel kvm irqbypass ipmi_ssif crc32_pclmul
> crc32c_intel pcbc aesni_intel aes_x86_64 crypto_simd mlx5_core
> [ 3811.351698] cryptd xhci_pci tpm_crb mlxfw glue_helper ioatdma
> devlink ipmi_si xhci_hcd dca ipmi_devintf ipmi_msghandler tpm_tis
> tpm_tis_core tpm efivarfs ip_tables x_tables
> [ 3811.375161] ---[ end trace 1a7795bb39a63cf7 ]---
>
> Is this known? Could it be related to this commit:
>
> * https://github.com/torvalds/linux/commit/598e57e029290be3e7f8f87ff908091a5a22ed2f
>
I do not believe this commit could explain these crashes.
Given there are about 580 commits between 4.19.13 and 4.19.18, a bisection might be the easiest way
to find the problem.
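
To give a rough idea of the cost, here is a minimal sketch of the search that a tool
like git bisect automates. It assumes a linear history of about 580 commits, that the
oldest one is good and the newest one is bad, and that each build can be classified
reliably. The function name, the integer stand-ins for commits, and the culprit
position are hypothetical and only serve the illustration.

```python
def bisect_first_bad(commits, is_bad):
    """Return (index of the first bad commit, number of builds tested).

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is bad as well.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1                      # one kernel build + boot + soak test
        if is_bad(commits[mid]):
            bad = mid                   # regression is at mid or earlier
        else:
            good = mid                  # regression is after mid
    return bad, tests


# Integers stand in for commits: 0 = 4.19.13 (good), 580 = 4.19.18 (bad).
commits = list(range(581))
culprit = 417                           # hypothetical position of the bad commit
first_bad, tests = bisect_first_bad(commits, lambda c: c >= culprit)
print(f"first bad commit index: {first_bad}, builds tested: {tests}")
```

With roughly 580 candidates the interval halves on every step, so no more than 10
kernels need to be built and tested. Since the crashes are intermittent, each step
would have to run long enough on enough machines before a kernel is called good.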
Thanks.