Message-ID: <CAKgT0Udps5K1Eu6vmsEW1ABnYOK3DgTR1bTGBMuTh5nsiQgZ3g@mail.gmail.com>
Date: Mon, 20 Nov 2017 14:56:47 -0800
From: Alexander Duyck <alexander.duyck@...il.com>
To: Sarah Newman <sarah.newman@...puter.org>
Cc: e1000-devel@...ts.sf.net, Netdev <netdev@...r.kernel.org>
Subject: Re: [E1000-devel] Questions about crashes and GRO
On Mon, Nov 20, 2017 at 2:38 PM, Sarah Newman <sarah.newman@...puter.org> wrote:
> On 11/20/2017 08:36 AM, Alexander Duyck wrote:
>> Hi Sarah,
>>
>> I am adding the netdev mailing list as I am not certain this is an
>> i350 specific issue. The traces themselves aren't anything I recognize
>> as an existing issue. From what I can tell it looks like you are
>> running Xen, so would I be correct in assuming you are bridging
>> between VMs? If so, are you using any sort of tunnels on your network,
>> and if so, what type? This information would be useful as we may be looking
>> at a bug in a tunnel offload for GRO.
>
> Yes, there's bridging. The traffic on the physical device is tagged with VLANs and the bridges use untagged traffic. There are no tunnels. I do not
> own the VMs' traffic.
>
> Because I have only seen this on a single server with unique hardware, I think it's most likely related to the hardware or to a particular VM on that
> server.
So I would suspect traffic coming from the VM if anything. The i350 is
a pretty common device. If we were seeing issues specific to it I
would expect we would have more reports than just the one so far.
>>
>> On Fri, Nov 17, 2017 at 3:28 PM, Sarah Newman <sarah.newman@...puter.org> wrote:
>>> Hi,
>>>
>>> I have a Supermicro X10 with two I350s that has crashed twice now under v4.9.39 within the last 3 weeks, with no crashes before v4.9.39:
>>
>> What was the last kernel you tested before v4.9.39? Just wondering as
>> it will help to rule out certain patches as possibly being the issue.
>
> 4.9.31.
>
> If the problem is related to a particular VM, then I don't think the last known good kernel is necessarily pertinent, as the problematic traffic could
> have started at any time.
>
>>> I see in the release notes https://downloadmirror.intel.com/22919/eng/README.txt " Do Not Use LRO When Routing Packets."
>>>
>>> We are bridging traffic, not routing, and the crashes are in the GRO code.
>>>
>>> Is it possible there are problems with GRO for bridging in the igb driver now? If I disable GRO can I have some confidence it will fix the issue?
>>
>> As far as LRO not being used when routing, just so you know LRO and
>> GRO are two very different things. One of the issues with LRO is that
>> it wasn't reversible in some cases and so could lead to the packet
>> being changed if they were rerouted. With GRO that shouldn't be the
>> case as we should be able to get back out the original packets that
>> were put into a frame. So there shouldn't be any issues using GRO with
>> bridging or routing.
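
To make the reversibility point concrete, here is a toy userspace sketch
(not kernel code). It assumes an aggregate only needs to remember the
per-segment size and the segment count to hand back the original packets,
loosely analogous to gso_size. All names (toy_gro_pkt, toy_gro_merge,
toy_gro_split) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_PAYLOAD 4096

/* Toy aggregate: merged payload plus the metadata needed to undo the merge. */
struct toy_gro_pkt {
    unsigned char data[TOY_MAX_PAYLOAD];
    size_t len;      /* total merged payload length */
    size_t seg_size; /* size of each original segment (assumed analogue of gso_size) */
    size_t segs;     /* number of merged segments */
};

/* Merge one segment; returns 0 on success, -1 if it cannot join this aggregate. */
static int toy_gro_merge(struct toy_gro_pkt *p, const unsigned char *seg, size_t n)
{
    assert(p != NULL && seg != NULL);
    if (n == 0 || p->len + n > TOY_MAX_PAYLOAD)
        return -1;
    if (p->segs == 0)
        p->seg_size = n;             /* first segment fixes the segment size */
    else if (p->len % p->seg_size != 0 || n > p->seg_size)
        return -1;                   /* a short segment already closed the aggregate */
    memcpy(p->data + p->len, seg, n);
    p->len += n;
    p->segs++;
    return 0;
}

/* Recover original segment idx; returns its length, or 0 if idx is out of range. */
static size_t toy_gro_split(const struct toy_gro_pkt *p, size_t idx,
                            const unsigned char **out)
{
    size_t off;

    assert(p != NULL && out != NULL);
    if (idx >= p->segs)
        return 0;
    off = idx * p->seg_size;
    *out = p->data + off;
    return (p->len - off < p->seg_size) ? p->len - off : p->seg_size;
}
```

Because only equal-sized segments (plus one optional short tail) are
merged, splitting at multiples of seg_size returns exactly what was put
in. An aggregation scheme that does not keep that information cannot
promise the same.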
>
> In some very old release notes for the ixgbe (https://downloadmirror.intel.com/22919/eng/README.txt), it said to disable GRO for bridging/routing, and it
> wasn't clear whether it was specific to that driver. I didn't originally notice how old the release notes were or that the notice was removed in newer
> versions; I apologize.
>
>>> First crash:
>>>
>>> [4083386.299221] ------------[ cut here ]------------
>>> [4083386.299358] WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 inet_gro_complete+0xbb/0xd0
>>> [4083386.299520] Modules linked in: sb_edac edac_core 8021q mrp garp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev ip6table_filter
>>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc
>>> xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2 ebt_mark ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw
>>> br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456 async_raid6_recov async_pq
>>> async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c joydev shpchp i2c_i801 i2c_smbus mei_me mei lpc_ich fjes ipmi_si ipmi_msghandler
>>> acpi_power_meter ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core mlx4_core mpt3sas
>>> scsi_transport_sas raid_class wmi ast ttm
>>> [4083386.300888] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.39 #1
>>> [4083386.301002] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 2.0a 09/16/2016
>>> [4083386.301109] ffff880306603d90 ffffffff813f5935 0000000000000000 0000000000000000
>>> [4083386.301221] ffff880306603dd0 ffffffff810a7e01 000005c18174578a ffff8802f94a9a00
>>> [4083386.301333] ffff8802f0824450 0000000000000000 0000000000000040 0000000000000040
>>> [4083386.301445] Call Trace:
>>> [4083386.301483] <IRQ> [4083386.301519] dump_stack+0x63/0x8e
>>> [4083386.301596] __warn+0xd1/0xf0
>>> [4083386.301665] warn_slowpath_null+0x1d/0x20
>>> [4083386.301747] inet_gro_complete+0xbb/0xd0
>>> [4083386.301830] napi_gro_complete+0x73/0xa0
>>> [4083386.301911] napi_gro_flush+0x5f/0x80
>>> [4083386.301988] napi_complete_done+0x6a/0xb0
>>> [4083386.302075] igb_poll+0x38d/0x720 [igb]
>>> [4083386.302156] ? igb_msix_ring+0x2e/0x40 [igb]
>>> [4083386.302255] ? __handle_irq_event_percpu+0x4b/0x1a0
>>> [4083386.302349] net_rx_action+0x158/0x360
>>> [4083386.302430] __do_softirq+0xd1/0x283
>>> [4083386.302507] irq_exit+0xe9/0x100
>>> [4083386.302580] xen_evtchn_do_upcall+0x35/0x50
>>> [4083386.302665] xen_do_hypervisor_callback+0x1e/0x40
>>> [4083386.302754] <EOI> [4083386.302787] ? xen_hypercall_sched_op+0xa/0x20
>>> [4083386.302876] ? xen_hypercall_sched_op+0xa/0x20
>>> [4083386.302965] ? xen_safe_halt+0x10/0x20
>>> [4083386.303043] ? default_idle+0x1e/0xd0
>>> [4083386.303122] ? arch_cpu_idle+0xf/0x20
>>> [4083386.303200] ? default_idle_call+0x2c/0x40
>>> [4083386.303284] ? cpu_startup_entry+0x1ac/0x240
>>> [4083386.303370] ? rest_init+0x77/0x80
>>> [4083386.303462] ? start_kernel+0x4a7/0x4b4
>>> [4083386.303568] ? set_init_arg+0x55/0x55
>>> [4083386.303670] ? x86_64_start_reservations+0x24/0x26
>>> [4083386.303776] ? xen_start_kernel+0x555/0x561
>>> [4083386.303873] ---[ end trace 8294f59ced689507 ]---
I think this first trace is more important than the one below.
Specifically, the warning in inet_gro_complete points at a GRO
assembly problem: there were either no GRO ops at all or no
gro_complete function for whatever protocol was found in the packet.
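
As a rough illustration of that condition, here is a simplified userspace
model of the check (a sketch, not the kernel source). The table,
structure, and function names are hypothetical, and the protocol 47 entry
is invented purely to show the "ops present but no gro_complete" case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for the kernel's per-protocol offload callbacks. */
struct toy_offload {
    int (*gro_complete)(int nhoff);
};

static int toy_tcp_gro_complete(int nhoff)
{
    (void)nhoff;
    return 0;
}

static const struct toy_offload toy_tcp = { .gro_complete = toy_tcp_gro_complete };
static const struct toy_offload toy_no_complete = { .gro_complete = NULL };

/* Indexed by IP protocol number; unlisted slots stay NULL (no ops registered). */
static const struct toy_offload *toy_offloads[256] = {
    [6]  = &toy_tcp,         /* IPPROTO_TCP */
    [47] = &toy_no_complete, /* hypothetical entry: ops but no gro_complete */
};

/* Returns the callback's result, or -1 when the warning condition is hit. */
static int toy_inet_gro_complete(unsigned char proto, int nhoff)
{
    const struct toy_offload *ops = toy_offloads[proto];

    assert(nhoff >= 0);
    if (!ops || !ops->gro_complete) {
        fprintf(stderr, "WARNING: no gro_complete for protocol %u\n",
                (unsigned)proto);
        return -1;
    }
    /* Assumption: skip a minimal 20-byte IPv4 header (no IP options). */
    return ops->gro_complete(nhoff + 20);
}
```

In this model a held packet whose protocol field no longer maps to a
usable handler trips the warning, which is why corrupted or unexpected
header contents are worth suspecting.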
>>> [4083386.303958] general protection fault: 0000 [#1] SMP
>>> [4083386.304041] Modules linked in: sb_edac edac_core 8021q mrp garp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev ip6table_filter
>>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc xenfs xen_privcmd xen_evtchn
>>> xen_blkback tun sch_htb fuse ext2 ebt_mark ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw br_netfilter bridge stp llc iTCO_wdt
>>> iTCO_vendor_support pcspkr raid456 async_raid6_recov async_pq async_xor xor async_memcpy
>>> async_tx raid10 raid6_pq libcrc32c joydev shpchp i2c_i801 i2c_smbus mei_me mei lpc_ich fjes ipmi_si ipmi_msghandler acpi_power_meter ioatdma igb dca
>>> raid1 mlx4_en mlx4_ib ib_core ptp pps_core mlx4_core mpt3sas scsi_transport_sas raid_class
>>> wmi ast ttm
>>> [4083386.305179] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.9.39 #1
>>> [4083386.305307] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 2.0a 09/16/2016
>>> [4083386.305414] task: ffffffff81e0e540 task.stack: ffffffff81e00000
>>> [4083386.305498] RIP: e030: skb_release_data+0x73/0xf0
>>> [4083386.305617] RSP: e02b:ffff880306603d90 EFLAGS: 00010206
>>> [4083386.305692] RAX: 0000000000000030 RBX: f5b36db76bd162c7 RCX: ffffffff81e60048
>>> [4083386.305790] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8802f94a9a00
>>> [4083386.305887] RBP: ffff880306603db0 R08: 0000000000004277 R09: 0000000000000000
>>> [4083386.305985] R10: 0000000000000005 R11: 0000000000000002 R12: 0000000000000000
>>> [4083386.306083] R13: ffff8802f94a9a00 R14: ffff88032f527740 R15: 0000000000000040
>>> [4083386.306186] FS: 0000000000000000(0000) GS:ffff880306600000(0000) knlGS:0000000000000000
>>> [4083386.306296] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [4083386.306407] CR2: 0000000001692ed8 CR3: 000000022b3c9000 CR4: 0000000000042660
>>> [4083386.306505] Stack:
>>> [4083386.306537] ffff8802f94a9a00 ffff8802f94a9a00 ffffffff8175ac3e 0000000000000040
>>> [4083386.306649] ffff880306603dc8 ffffffff81745764 ffff8802f94a9a00 ffff880306603df0
>>> [4083386.306762] ffffffff817457c2 ffff8802f94a9a00 ffff8802f0824450 0000000000000000
>>> [4083386.306874] Call Trace:
>>> [4083386.306911] <IRQ> [4083386.306944] ? napi_gro_complete+0x5e/0xa0
>>> [4083386.307038] skb_release_all+0x24/0x30
>>> [4083386.307133] kfree_skb+0x32/0x90
>>> [4083386.307206] napi_gro_complete+0x5e/0xa0
>>> [4083386.307287] napi_gro_flush+0x5f/0x80
>>> [4083386.307365] napi_complete_done+0x6a/0xb0
>>> [4083386.307449] igb_poll+0x38d/0x720 [igb]
>>> [4083386.307530] ? igb_msix_ring+0x2e/0x40 [igb]
>>> [4083386.307617] ? __handle_irq_event_percpu+0x4b/0x1a0
>>> [4083386.307720] net_rx_action+0x158/0x360
>>> [4083386.307800] __do_softirq+0xd1/0x283
>>> [4083386.307877] irq_exit+0xe9/0x100
>>> [4083386.307949] xen_evtchn_do_upcall+0x35/0x50
>>> [4083386.308034] xen_do_hypervisor_callback+0x1e/0x40
>>> [4083386.308124] <EOI> [4083386.308156] ? xen_hypercall_sched_op+0xa/0x20
>>> [4083386.308246] ? xen_hypercall_sched_op+0xa/0x20
>>> [4083386.308334] ? xen_safe_halt+0x10/0x20
>>> [4083386.308413] ? default_idle+0x1e/0xd0
>>> [4083386.308491] ? arch_cpu_idle+0xf/0x20
>>> [4083386.308568] ? default_idle_call+0x2c/0x40
>>> [4083386.308651] ? cpu_startup_entry+0x1ac/0x240
>>> [4083386.308737] ? rest_init+0x77/0x80
>>> [4083386.308811] ? start_kernel+0x4a7/0x4b4
>>> [4083386.308890] ? set_init_arg+0x55/0x55
>>> [4083386.308968] ? x86_64_start_reservations+0x24/0x26
>>> [4083386.309060] ? xen_start_kernel+0x555/0x561
>>> [4083386.309144] Code: f0 41 0f c1 46 20 39 c2 74 09 5b 41 5c 41 5d 41 5e 5d c3 45 31 e4 41 80 3e 00 74 39 49 63 c4 48 83 c0 03 48 c1 e0 04 49 8b 1c
>>> 06 <48> 8b 43 20 a8 01 75 6f f0 ff 4b 1c 74 55 48 8b 03 48 c1 e8 33
>>> [4083386.309571] RIP skb_release_data+0x73/0xf0
>>> [4083386.309658] RSP <ffff880306603d90>
>>> [4083386.313000] ---[ end trace 8294f59ced689508 ]---
>>> [4083386.389667] Kernel panic - not syncing: Fatal exception in interrupt
>>> [4083386.389791] Kernel Offset: disabled
>>> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
>
> Output of addr2line for address of skb_release_data+0x73 is
>
> __read_once_size
> include/linux/compiler.h:243 (discriminator 2)
> compound_head
> include/linux/page-flags.h:143 (discriminator 2)
> put_page
> include/linux/mm.h:777 (discriminator 2)
> __skb_frag_unref
> include/linux/skbuff.h:2592 (discriminator 2)
> skb_release_data
> net/core/skbuff.c:594 (discriminator 2)
>
> skbuff.c:594 is:
>
> __skb_frag_unref(&shinfo->frags[i]);
>
> Actual assembly is:
> <+91>: xor %r12d,%r12d
> <+94>: cmpb $0x0,(%r14)
> <+98>: je <skb_release_data+157>
> <+100>: movslq %r12d,%rax
> <+103>: add $0x3,%rax
> <+107>: shl $0x4,%rax
> <+111>: mov (%r14,%rax,1),%rbx
> <+115>: mov 0x20(%rbx),%rax <------ this is skb_release_data+0x73
> <+119>: test $0x1,%al
> <+121>: jne <skb_release_data+234>
>
> rbx is f5b36db76bd162c7, which seems like garbage. I don't know if this looks like any particular garbage.
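
Two small checks on the numbers above, written as a sketch. The helper
names are hypothetical, and 48-bit virtual addressing is assumed. The
first helper reproduces the index arithmetic from the quoted assembly:
for i = 0 it yields 0x30, matching RAX in the register dump. The second
tests whether an address is canonical. A non-canonical address raises a
general protection fault rather than a page fault on x86-64, which is
consistent with the oops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Byte offset used by the quoted assembly to load the page pointer of
 * frags[i]:  movslq %r12d,%rax ; add $0x3,%rax ; shl $0x4,%rax
 */
static size_t frag_page_offset(int i)
{
    assert(i >= 0);
    return ((size_t)i + 3) << 4;
}

/*
 * True if addr is canonical under 48-bit virtual addressing (assumed):
 * bits 63..47 must be all zeros or all ones.
 */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47; /* the upper 17 bits */

    return top == 0 || top == 0x1ffff;
}
```

With these, is_canonical_48(0xf5b36db76bd162c7) is false, while the skb
pointer in RDI (0xffff8802f94a9a00) is canonical. So rbx was not a
plausible kernel pointer at all, not merely a stale one.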
>
>>> Second crash:
>>>
>>> [1838269.012349] general protection fault: 0000 [#1] SMP
>>> [1838269.012452] Modules linked in: ebtable_nat sb_edac edac_core 8021q mrp garp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
>>> ip6table_filter ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc xenfs xen_privcmd
>>> xen_evtchn xen_blkback tun sch_htb fuse ext2 ebt_mark ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw br_netfilter bridge stp
>>> llc iTCO_wdt iTCO_vendor_support pcspkr raid456 async_raid6_recov async_pq async_xor xor
>>> async_memcpy async_tx raid10 raid6_pq libcrc32c joydev i2c_i801 i2c_smbus lpc_ich shpchp mei_me mei fjes ipmi_si ipmi_msghandler acpi_power_meter
>>> ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core mlx4_core mpt3sas scsi_transport_sas
>>> raid_class wmi ast ttm
>>> [1838269.013521] CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.9.39 #1
>>> [1838269.013637] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 2.0a 09/16/2016
>>> [1838269.013743] task: ffff88030008c4c0 task.stack: ffffc90041978000
>>> [1838269.013826] RIP: e030: memcpy_erms+0x6/0x10
>>> [1838269.013952] RSP: e02b:ffffc9004197bac0 EFLAGS: 00010202
>>> [1838269.014026] RAX: ffff88032fcafe16 RBX: 0000000000000004 RCX: 0000000000000004
>>> [1838269.014124] RDX: 0000000000000004 RSI: 62a16ddedc6dbcb3 RDI: ffff88032fcafe16
>>> [1838269.014222] RBP: ffffc9004197bb20 R08: 0000000000000004 R09: 0000000000000004
>>> [1838269.014320] R10: ffff88026ae89500 R11: 0000000044639632 R12: 0000000000000048
>>> [1838269.014417] R13: 0000000000000000 R14: 0000000044639632 R15: 0000000000000048
>>> [1838269.014519] FS: 0000000000000000(0000) GS:ffff880306640000(0000) knlGS:ffff880306640000
>>> [1838269.014629] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [1838269.014709] CR2: ffffffffff600400 CR3: 0000000051939000 CR4: 0000000000042660
>>> [1838269.014808] Stack:
>>> [1838269.014840] ffffffff81744c17 ffff88026ae89500 0000000044639632 ffff88030008c4c0
>>> [1838269.014952] ffffffff00000004 0000000000000004 ffff88032fcafe16 ffff88026ae89500
>>> [1838269.015064] 0000000000000004 0000000000000004 000000000000004c 0000000000000028
>>> [1838269.015176] Call Trace:
>>> [1838269.015217] ? skb_copy_bits+0x137/0x2c0
>>> [1838269.015299] __pskb_pull_tail+0x7f/0x3b0
>>> [1838269.015382] tcp_gro_receive+0x2c5/0x300
>>> [1838269.015465] tcp6_gro_receive+0x13a/0x1a0
>>> [1838269.015547] ipv6_gro_receive+0x1c6/0x380
>>> [1838269.015630] dev_gro_receive+0x269/0x3b0
>>> [1838269.015712] napi_gro_receive+0x38/0xf0
>>> [1838269.015796] igb_clean_rx_irq+0x38e/0x690 [igb]
>>> [1838269.015886] igb_poll+0x362/0x720 [igb]
>>> [1838269.015968] ? dequeue_entity+0x26e/0xa90
>>> [1838269.016051] ? xen_mc_flush+0x17b/0x1b0
>>> [1838269.016131] net_rx_action+0x158/0x360
>>> [1838269.016212] __do_softirq+0xd1/0x283
>>> [1838269.016290] ? sort_range+0x30/0x30
>>> [1838269.016366] run_ksoftirqd+0x29/0x50
>>> [1838269.016443] smpboot_thread_fn+0x110/0x160
>>> [1838269.016525] kthread+0xd7/0xf0
>>> [1838269.016595] ? kthread_park+0x60/0x60
>>> [1838269.016673] ret_from_fork+0x25/0x30
>>> [1838269.016758] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89
>>> d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
>>> [1838269.017183] RIP memcpy_erms+0x6/0x10
>>> [1838269.017264] RSP <ffffc9004197bac0>
>>> [1838269.020618] ---[ end trace 3506ce1d7200529a ]---
>>> [1838269.079891] Kernel panic - not syncing: Fatal exception in interrupt
>>> [1838269.080014] Kernel Offset: disabled
>>> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
>
> --Sarah