Message-ID: <CAKgT0Udps5K1Eu6vmsEW1ABnYOK3DgTR1bTGBMuTh5nsiQgZ3g@mail.gmail.com>
Date:   Mon, 20 Nov 2017 14:56:47 -0800
From:   Alexander Duyck <alexander.duyck@...il.com>
To:     Sarah Newman <sarah.newman@...puter.org>
Cc:     e1000-devel@...ts.sf.net, Netdev <netdev@...r.kernel.org>
Subject: Re: [E1000-devel] Questions about crashes and GRO

On Mon, Nov 20, 2017 at 2:38 PM, Sarah Newman <sarah.newman@...puter.org> wrote:
> On 11/20/2017 08:36 AM, Alexander Duyck wrote:
>> Hi Sarah,
>>
>> I am adding the netdev mailing list as I am not certain this is an
>> i350-specific issue. The traces themselves aren't anything I recognize
>> as an existing issue. From what I can tell it looks like you are
>> running Xen, so would I be correct in assuming you are bridging
>> between VMs? If so, are you using any sort of tunnels on your network,
>> and if so, what type? This information would be useful, as we may be
>> looking at a bug in a tunnel offload for GRO.
>
> Yes, there's bridging. The traffic on the physical device is tagged with VLANs and the bridges use untagged traffic. There are no tunnels. I do not
> own the VMs' traffic.
>
> Because I have only seen this on a single server with unique hardware, I think it's most likely related to the hardware or to a particular VM on that
> server.

So I would suspect traffic coming from the VM, if anything. The i350 is
a pretty common device; if we were seeing issues specific to it, I would
expect more reports than just this one so far.

>>
>> On Fri, Nov 17, 2017 at 3:28 PM, Sarah Newman <sarah.newman@...puter.org> wrote:
>>> Hi,
>>>
>>> I have an X10 Supermicro with two I350s that has crashed twice now under v4.9.39 within the last 3 weeks, with no crashes before v4.9.39:
>>
>> What was the last kernel you tested before v4.9.39? Just wondering as
>> it will help to rule out certain patches as possibly being the issue.
>
> 4.9.31.
>
> If the problem is related to a particular VM, then I don't think the last known good kernel is necessarily pertinent, as the problematic traffic could
> have started at any time.
>
>>> I see in the release notes https://downloadmirror.intel.com/22919/eng/README.txt " Do Not Use LRO When Routing Packets."
>>>
>>> We are bridging traffic, not routing, and the crashes are in the GRO code.
>>>
>>> Is it possible there are problems with GRO for bridging in the igb driver now? If I disable GRO can I have some confidence it will fix the issue?
>>
>> As far as LRO not being used when routing goes, just so you know, LRO
>> and GRO are two very different things. One of the issues with LRO is
>> that it wasn't reversible in some cases and so could lead to packets
>> being changed if they were rerouted. With GRO that shouldn't be the
>> case, as we should be able to get back out the original packets that
>> were put into a frame. So there shouldn't be any issues using GRO with
>> bridging or routing.
>
> In some very old release notes for the ixgbe driver, https://downloadmirror.intel.com/22919/eng/README.txt, it said to disable GRO for bridging/routing, and it
> wasn't clear whether that advice was specific to the driver. I didn't originally notice how old the release notes were, or that the notice was removed in newer
> versions; I apologize.
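
As an aside, if it helps to rule GRO in or out while this gets sorted
out, it can be toggled at runtime without a rebuild. The usual way is
just "ethtool -K <ifname> gro off"; purely as a sketch, the same thing
through the ethtool ioctl would look roughly like the following (the
interface name is only a placeholder, and ETHTOOL_SGRO is the old
per-flag command rather than the newer features interface):

    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    /* Turn GRO off on one interface; same effect as "ethtool -K <ifname> gro off". */
    int main(int argc, char **argv)
    {
            struct ethtool_value ev = { .cmd = ETHTOOL_SGRO, .data = 0 };
            struct ifreq ifr;
            int fd;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <ifname>\n", argv[0]);
                    return 1;
            }

            memset(&ifr, 0, sizeof(ifr));
            strncpy(ifr.ifr_name, argv[1], IFNAMSIZ - 1);
            ifr.ifr_data = (char *)&ev;

            fd = socket(AF_INET, SOCK_DGRAM, 0);
            if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
                    perror("SIOCETHTOOL(ETHTOOL_SGRO)");
                    return 1;
            }
            return 0;
    }

If the warnings stop with GRO off and come back with it on, that would
narrow things down considerably.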
>
>>> First crash:
>>>
>>> [4083386.299221] ------------[ cut here ]------------
>>> [4083386.299358] WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 inet_gro_complete+0xbb/0xd0
>>> [4083386.299520] Modules linked in: sb_edac edac_core 8021q mrp garp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev ip6table_filter
>>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2 ebt_mark ebt_ip
>>> ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456 async_raid6_recov
>>> async_pq async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c joydev shpchp i2c_i801 i2c_smbus mei_me mei lpc_ich fjes ipmi_si
>>> ipmi_msghandler acpi_power_meter ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core mlx4_core mpt3sas scsi_transport_sas raid_class wmi ast ttm
>>> [4083386.300888] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.39 #1
>>> [4083386.301002] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 2.0a 09/16/2016
>>> [4083386.301109]  ffff880306603d90 ffffffff813f5935 0000000000000000 0000000000000000
>>> [4083386.301221]  ffff880306603dd0 ffffffff810a7e01 000005c18174578a ffff8802f94a9a00
>>> [4083386.301333]  ffff8802f0824450 0000000000000000 0000000000000040 0000000000000040
>>> [4083386.301445] Call Trace:
>>> [4083386.301483]  <IRQ> [4083386.301519]   dump_stack+0x63/0x8e
>>> [4083386.301596]   __warn+0xd1/0xf0
>>> [4083386.301665]   warn_slowpath_null+0x1d/0x20
>>> [4083386.301747]   inet_gro_complete+0xbb/0xd0
>>> [4083386.301830]   napi_gro_complete+0x73/0xa0
>>> [4083386.301911]   napi_gro_flush+0x5f/0x80
>>> [4083386.301988]   napi_complete_done+0x6a/0xb0
>>> [4083386.302075]   igb_poll+0x38d/0x720 [igb]
>>> [4083386.302156]   ? igb_msix_ring+0x2e/0x40 [igb]
>>> [4083386.302255]   ? __handle_irq_event_percpu+0x4b/0x1a0
>>> [4083386.302349]   net_rx_action+0x158/0x360
>>> [4083386.302430]   __do_softirq+0xd1/0x283
>>> [4083386.302507]   irq_exit+0xe9/0x100
>>> [4083386.302580]   xen_evtchn_do_upcall+0x35/0x50
>>> [4083386.302665]   xen_do_hypervisor_callback+0x1e/0x40
>>> [4083386.302754]  <EOI> [4083386.302787]   ? xen_hypercall_sched_op+0xa/0x20
>>> [4083386.302876]   ? xen_hypercall_sched_op+0xa/0x20
>>> [4083386.302965]   ? xen_safe_halt+0x10/0x20
>>> [4083386.303043]   ? default_idle+0x1e/0xd0
>>> [4083386.303122]   ? arch_cpu_idle+0xf/0x20
>>> [4083386.303200]   ? default_idle_call+0x2c/0x40
>>> [4083386.303284]   ? cpu_startup_entry+0x1ac/0x240
>>> [4083386.303370]   ? rest_init+0x77/0x80
>>> [4083386.303462]   ? start_kernel+0x4a7/0x4b4
>>> [4083386.303568]   ? set_init_arg+0x55/0x55
>>> [4083386.303670]   ? x86_64_start_reservations+0x24/0x26
>>> [4083386.303776]   ? xen_start_kernel+0x555/0x561
>>> [4083386.303873] ---[ end trace 8294f59ced689507 ]---

I think this first trace is more important than the one below.
Specifically, it points to a problem in GRO assembly: either no GRO ops
were registered, or there was no gro_complete function, for whatever
protocol was found in the packet.
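
For reference, the WARN at net/ipv4/af_inet.c:1473 corresponds to the
guard in inet_gro_complete(). A condensed sketch of the 4.9-era code
(paraphrased, so the exact line numbers and surrounding context are
approximate) looks like:

    /* net/ipv4/af_inet.c, inet_gro_complete(), condensed */
    struct iphdr *iph = (struct iphdr *)(skb->data + nhoff);
    const struct net_offload *ops;
    int proto = iph->protocol;
    int err = -ENOSYS;

    rcu_read_lock();
    ops = rcu_dereference(inet_offloads[proto]);
    if (WARN_ON(!ops || !ops->callbacks.gro_complete))
            goto out_unlock;

    /* Only a registered completion handler (e.g. tcp4_gro_complete) gets
     * called; hitting the WARN means the merged skb claims a protocol we
     * have no gro_complete callback for, so the packet is then dropped. */
    err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));

    out_unlock:
    rcu_read_unlock();

So the first trace says GRO assembled a packet whose inner protocol it
then had no completion handler for.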

>>> [4083386.303958] general protection fault: 0000 [#1] SMP
>>> [4083386.304041] Modules linked in: sb_edac edac_core 8021q mrp garp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev ip6table_filter
>>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2 ebt_mark ebt_ip
>>> ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456 async_raid6_recov
>>> async_pq async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c joydev shpchp i2c_i801 i2c_smbus mei_me mei lpc_ich fjes ipmi_si
>>> ipmi_msghandler acpi_power_meter ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core mlx4_core mpt3sas scsi_transport_sas raid_class wmi ast ttm
>>> [4083386.305179] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.9.39 #1
>>> [4083386.305307] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 2.0a 09/16/2016
>>> [4083386.305414] task: ffffffff81e0e540 task.stack: ffffffff81e00000
>>> [4083386.305498] RIP: e030:   skb_release_data+0x73/0xf0
>>> [4083386.305617] RSP: e02b:ffff880306603d90  EFLAGS: 00010206
>>> [4083386.305692] RAX: 0000000000000030 RBX: f5b36db76bd162c7 RCX: ffffffff81e60048
>>> [4083386.305790] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8802f94a9a00
>>> [4083386.305887] RBP: ffff880306603db0 R08: 0000000000004277 R09: 0000000000000000
>>> [4083386.305985] R10: 0000000000000005 R11: 0000000000000002 R12: 0000000000000000
>>> [4083386.306083] R13: ffff8802f94a9a00 R14: ffff88032f527740 R15: 0000000000000040
>>> [4083386.306186] FS:  0000000000000000(0000) GS:ffff880306600000(0000) knlGS:0000000000000000
>>> [4083386.306296] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [4083386.306407] CR2: 0000000001692ed8 CR3: 000000022b3c9000 CR4: 0000000000042660
>>> [4083386.306505] Stack:
>>> [4083386.306537]  ffff8802f94a9a00 ffff8802f94a9a00 ffffffff8175ac3e 0000000000000040
>>> [4083386.306649]  ffff880306603dc8 ffffffff81745764 ffff8802f94a9a00 ffff880306603df0
>>> [4083386.306762]  ffffffff817457c2 ffff8802f94a9a00 ffff8802f0824450 0000000000000000
>>> [4083386.306874] Call Trace:
>>> [4083386.306911]  <IRQ> [4083386.306944]   ? napi_gro_complete+0x5e/0xa0
>>> [4083386.307038]   skb_release_all+0x24/0x30
>>> [4083386.307133]   kfree_skb+0x32/0x90
>>> [4083386.307206]   napi_gro_complete+0x5e/0xa0
>>> [4083386.307287]   napi_gro_flush+0x5f/0x80
>>> [4083386.307365]   napi_complete_done+0x6a/0xb0
>>> [4083386.307449]   igb_poll+0x38d/0x720 [igb]
>>> [4083386.307530]   ? igb_msix_ring+0x2e/0x40 [igb]
>>> [4083386.307617]   ? __handle_irq_event_percpu+0x4b/0x1a0
>>> [4083386.307720]   net_rx_action+0x158/0x360
>>> [4083386.307800]   __do_softirq+0xd1/0x283
>>> [4083386.307877]   irq_exit+0xe9/0x100
>>> [4083386.307949]   xen_evtchn_do_upcall+0x35/0x50
>>> [4083386.308034]   xen_do_hypervisor_callback+0x1e/0x40
>>> [4083386.308124]  <EOI> [4083386.308156]   ? xen_hypercall_sched_op+0xa/0x20
>>> [4083386.308246]   ? xen_hypercall_sched_op+0xa/0x20
>>> [4083386.308334]   ? xen_safe_halt+0x10/0x20
>>> [4083386.308413]   ? default_idle+0x1e/0xd0
>>> [4083386.308491]   ? arch_cpu_idle+0xf/0x20
>>> [4083386.308568]   ? default_idle_call+0x2c/0x40
>>> [4083386.308651]   ? cpu_startup_entry+0x1ac/0x240
>>> [4083386.308737]   ? rest_init+0x77/0x80
>>> [4083386.308811]   ? start_kernel+0x4a7/0x4b4
>>> [4083386.308890]   ? set_init_arg+0x55/0x55
>>> [4083386.308968]   ? x86_64_start_reservations+0x24/0x26
>>> [4083386.309060]   ? xen_start_kernel+0x555/0x561
>>> [4083386.309144] Code: f0 41 0f c1 46 20 39 c2 74 09 5b 41 5c 41 5d 41 5e 5d c3 45 31 e4 41 80 3e 00 74 39 49 63 c4 48 83 c0 03 48 c1 e0 04 49 8b 1c
>>> 06 <48> 8b 43 20 a8 01 75 6f f0 ff 4b 1c 74 55 48 8b 03 48 c1 e8 33
>>> [4083386.309571] RIP   skb_release_data+0x73/0xf0
>>> [4083386.309658]  RSP <ffff880306603d90>
>>> [4083386.313000] ---[ end trace 8294f59ced689508 ]---
>>> [4083386.389667] Kernel panic - not syncing: Fatal exception in interrupt
>>> [4083386.389791] Kernel Offset: disabled
>>> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
>
> Output of addr2line for address of skb_release_data+0x73 is
>
> __read_once_size
> include/linux/compiler.h:243 (discriminator 2)
> compound_head
> include/linux/page-flags.h:143 (discriminator 2)
> put_page
> include/linux/mm.h:777 (discriminator 2)
> __skb_frag_unref
> include/linux/skbuff.h:2592 (discriminator 2)
> skb_release_data
> net/core/skbuff.c:594 (discriminator 2)
>
> skbuff.c:594 is:
>
> __skb_frag_unref(&shinfo->frags[i]);
>
> Actual assembly is:
> <+91>:  xor    %r12d,%r12d
> <+94>:  cmpb   $0x0,(%r14)
> <+98>:  je     <skb_release_data+157>
> <+100>: movslq %r12d,%rax
> <+103>: add    $0x3,%rax
> <+107>: shl    $0x4,%rax
> <+111>: mov    (%r14,%rax,1),%rbx
> <+115>: mov    0x20(%rbx),%rax <------ this is skb_release_data+0x73
> <+119>: test   $0x1,%al
> <+121>: jne   <skb_release_data+234>
>
> rbx is f5b36db76bd162c7, which seems like garbage. I don't know if this looks like any particular garbage.
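
That lines up with the faulting instruction. For reference, a condensed
sketch of the code involved (4.9-era, abbreviated, details approximate):

    /* net/core/skbuff.c, skb_release_data(): unref every page fragment */
    for (i = 0; i < shinfo->nr_frags; i++)
            __skb_frag_unref(&shinfo->frags[i]);        /* skbuff.c:594 */

    /* include/linux/skbuff.h */
    static inline void __skb_frag_unref(skb_frag_t *frag)
    {
            put_page(skb_frag_page(frag)); /* frag->page.p is what ends up in %rbx */
    }

    /* include/linux/page-flags.h: the first thing put_page() does is look
     * up the compound head, i.e. the mov 0x20(%rbx),%rax / test $0x1,%al
     * pair in the disassembly above */
    static inline struct page *compound_head(struct page *page)
    {
            unsigned long head = READ_ONCE(page->compound_head);

            if (head & 1)
                    return (struct page *)(head - 1);
            return page;
    }

So by the time the skb was freed, that frag's page pointer was already
garbage, and the dereference in compound_head() is where it blows up.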
>
>>> Second crash:
>>>
>>> [1838269.012349] general protection fault: 0000 [#1] SMP
>>> [1838269.012452] Modules linked in: ebtable_nat sb_edac edac_core 8021q mrp garp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
>>> ip6table_filter ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2
>>> ebt_mark ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456
>>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c joydev i2c_i801 i2c_smbus lpc_ich shpchp mei_me mei fjes
>>> ipmi_si ipmi_msghandler acpi_power_meter ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core mlx4_core mpt3sas scsi_transport_sas raid_class wmi ast ttm
>>> [1838269.013521] CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.9.39 #1
>>> [1838269.013637] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 2.0a 09/16/2016
>>> [1838269.013743] task: ffff88030008c4c0 task.stack: ffffc90041978000
>>> [1838269.013826] RIP: e030:   memcpy_erms+0x6/0x10
>>> [1838269.013952] RSP: e02b:ffffc9004197bac0  EFLAGS: 00010202
>>> [1838269.014026] RAX: ffff88032fcafe16 RBX: 0000000000000004 RCX: 0000000000000004
>>> [1838269.014124] RDX: 0000000000000004 RSI: 62a16ddedc6dbcb3 RDI: ffff88032fcafe16
>>> [1838269.014222] RBP: ffffc9004197bb20 R08: 0000000000000004 R09: 0000000000000004
>>> [1838269.014320] R10: ffff88026ae89500 R11: 0000000044639632 R12: 0000000000000048
>>> [1838269.014417] R13: 0000000000000000 R14: 0000000044639632 R15: 0000000000000048
>>> [1838269.014519] FS:  0000000000000000(0000) GS:ffff880306640000(0000) knlGS:ffff880306640000
>>> [1838269.014629] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [1838269.014709] CR2: ffffffffff600400 CR3: 0000000051939000 CR4: 0000000000042660
>>> [1838269.014808] Stack:
>>> [1838269.014840]  ffffffff81744c17 ffff88026ae89500 0000000044639632 ffff88030008c4c0
>>> [1838269.014952]  ffffffff00000004 0000000000000004 ffff88032fcafe16 ffff88026ae89500
>>> [1838269.015064]  0000000000000004 0000000000000004 000000000000004c 0000000000000028
>>> [1838269.015176] Call Trace:
>>> [1838269.015217]   ? skb_copy_bits+0x137/0x2c0
>>> [1838269.015299]   __pskb_pull_tail+0x7f/0x3b0
>>> [1838269.015382]   tcp_gro_receive+0x2c5/0x300
>>> [1838269.015465]   tcp6_gro_receive+0x13a/0x1a0
>>> [1838269.015547]   ipv6_gro_receive+0x1c6/0x380
>>> [1838269.015630]   dev_gro_receive+0x269/0x3b0
>>> [1838269.015712]   napi_gro_receive+0x38/0xf0
>>> [1838269.015796]   igb_clean_rx_irq+0x38e/0x690 [igb]
>>> [1838269.015886]   igb_poll+0x362/0x720 [igb]
>>> [1838269.015968]   ? dequeue_entity+0x26e/0xa90
>>> [1838269.016051]   ? xen_mc_flush+0x17b/0x1b0
>>> [1838269.016131]   net_rx_action+0x158/0x360
>>> [1838269.016212]   __do_softirq+0xd1/0x283
>>> [1838269.016290]   ? sort_range+0x30/0x30
>>> [1838269.016366]   run_ksoftirqd+0x29/0x50
>>> [1838269.016443]   smpboot_thread_fn+0x110/0x160
>>> [1838269.016525]   kthread+0xd7/0xf0
>>> [1838269.016595]   ? kthread_park+0x60/0x60
>>> [1838269.016673]   ret_from_fork+0x25/0x30
>>> [1838269.016758] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89
>>> d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
>>> [1838269.017183] RIP   memcpy_erms+0x6/0x10
>>> [1838269.017264]  RSP <ffffc9004197bac0>
>>> [1838269.020618] ---[ end trace 3506ce1d7200529a ]---
>>> [1838269.079891] Kernel panic - not syncing: Fatal exception in interrupt
>>> [1838269.080014] Kernel Offset: disabled
>>> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
>
> --Sarah
