Message-ID: <20180108123826.pjqoj4u5cts4fgnm@gauss3.secunet.de>
Date: Mon, 8 Jan 2018 13:38:26 +0100
From: Steffen Klassert <steffen.klassert@...unet.com>
To: Ozgur <ozgur@...sey.org>
CC: Tobias Hommel <netdev-list@...oetigt.de>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in
xfrm_lookup
On Sat, Jan 06, 2018 at 12:27:11AM +0300, Ozgur wrote:
>
>
> 06.01.2018, 00:20, "Tobias Hommel" <netdev-list@...oetigt.de>:
> > Hi,
>
> Hi Tobias,
>
> > I'm running into a NULL pointer dereference after updating from Linux 4.1.6 to
> > 4.14.11 (see kernel log below). I tried 4.14.3 initially which did not work
> > either.
> > Does anyone have an idea what is happening here?
> >
> > The affected machine has 2 active ethernet interfaces (igb driver) and acts as
> > a VPN gateway running strongswan. There are several hundreds of IPSec
> > roadwarriors connecting to eth1. eth0 connects to an infrastructure running an
> > HTTP server.
> > During my tests these roadwarriors connect to the gateway, sometimes download a
> > large file from the HTTP server, disconnect and after a random delay repeat
> > these steps.
> >
> > Some observations I made:
> > * SMP Affinity for IRQs of the NICs Rx/Tx queues (/proc/irq/$IRQ/smp_affinity)
> >   * all affinities set to default ff is broken
> >   * setting affinity for all queues of both interfaces to the same CPU seems to
> >     work fine (running stable for more than 1 day now)
> >   * setting affinity of eth0 queues to CPU 1 and affinity of eth1 queues to CPU
> >     2 is broken and seems to always trigger the bug on CPU 1
> > * the top 6 entries of the call trace are the same every time the system
> >   crashes, the other entries differ sometimes
> >
> > The bug is 100% reproducible on the Intel Atom machine from the log below and
> > also on a HP ProLiant Gen6 (also igb driver).
> > I can, of course, provide further information (CPU, NIC, kernel config, more
> > traces, etc.) if required.
> > If helpful I could also run tests on HP ProLiant Gen9 which has different NICs
> > (tg3).
> >
> > [ 7998.489094] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > [ 7998.496993] IP: xfrm_lookup+0x2a/0x7e0
> > [ 7998.500759] PGD 0 P4D 0
> > [ 7998.503316] Oops: 0000 [#1] SMP PTI
> > [ 7998.506835] Modules linked in:
> > [ 7998.509929] CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.14.11 #3
> > [ 7998.516244] Hardware name: To be filled by O.E.M. CAR-2051/CAR, BIOS 1.01 07/11/2016
> > [ 7998.524039] task: ffff8826bb118000 task.stack: ffff947ac00f0000
> > [ 7998.530004] RIP: 0010:xfrm_lookup+0x2a/0x7e0
> > [ 7998.534298] RSP: 0018:ffff947ac00f3b60 EFLAGS: 00010246
> > [ 7998.539550] RAX: 0000000000000000 RBX: ffffffff93074040 RCX: 0000000000000000
> > [ 7998.546709] RDX: ffff947ac00f3bd8 RSI: 0000000000000000 RDI: ffffffff93074040
> > [ 7998.553868] RBP: ffffffff93074040 R08: 0000000000000002 R09: 0000000000000001
> > [ 7998.561026] R10: 0000000000000032 R11: 0000000000000000 R12: ffff947ac00f3bd8
> > [ 7998.568212] R13: 0000000000000000 R14: 0000000000000002 R15: ffff8826b69a8078
> > [ 7998.575395] FS: 0000000000000000(0000) GS:ffff8826bfc80000(0000) knlGS:0000000000000000
> > [ 7998.583550] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 7998.589324] CR2: 0000000000000020 CR3: 00000001781da000 CR4: 00000000001006e0
> > [ 7998.596482] Call Trace:
> > [ 7998.598959] __xfrm_route_forward+0xa4/0x110
> > [ 7998.603263] ip_forward+0x3e0/0x450
> > [ 7998.606778] ? ip_rcv_finish+0x61/0x3a0
> > [ 7998.610645] ip_rcv+0x2c4/0x390
> > [ 7998.613818] ? inet_del_offload+0x30/0x30
> > [ 7998.617857] __netif_receive_skb_core+0x751/0xb00
> > [ 7998.622562] ? skb_send_sock+0x40/0x40
> > [ 7998.626356] ? netif_receive_skb_internal+0x47/0xf0
> > [ 7998.631252] netif_receive_skb_internal+0x47/0xf0
> > [ 7998.635987] napi_gro_receive+0x70/0x90
> > [ 7998.639835] gro_cell_poll+0x53/0x90
> > [ 7998.643439] net_rx_action+0x1fc/0x310
> > [ 7998.647210] ? rebalance_domains+0x101/0x2b0
> > [ 7998.651500] __do_softirq+0xd5/0x1cf
> > [ 7998.655105] run_ksoftirqd+0x14/0x30
> > [ 7998.658712] smpboot_thread_fn+0xf9/0x150
> > [ 7998.662723] kthread+0xef/0x130
> > [ 7998.665893] ? sort_range+0x20/0x20
> > [ 7998.669404] ? kthread_park+0x60/0x60
> > [ 7998.673098] ret_from_fork+0x1f/0x30
> > [ 7998.676674] Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 00 00 00 0f 84
> > [ 7998.695681] RIP: xfrm_lookup+0x2a/0x7e0 RSP: ffff947ac00f3b60
> > [ 7998.701479] CR2: 0000000000000020
> > [ 7998.704799] ---[ end trace 0544b1946919baad ]---
> > [ 7998.709442] Kernel panic - not syncing: Fatal exception in interrupt
> > [ 7998.715918] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
>
> This error doesn't look like it comes from the latest kernel version; I think this is a NIC driver problem.
Can you please explain why you think that this is a driver problem?
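
For reference, the affinity settings quoted above are written to
/proc/irq/$IRQ/smp_affinity as a hexadecimal CPU bitmask in which bit n
selects CPU n. Below is a minimal sketch of how such masks can be computed.
It assumes 0-based CPU numbering (the report does not say whether its
"CPU 1" and "CPU 2" are 0-based), and the helper name affinity_mask is
hypothetical, not part of any kernel tooling.

```python
def affinity_mask(cpus):
    """Return the hex bitmask string for an iterable of 0-based CPU numbers.

    Illustrative helper only: bit n of the mask is set when CPU n is listed.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu  # set the bit that corresponds to this CPU
    return format(mask, "x")


if __name__ == "__main__":
    # Masks for the three setups mentioned in the report, under the
    # 0-based numbering assumption stated above.
    print(affinity_mask(range(8)))  # CPUs 0-7, i.e. the default "ff"
    print(affinity_mask([1]))       # only CPU 1
    print(affinity_mask([2]))       # only CPU 2
```

The resulting string is what would be written (as root) to the
smp_affinity file of each Rx/Tx queue IRQ.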