netdev - BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm_lookup

Date:   Fri, 5 Jan 2018 22:13:23 +0100
From:   Tobias Hommel <netdev-list@...oetigt.de>
To:     netdev@...r.kernel.org
Subject: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in
 xfrm_lookup

Hi,

I'm running into a NULL pointer dereference after updating from Linux 4.1.6 to
4.14.11 (see kernel log below). I tried 4.14.3 initially which did not work
either.
Does anyone have an idea what is happening here?

The affected machine has 2 active ethernet interfaces (igb driver) and acts as
a VPN gateway running strongSwan. There are several hundred IPsec
roadwarriors connecting to eth1. eth0 connects to an infrastructure running an
HTTP server.
During my tests these roadwarriors connect to the gateway, sometimes download a
large file from the HTTP server, disconnect and after a random delay repeat
these steps.

Some observations I made:
* SMP Affinity for IRQs of the NICs Rx/Tx queues (/proc/irq/$IRQ/smp_affinity)
  * leaving all affinities at the default (ff) is broken
  * setting affinity for all queues of both interfaces to the same CPU seems to
    work fine (running stable for more than 1 day now)
  * setting affinity of eth0 queues to CPU 1 and affinity of eth1 queues to CPU
    2 is broken and seems to always trigger the bug on CPU 1
* the top 6 entries of the call trace are the same every time the system
  crashes, the other entries differ sometimes

The bug is 100% reproducible on the Intel Atom machine from the log below and
also on an HP ProLiant Gen6 (which also uses the igb driver).
I can, of course, provide further information (CPU, NIC, kernel config, more
traces, etc.) if required.
If helpful, I could also run tests on an HP ProLiant Gen9, which has different
NICs (tg3).

[ 7998.489094] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 7998.496993] IP: xfrm_lookup+0x2a/0x7e0
[ 7998.500759] PGD 0 P4D 0 
[ 7998.503316] Oops: 0000 [#1] SMP PTI
[ 7998.506835] Modules linked in:
[ 7998.509929] CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.14.11 #3
[ 7998.516244] Hardware name: To be filled by O.E.M. CAR-2051/CAR, BIOS 1.01 07/11/2016
[ 7998.524039] task: ffff8826bb118000 task.stack: ffff947ac00f0000
[ 7998.530004] RIP: 0010:xfrm_lookup+0x2a/0x7e0
[ 7998.534298] RSP: 0018:ffff947ac00f3b60 EFLAGS: 00010246
[ 7998.539550] RAX: 0000000000000000 RBX: ffffffff93074040 RCX: 0000000000000000
[ 7998.546709] RDX: ffff947ac00f3bd8 RSI: 0000000000000000 RDI: ffffffff93074040
[ 7998.553868] RBP: ffffffff93074040 R08: 0000000000000002 R09: 0000000000000001
[ 7998.561026] R10: 0000000000000032 R11: 0000000000000000 R12: ffff947ac00f3bd8
[ 7998.568212] R13: 0000000000000000 R14: 0000000000000002 R15: ffff8826b69a8078
[ 7998.575395] FS:  0000000000000000(0000) GS:ffff8826bfc80000(0000) knlGS:0000000000000000
[ 7998.583550] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7998.589324] CR2: 0000000000000020 CR3: 00000001781da000 CR4: 00000000001006e0
[ 7998.596482] Call Trace:
[ 7998.598959]  __xfrm_route_forward+0xa4/0x110
[ 7998.603263]  ip_forward+0x3e0/0x450
[ 7998.606778]  ? ip_rcv_finish+0x61/0x3a0
[ 7998.610645]  ip_rcv+0x2c4/0x390
[ 7998.613818]  ? inet_del_offload+0x30/0x30
[ 7998.617857]  __netif_receive_skb_core+0x751/0xb00
[ 7998.622562]  ? skb_send_sock+0x40/0x40
[ 7998.626356]  ? netif_receive_skb_internal+0x47/0xf0
[ 7998.631252]  netif_receive_skb_internal+0x47/0xf0
[ 7998.635987]  napi_gro_receive+0x70/0x90
[ 7998.639835]  gro_cell_poll+0x53/0x90
[ 7998.643439]  net_rx_action+0x1fc/0x310
[ 7998.647210]  ? rebalance_domains+0x101/0x2b0
[ 7998.651500]  __do_softirq+0xd5/0x1cf
[ 7998.655105]  run_ksoftirqd+0x14/0x30
[ 7998.658712]  smpboot_thread_fn+0xf9/0x150
[ 7998.662723]  kthread+0xef/0x130
[ 7998.665893]  ? sort_range+0x20/0x20
[ 7998.669404]  ? kthread_park+0x60/0x60
[ 7998.673098]  ret_from_fork+0x1f/0x30
[ 7998.676674] Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 00 00 00 0f 84 
[ 7998.695681] RIP: xfrm_lookup+0x2a/0x7e0 RSP: ffff947ac00f3b60
[ 7998.701479] CR2: 0000000000000020
[ 7998.704799] ---[ end trace 0544b1946919baad ]---
[ 7998.709442] Kernel panic - not syncing: Fatal exception in interrupt
[ 7998.715918] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
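
The log above already contains enough to see which access faulted. The
following Python sketch decodes only the instruction at the marked byte of the
"Code:" line and checks it against the register dump. It is a hand-rolled
decoder for this one encoding (REX.W MOV with an 8-bit displacement), not a
general disassembler, and the function names are hypothetical. It also shows
how the "Kernel Offset" value can be subtracted from a static address (RBX is
used as an example) before looking it up in System.map; which symbol that
address belongs to is not determined here.

```python
# Values copied from the oops above.
CODE = (
    "00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 d4 "
    "48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 "
    "38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 "
    "00 00 00 0f 84"
)
RSI = 0x0000000000000000
CR2 = 0x0000000000000020
RBX = 0xFFFFFFFF93074040
KERNEL_OFFSET = 0x11000000

# Register numbering used by ModRM when no REX.R/REX.B extension bit is set.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line):
    """Return the bytes from the <..>-marked (faulting) byte onward."""
    tokens = code_line.split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            tail = [tok[1:-1]] + tokens[i + 1:]
            return bytes(int(t, 16) for t in tail)
    raise ValueError("no <..> marker found in code line")


def decode_mov_load_disp8(insn):
    """Decode 48 8B /r with mod=01 and no SIB: mov r64, [r64 + disp8].

    Returns (destination register, base register, signed displacement).
    """
    if len(insn) < 4 or insn[0] != 0x48 or insn[1] != 0x8B:
        raise ValueError("not a REX.W MOV r64, r/m64")
    modrm = insn[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only the disp8 form without SIB is handled")
    disp = insn[3] if insn[3] < 0x80 else insn[3] - 0x100
    return REG64[reg], REG64[rm], disp


def unrelocate(addr, offset=KERNEL_OFFSET):
    """Undo the KASLR shift reported in the 'Kernel Offset' line."""
    return addr - offset


dst, base, disp = decode_mov_load_disp8(faulting_bytes(CODE))
print(f"mov {dst}, [{base}+{disp:#x}]")   # mov rax, [rsi+0x20]
print(RSI + disp == CR2)                  # True
print(hex(unrelocate(RBX)))               # 0xffffffff82074040
```

So the faulting instruction loads from offset 0x20 of whatever RSI points to,
and RSI is 0, which matches CR2 = 0x20. Under the x86-64 calling convention RSI
carries the second function argument, and the preceding bytes `49 89 f5` copy
it into R13, which is also 0 in the dump. This suggests that xfrm_lookup() was
called with a NULL second argument; which parameter that is should be checked
against the 4.14.11 source.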

Best regards,

Tobias Hommel