netdev - Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm

Message-ID: <20180108085718.dhn45syjw6k2u5x4@arbeitstier>
Date:   Mon, 8 Jan 2018 09:57:18 +0100
From:   Tobias Hommel <netdev-list@...oetigt.de>
To:     Ozgur <ozgur@...sey.org>
Cc:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in
 xfrm_lookup

On Fri, Jan 05, 2018 at 09:55:23PM +0000, Tobias Hommel wrote:
> On Sat, Jan 06, 2018 at 12:27:11AM +0300, Ozgur wrote:
> > 
> > 
> > 06.01.2018, 00:20, "Tobias Hommel" <netdev-list@...oetigt.de>:
> > > Hi,
> > 
> > Hi Tobias,
> > 
> > > I'm running into a NULL pointer dereference after updating from Linux 4.1.6 to
> > > 4.14.11 (see kernel log below). I tried 4.14.3 initially which did not work
> > > either.
> > > > Does anyone have an idea what is happening here?
> > >
> > > The affected machine has 2 active ethernet interfaces (igb driver) and acts as
> > > > a VPN gateway running strongswan. There are several hundred IPsec
> > > roadwarriors connecting to eth1. eth0 connects to an infrastructure running an
> > > HTTP server.
> > > During my tests these roadwarriors connect to the gateway, sometimes download a
> > > large file from the HTTP server, disconnect and after a random delay repeat
> > > these steps.
> > >
> > > Some observations I made:
> > > * SMP Affinity for IRQs of the NICs Rx/Tx queues (/proc/irq/$IRQ/smp_affinity)
> > > >   * leaving all affinities at the default (ff) is broken
> > >   * setting affinity for all queues of both interfaces to the same CPU seems to
> > >     work fine (running stable for more than 1 day now)
> > >   * setting affinity of eth0 queues to CPU 1 and affinity of eth1 queues to CPU
> > >     2 is broken and seems to always trigger the bug on CPU 1
> > > * the top 6 entries of the call trace are the same every time the system
> > >   crashes, the other entries differ sometimes
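
The affinity experiments listed above can be reproduced with a small helper. This is only a sketch: the function names are made up for illustration, it assumes zero-based CPU numbering (the report does not say whether "CPU 1" is zero-based), and it only handles machines with up to 32 CPUs, where the value in /proc/irq/$IRQ/smp_affinity is a single hex word.

```python
def cpu_mask(cpus):
    """Return the hex bitmask string for a set of zero-based CPU indices.

    Bit N of the mask corresponds to CPU N, which is the format
    /proc/irq/<irq>/smp_affinity expects on systems with <= 32 CPUs.
    """
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError(f"CPU index out of range for this sketch: {cpu}")
        mask |= 1 << cpu
    return format(mask, "x")


def affinity_path(irq):
    """Path of the affinity file for one IRQ number."""
    return f"/proc/irq/{irq}/smp_affinity"


def set_irq_affinity(irq, cpus):
    """Pin one IRQ to the given CPUs (needs root; hypothetical helper)."""
    with open(affinity_path(irq), "w") as f:
        f.write(cpu_mask(cpus))


if __name__ == "__main__":
    print(cpu_mask(range(8)))  # all of CPUs 0-7, i.e. the "ff" default above
    print(cpu_mask([1]))       # a single CPU, zero-based index 1
```

The IRQ numbers for each Rx/Tx queue would have to be looked up in /proc/interrupts first; that step is not shown here.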
> > >
> > > The bug is 100% reproducible on the Intel Atom machine from the log below and
> > > also on a HP ProLiant Gen6 (also igb driver).
> > > I can, of course, provide further information (CPU, NIC, kernel config, more
> > > traces, etc.) if required.
> > > If helpful I could also run tests on HP ProLiant Gen9 which has different NICs
> > > (tg3).
> > >
> > > [ 7998.489094] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > > [ 7998.496993] IP: xfrm_lookup+0x2a/0x7e0
> > > [ 7998.500759] PGD 0 P4D 0
> > > [ 7998.503316] Oops: 0000 [#1] SMP PTI
> > > [ 7998.506835] Modules linked in:
> > > [ 7998.509929] CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.14.11 #3
> > > [ 7998.516244] Hardware name: To be filled by O.E.M. CAR-2051/CAR, BIOS 1.01 07/11/2016
> > > [ 7998.524039] task: ffff8826bb118000 task.stack: ffff947ac00f0000
> > > [ 7998.530004] RIP: 0010:xfrm_lookup+0x2a/0x7e0
> > > [ 7998.534298] RSP: 0018:ffff947ac00f3b60 EFLAGS: 00010246
> > > [ 7998.539550] RAX: 0000000000000000 RBX: ffffffff93074040 RCX: 0000000000000000
> > > [ 7998.546709] RDX: ffff947ac00f3bd8 RSI: 0000000000000000 RDI: ffffffff93074040
> > > [ 7998.553868] RBP: ffffffff93074040 R08: 0000000000000002 R09: 0000000000000001
> > > [ 7998.561026] R10: 0000000000000032 R11: 0000000000000000 R12: ffff947ac00f3bd8
> > > [ 7998.568212] R13: 0000000000000000 R14: 0000000000000002 R15: ffff8826b69a8078
> > > [ 7998.575395] FS: 0000000000000000(0000) GS:ffff8826bfc80000(0000) knlGS:0000000000000000
> > > [ 7998.583550] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 7998.589324] CR2: 0000000000000020 CR3: 00000001781da000 CR4: 00000000001006e0
> > > [ 7998.596482] Call Trace:
> > > [ 7998.598959] __xfrm_route_forward+0xa4/0x110
> > > [ 7998.603263] ip_forward+0x3e0/0x450
> > > [ 7998.606778] ? ip_rcv_finish+0x61/0x3a0
> > > [ 7998.610645] ip_rcv+0x2c4/0x390
> > > [ 7998.613818] ? inet_del_offload+0x30/0x30
> > > [ 7998.617857] __netif_receive_skb_core+0x751/0xb00
> > > [ 7998.622562] ? skb_send_sock+0x40/0x40
> > > [ 7998.626356] ? netif_receive_skb_internal+0x47/0xf0
> > > [ 7998.631252] netif_receive_skb_internal+0x47/0xf0
> > > [ 7998.635987] napi_gro_receive+0x70/0x90
> > > [ 7998.639835] gro_cell_poll+0x53/0x90
> > > [ 7998.643439] net_rx_action+0x1fc/0x310
> > > [ 7998.647210] ? rebalance_domains+0x101/0x2b0
> > > [ 7998.651500] __do_softirq+0xd5/0x1cf
> > > [ 7998.655105] run_ksoftirqd+0x14/0x30
> > > [ 7998.658712] smpboot_thread_fn+0xf9/0x150
> > > [ 7998.662723] kthread+0xef/0x130
> > > [ 7998.665893] ? sort_range+0x20/0x20
> > > [ 7998.669404] ? kthread_park+0x60/0x60
> > > [ 7998.673098] ret_from_fork+0x1f/0x30
> > > [ 7998.676674] Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 00 00 00 0f 84
> > > [ 7998.695681] RIP: xfrm_lookup+0x2a/0x7e0 RSP: ffff947ac00f3b60
> > > [ 7998.701479] CR2: 0000000000000020
> > > [ 7998.704799] ---[ end trace 0544b1946919baad ]---
> > > [ 7998.709442] Kernel panic - not syncing: Fatal exception in interrupt
> > > [ 7998.715918] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
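
As a rough aid for reading the oops above: the byte wrapped in <..> on the "Code:" line is where the faulting instruction starts. The sketch below pulls out those bytes from an excerpt of that line and splits the ModR/M byte. The helper name is invented for illustration, and the decoding only covers this one instruction form (REX.W + 8b /r with an 8-bit displacement); a real analysis would use a disassembler.

```python
import re

# Excerpt of the "Code:" line from the oops, around the <..> marker.
CODE_EXCERPT = "48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 38"

# RSI as printed in the register dump of the oops.
RSI = 0x0000000000000000


def faulting_bytes(code_line, count=4):
    """Return `count` bytes starting at the byte marked with <..>."""
    tokens = code_line.split()
    for i, tok in enumerate(tokens):
        m = re.fullmatch(r"<([0-9a-fA-F]{2})>", tok)
        if m:
            tail = [m.group(1)] + tokens[i + 1:i + count]
            return bytes.fromhex("".join(tail))
    raise ValueError("no <..> marker found in code line")


insn = faulting_bytes(CODE_EXCERPT)

# insn[0] = 0x48 (REX.W), insn[1] = 0x8b (MOV r64, r/m64),
# insn[2] = ModR/M, insn[3] = 8-bit displacement.
modrm = insn[2]
mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
disp8 = insn[3]

if __name__ == "__main__":
    print(insn.hex(" "))
    print(f"mod={mod} reg={reg} rm={rm} disp8={disp8:#x}")
    print(f"effective address = {RSI + disp8:#x}")
```

With mod=1, reg=0 (rax) and rm=6 (rsi), the instruction reads as a load from offset 0x20 of RSI into RAX. RSI is 0 in the register dump and 0 + 0x20 matches CR2, so the second argument register appears to have held NULL when xfrm_lookup was entered. That is only my reading of the dump, not a confirmed diagnosis.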
> > 
> > 
> > This error doesn't look specific to the latest kernel version; I think the problem is in the NIC driver.
> > Which Ethernet card model are you using?
> This is what lspci shows for both NICs:
> # lspci -nns 00:14.0
> 00:14.0 Ethernet controller [0200]: Intel Corporation Ethernet Connection I354 [8086:1f41] (rev 03)
> 
> I currently have no access to the other hardware where this is happening but I
> could get further information after the weekend.

This is the NIC model on the other machine:
0a:00.0 Ethernet controller [0200]: Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)

> 
> > And which driver version do you use?
> # ethtool -i eth0  # same for eth1
> driver: igb
> version: 5.4.0-k
> firmware-version: 0.0.0
> expansion-rom-version: 
> bus-info: 0000:00:14.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
> 
Btw, this is the driver shipping with Linux 4.14.11.

> > 
> > > Best regards,
> > >
> > > Tobias Hommel
> > 
> > Ozgur