netdev - Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm

Message-ID: <20180109090651.wuxcdju5ynlkq25l@arbeitstier>
Date:   Tue, 9 Jan 2018 10:06:51 +0100
From:   Tobias Hommel <netdev-list@...oetigt.de>
To:     Steffen Klassert <steffen.klassert@...unet.com>
Cc:     netdev@...r.kernel.org
Subject: Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in
 xfrm_lookup

On Tue, Jan 09, 2018 at 09:19:39AM +0100, Steffen Klassert wrote:
> On Mon, Jan 08, 2018 at 02:53:48PM +0100, Tobias Hommel wrote:
> 
> ...
> 
> > [  439.095554] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > [  439.103664] IP: xfrm_lookup+0x2a/0x7d0
> > [  439.107551] PGD 0 P4D 0 
> > [  439.110144] Oops: 0000 [#1] SMP PTI
> > [  439.113653] Modules linked in:
> > [  439.116774] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.14.12 #1
> > [  439.122900] Hardware name: To be filled by O.E.M. CAR-2051/CAR, BIOS 1.01 07/11/2016
> > [  439.130769] task: ffff8cf33b0ea280 task.stack: ffff9492c0090000
> > [  439.136726] RIP: 0010:xfrm_lookup+0x2a/0x7d0
> > [  439.141005] RSP: 0018:ffff8cf33fd83bd0 EFLAGS: 00010246
> > [  439.146315] RAX: 0000000000000000 RBX: ffffffff87074080 RCX: 0000000000000000
> > [  439.153537] RDX: ffff8cf33fd83c48 RSI: 0000000000000000 RDI: ffffffff87074080
> > [  439.160780] RBP: ffffffff87074080 R08: 0000000000000002 R09: 0000000000000000
> > [  439.167958] R10: 0000000000000020 R11: 0000000000000020 R12: ffff8cf33fd83c48
> > [  439.175115] R13: 0000000000000000 R14: 0000000000000002 R15: ffff8cf33b240078
> > [  439.182337] FS:  0000000000000000(0000) GS:ffff8cf33fd80000(0000) knlGS:0000000000000000
> > [  439.190456] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  439.196227] CR2: 0000000000000020 CR3: 000000013200a000 CR4: 00000000001006e0
> > [  439.203386] Call Trace:
> > [  439.205869]  <IRQ>
> > [  439.207886]  __xfrm_route_forward+0xa4/0x110
> > [  439.212195]  ip_forward+0x3da/0x450
> > [  439.215696]  ? ip_rcv_finish+0x61/0x390
> > [  439.219542]  ip_rcv+0x2b5/0x380
> > [  439.222716]  ? inet_del_offload+0x30/0x30
> > [  439.226736]  __netif_receive_skb_core+0x751/0xb00
> > [  439.231469]  ? netif_receive_skb_internal+0x47/0xf0
> > [  439.236391]  netif_receive_skb_internal+0x47/0xf0
> > [  439.241150]  napi_gro_flush+0x50/0x70
> > [  439.244831]  napi_complete_done+0x90/0xd0
> > [  439.248872]  igb_poll+0x8fd/0xe80
> > [  439.252190]  net_rx_action+0x1fc/0x310
> > [  439.255978]  __do_softirq+0xd5/0x1cf
> > [  439.259584]  irq_exit+0xa3/0xb0
> > [  439.262763]  do_IRQ+0x45/0xc0
> > [  439.265772]  common_interrupt+0x95/0x95
> > [  439.269609]  </IRQ>
> > [  439.271733] RIP: 0010:cpuidle_enter_state+0x120/0x200
> > [  439.276810] RSP: 0018:ffff9492c0093eb8 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff5d
> > [  439.284436] RAX: ffff8cf33fd9ea80 RBX: 0000000000000002 RCX: 000000663c21ea0f
> > [  439.291604] RDX: 0000000000000000 RSI: 00000000355556ca RDI: 0000000000000000
> > [  439.298772] RBP: ffff8cf33fda71e8 R08: 0000000000000003 R09: 0000000000000018
> > [  439.305930] R10: 00000000ffffffff R11: 000000000000057c R12: 000000663c21ea0f
> > [  439.313089] R13: 000000663c1c6c33 R14: 0000000000000002 R15: 0000000000000000
> > [  439.320259]  ? cpuidle_enter_state+0x11c/0x200
> > [  439.324740]  do_idle+0xd6/0x170
> > [  439.327885]  cpu_startup_entry+0x67/0x70
> > [  439.331837]  start_secondary+0x167/0x190
> > [  439.335788]  secondary_startup_64+0xa5/0xb0
> > [  439.340001] Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 00 00 00 0f 84 
> > [  439.358988] RIP: xfrm_lookup+0x2a/0x7d0 RSP: ffff8cf33fd83bd0
> > [  439.364759] CR2: 0000000000000020
> > [  439.368105] ---[ end trace c6b298b556ea7769 ]---
> > [  439.372752] Kernel panic - not syncing: Fatal exception in interrupt
> > [  439.379255] Kernel Offset: 0x5000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > [  439.390029] Rebooting in 10 seconds..
> 
> ...
> 
> > 0000000000004230 <xfrm_lookup>:
> >     4230:	41 57                	push   %r15
> >     4232:	41 56                	push   %r14
> >     4234:	45 89 c6             	mov    %r8d,%r14d
> >     4237:	41 55                	push   %r13
> >     4239:	41 54                	push   %r12
> >     423b:	49 89 f5             	mov    %rsi,%r13
> >     423e:	55                   	push   %rbp
> >     423f:	53                   	push   %rbx
> >     4240:	49 89 d4             	mov    %rdx,%r12
> >     4243:	48 89 fb             	mov    %rdi,%rbx
> >     4246:	48 83 ec 40          	sub    $0x40,%rsp
> >     424a:	65 48 8b 04 25 28 00 	mov    %gs:0x28,%rax
> >     4251:	00 00 
> >     4253:	48 89 44 24 38       	mov    %rax,0x38(%rsp)
> >     4258:	31 c0                	xor    %eax,%eax
> >     425a:	48 8b 46 20          	mov    0x20(%rsi),%rax
> 
> 
> The above is the failing instruction, RSI holds the second argument
> of the called function which is a NULL pointer. The second argument
> of xfrm_lookup() is dst_orig, so it is as I thought. Now let's find
> out why. I don't see anything obvious, so we need to narrow it down.
> 
> > CONFIG_INET_ESP=y
> > CONFIG_INET_ESP_OFFLOAD=y
> 
> You have CONFIG_INET_ESP_OFFLOAD enabled. This is new, so maybe it
> still has some problems. You should not hit an offload codepath
> because all your SAs are configured with UDP encapsulation which
> is still not supported with offload.
> 
> Please try to disable GRO on both interfaces and see what happens:
> 
> ethtool -K eth0 gro off
> ethtool -K eth1 gro off
I actually already tried that with only eth1 off. To verify, I turned offloading
off for both interfaces. Same problem: see attached panic.gro_off.log

> 
> Then disable CONFIG_INET_ESP_OFFLOAD and try again.
Rebuilt with CONFIG_INET_ESP_OFFLOAD disabled, same problem: see attached
panic.esp_offload_disabled.log

> 
> This should show us if this feature is responsible for the bug.
> 

I will try narrowing down the problem by trying out some older kernels for now.

View attachment "panic.esp_offload_disabled.log" of type "text/plain" (3344 bytes)

View attachment "panic.gro_off.log" of type "text/plain" (2712 bytes)