netdev - Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm


Message-ID: <20180108123608.tkqe2bhjzcw3ya2y@gauss3.secunet.de>
Date:   Mon, 8 Jan 2018 13:36:08 +0100
From:   Steffen Klassert <steffen.klassert@...unet.com>
To:     Tobias Hommel <netdev-list@...oetigt.de>
CC:     <netdev@...r.kernel.org>
Subject: Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in
 xfrm_lookup

On Fri, Jan 05, 2018 at 10:13:23PM +0100, Tobias Hommel wrote:
> Hi,
> 
> I'm running into a NULL pointer dereference after updating from Linux 4.1.6 to
> 4.14.11 (see kernel log below). I tried 4.14.3 initially which did not work
> either.
> Does anyone have an idea what is happening here?
> 
> The affected machine has 2 active ethernet interfaces (igb driver) and acts as
> a VPN gateway running strongswan. There are several hundreds of IPSec
> roadwarriors connecting to eth1. eth0 connects to an infrastructure running an
> HTTP server.
> During my tests these roadwarriors connect to the gateway, sometimes download a
> large file from the HTTP server, disconnect and after a random delay repeat
> these steps.
> 
> Some observations I made:
> * SMP Affinity for IRQs of the NICs Rx/Tx queues (/proc/irq/$IRQ/smp_affinity)
>   * all affinities set to default ff is broken
>   * setting affinity for all queues of both interfaces to the same CPU seems to
>     work fine (running stable for more than 1 day now)
>   * setting affinity of eth0 queues to CPU 1 and affinity of eth1 queues to CPU
>     2 is broken and seems to always trigger the bug on CPU 1
> * the top 6 entries of the call trace are the same every time the system
>   crashes, the other entries differ sometimes
> 
> The bug is 100% reproducible on the Intel Atom machine from the log below and
> also on a HP ProLiant Gen6 (also igb driver).
> I can, of course, provide further information (CPU, NIC, kernel config, more
> traces, etc.) if required.
> If helpful I could also run tests on HP ProLiant Gen9 which has different NICs
> (tg3).
> 
> [ 7998.489094] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> [ 7998.496993] IP: xfrm_lookup+0x2a/0x7e0

xfrm_lookup+0x2a is at the very beginning of xfrm_lookup(), where we
find:

u16 family = dst_orig->ops->family;

ops is at an offset of 32 bytes (0x20) in dst_orig, which matches the
faulting address 0000000000000020, so it looks like dst_orig is NULL.
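
To illustrate the arithmetic, here is a minimal userspace sketch. The
structs are hypothetical stand-ins, not the kernel's real definitions:
the only property modelled is that four pointer-sized slots precede
ops, which gives offset 32 on a 64-bit build. Reading a member through
a NULL base pointer touches address 0 + offset, which is why the oops
reports 0x20.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct dst_ops; only 'family' is modelled. */
struct mock_dst_ops {
    unsigned short family;
};

/*
 * Hypothetical stand-in for struct dst_entry. The placeholder array is an
 * assumption: it only represents "whatever precedes ops" and is sized so
 * that ops lands at 4 * sizeof(void *), i.e. 32 bytes on a 64-bit build.
 */
struct mock_dst_entry {
    void *preceding[4];
    struct mock_dst_ops *ops;
};

/* Byte offset of 'ops' inside the mock struct. */
static size_t mock_ops_offset(void)
{
    return offsetof(struct mock_dst_entry, ops);
}

/*
 * Address the CPU would access when loading base->ops.
 * With base == 0 (a NULL dst_orig) this is just the member offset.
 */
static uintptr_t fault_address(uintptr_t base)
{
    return base + mock_ops_offset();
}
```

Calling fault_address(0) on a 64-bit build yields 32 (0x20), the same
value as the address in the "unable to handle kernel NULL pointer
dereference" line.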

In the forwarding case, we get dst_orig from the skb and dst_orig
can't be NULL here unless the skb itself is already fishy.

Can you provide the following information:

- Your kernel config

- The output of 'ip x p' and 'ip x s' (short for 'ip xfrm policy' and
  'ip xfrm state')

- An object dump of xfrm_policy.o, if possible:
  'objdump -d -S net/xfrm/xfrm_policy.o'
  (The path to xfrm_policy.o depends on how you build your kernels)