Message-ID: <20180108135347.3gvr5dhkzqb7pzhm@arbeitstier>
Date: Mon, 8 Jan 2018 14:53:48 +0100
From: Tobias Hommel <netdev-list@...oetigt.de>
To: Steffen Klassert <steffen.klassert@...unet.com>
Cc: netdev@...r.kernel.org
Subject: Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm_lookup

On Mon, Jan 08, 2018 at 12:36:08PM +0000, Steffen Klassert wrote:
> On Fri, Jan 05, 2018 at 10:13:23PM +0100, Tobias Hommel wrote:
> > Hi,
> >
> > I'm running into a NULL pointer dereference after updating from Linux 4.1.6 to
> > 4.14.11 (see kernel log below). I tried 4.14.3 initially which did not work
> > either.
> > Does anyone have an idea what is happening here?
> >
> > The affected machine has 2 active ethernet interfaces (igb driver) and acts as
> > a VPN gateway running strongswan. There are several hundred IPsec
> > roadwarriors connecting to eth1. eth0 connects to an infrastructure running an
> > HTTP server.
> > During my tests these roadwarriors connect to the gateway, sometimes download a
> > large file from the HTTP server, disconnect and after a random delay repeat
> > these steps.
> >
> > Some observations I made:
> > * SMP Affinity for IRQs of the NICs Rx/Tx queues (/proc/irq/$IRQ/smp_affinity)
> >   * all affinities set to default ff is broken
> >   * setting affinity for all queues of both interfaces to the same CPU seems to
> >     work fine (running stable for more than 1 day now)
> >   * setting affinity of eth0 queues to CPU 1 and affinity of eth1 queues to CPU
> >     2 is broken and seems to always trigger the bug on CPU 1
> > * the top 6 entries of the call trace are the same every time the system
> >   crashes, the other entries differ sometimes
> >
> > The bug is 100% reproducible on the Intel Atom machine from the log below and
> > also on a HP ProLiant Gen6 (also igb driver).
> > I can, of course, provide further information (CPU, NIC, kernel config, more
> > traces, etc.) if required.
> > If helpful I could also run tests on HP ProLiant Gen9 which has different NICs
> > (tg3).
> >
> > [ 7998.489094] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > [ 7998.496993] IP: xfrm_lookup+0x2a/0x7e0
>
> xfrm_lookup+0x2a is at the very beginning of xfrm_lookup(), here we
> find:
>
> u16 family = dst_orig->ops->family;
>
> ops has an offset of 32 bytes (20 hex) in dst_orig, so looks like
> dst_orig is NULL.
>
> In the forwarding case, we get dst_orig from the skb and dst_orig
> can't be NULL here unless the skb itself is already fishy.
>
> Can you provide the following information:
>
> - Your kernel config
>
Attached as kernel-4.14.12.config
> - The output of 'ip x p' and 'ip x s'
>
Attached as ipxs.output and ipxp.output.
NOTE: These command outputs were captured "some seconds" before the crash. As
the roadwarriors are scripted, it was not possible to capture the state at the
time of the crash.
If this is a problem, I could try to reproduce the crash with fewer
roadwarriors.
> - An object dump of xfrm_policy.o if possible 'objdump -d -S net/xfrm/xfrm_policy.o'
> (The path to xfrm_policy.o depends on how you build your kernels)
>
Attached as xfrm_policy.objdump
I also attached a panic-4.14.12.log which was created using the same kernel
to which the objdump belongs.
View attachment "ipxp.output" of type "text/plain" (68526 bytes)
View attachment "ipxs.output" of type "text/plain" (110230 bytes)
View attachment "kernel-4.14.12.config" of type "text/plain" (102638 bytes)
View attachment "panic-4.14.12.log" of type "text/plain" (3337 bytes)
View attachment "xfrm_policy.objdump" of type "text/plain" (363567 bytes)