netdev - Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm


Message-ID: <20180119144546.2yfcil4c6sga5l7h@arbeitstier>
Date:   Fri, 19 Jan 2018 15:45:46 +0100
From:   Tobias Hommel <netdev-list@...oetigt.de>
To:     Steffen Klassert <steffen.klassert@...unet.com>
Cc:     netdev@...r.kernel.org
Subject: Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in
 xfrm_lookup

On Wed, Jan 10, 2018 at 10:03:05AM +0100, Tobias Hommel wrote:
> On Wed, Jan 10, 2018 at 08:30:38AM +0100, Steffen Klassert wrote:
> > On Tue, Jan 09, 2018 at 03:49:21PM +0100, Tobias Hommel wrote:
> > > 
> > > I copied the config from my 4.14.12 sources to a fresh 4.13.16 source tree, ran
> > > `make olddefconfig` and built a new kernel.
> > > The kernel config is attached as kernel-4.13.16.config.
> > > The panic*.log files are kernel logs from different crashes of this 4.13.16
> > > kernel, but all from the same scenario as before.
> > > I also enabled CONFIG_DEBUG_INFO, so if any disassemblies are required, I'd be
> > > happy to provide them.
> > > 
> > > So, the system still crashes, but the traces are completely different from
> > > those with 4.14.12. This time there are also WARNINGs and "refcnt: -1" messages
> > > sometimes before the actual panic, so not sure if there is maybe some other
> > > problem. Still, the crashes all seem to be related to ip routing somehow.
> > 
> > Strange, you must do something that other people don't do.
> > Do you have some uncommon netfiler rules, namespaces, etc?
> No, no namespaces yet.
> However, the box uses marks and routing based on marks. Firewall marks are a
> bit strange sometimes, so I'll try to clean up everything and see if it is
> possible to reproduce the bug without marks.

I tried to strip down the system configuration and was able to reproduce the
problem with a minimal configuration:
* ipsets are not used anymore
* no firewall markings are used any longer
* iptables are "completely empty", i.e. all policies set to ACCEPT and there is
  no rule in any table
* no additional routing policies (ip rule) except the default ones
* only main routing table is used
* using a "minimal" kernel config:
 * run `make defconfig`
 * add basic things (ESP, IGB driver, some crypto algorithms)
 * add options required to boot up the system (TPM crypt, some device mapper
   options, overlayfs)
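
For anyone who wants to build a similar config, the steps above can be sketched
roughly as follows. This is only a sketch: the option names are assumptions
picked to match the categories in the list (ESP, IGB, a crypto algorithm, device
mapper crypt, overlayfs) and are not copied from minimal.config, and
`build_minimal_config` is a hypothetical helper name. It has to be run from the
top of a kernel source tree.

```shell
# Sketch only: option names are assumptions, not taken from minimal.config.
# Run from the top of a kernel source tree.
build_minimal_config() {
    # Start from the architecture default config.
    make defconfig &&
    # Enable a few options on top of it (each is set to =y).
    scripts/config --file .config \
        --enable INET_ESP \
        --enable IGB \
        --enable CRYPTO_GCM \
        --enable DM_CRYPT \
        --enable OVERLAY_FS &&
    # Let Kconfig resolve dependencies and fill in defaults for new symbols.
    make olddefconfig
}
```

The attached minimal.config remains the authoritative list of what was actually
enabled.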

I attached the minimal config (minimal.config) and the defconfig for reference
(minimal.defconfig).

The setup is really simple now: the gateway forwards HTTP connections between
eth1 (IPsec tunnels) and eth0 without any firewall or NAT whatsoever.

The only unusual thing I can think of is the rather aggressive roadwarrior
clients. There are 750 roadwarriors, controlled by a script which starts and
stops the IPsec connection. Sometimes the clients are also instructed to start
an HTTP download. Sometimes the clients are stopped the hard way (kill -9), so
their SAs are not removed on the gateway. The clients reconnect after a random
interval (sometimes immediately) and sometimes also immediately start a new
HTTP download.
Maybe something is wrong with strongSwan removing old SAs and creating new ones
for "the same client"? Maybe while the kernel is processing an HTTP packet from
an old client connection, a new SA for the same client is set up and then the
routing lookup fails (I only know that xfrm is involved in routing lookups, but
I'm no expert here)?

> 
> > 
> > Please try to build your kernels with
> > 
> > CONFIG_ORC_UNWINDER (v4.14 and above)
> > 
> > and
> > 
> > CONFIG_KASAN
> > 
> > This can give some better debug informations (depends on the compiler
> > version).
> I'll also try that. I'm currently using GCC 5.4.0.
> 
> > 
> > There are some things we can do now:
> > 
> > - Try v4.15-rc7, just to be sure that we don't search for
> >   something that is already fixed.
> And that one, too. All this will probably take some time though. ;-)
> I'll keep you informed.
I tried 4.15-rc8 and have the same problem there (see attached
kernel-4.15-rc8.log). SMP affinity for IRQs has changed in 4.15 and something's
broken there ("do_IRQ: 0.41 No irq handler for vector"). Although I could not
spread IRQs over all cores, I was able to "pin" different IRQs to different
cores and reproduce the problem.
Also, KASAN is reporting some "use-after-free" during startup in the page
poisoning code. So I disabled page poisoning "to get rid of this bug", but the
problem persists.
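
For reference, pinning an IRQ to a single core goes through
/proc/irq/<N>/smp_affinity, which takes a hexadecimal CPU bitmask. A minimal
sketch is below. The helper names `cpu_mask` and `pin_irq` and the IRQ numbers
in the example are hypothetical, and the mask helper is only valid for CPU
indices below 32, because larger masks need the comma-separated group format.

```shell
# cpu_mask N: print the hexadecimal affinity mask selecting only CPU N.
# Valid for N < 32; larger masks need comma-separated 32-bit groups.
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

# pin_irq IRQ CPU: restrict the given IRQ to a single CPU (needs root).
pin_irq() {
    cpu_mask "$2" > "/proc/irq/$1/smp_affinity"
}

# Example with hypothetical IRQ numbers:
#   pin_irq 41 0
#   pin_irq 42 1
```

If irqbalance is running it may rewrite these masks, so it would have to be
stopped first for the pinning to stick.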

> 
> > 
> > - Find a working kernel version and try to bisect.
> > 
> > - Minimalize the configuration with that the bug happens,
> >   so that I can try to reproduce it here.
> > 

View attachment "minimal.config" of type "text/plain" (115780 bytes)

View attachment "minimal.defconfig" of type "text/plain" (115182 bytes)

View attachment "kernel-4.15-rc8.log" of type "text/plain" (52642 bytes)