Date: Thu, 10 Oct 2019 11:46:08 +0300
From: Ido Schimmel <idosch@...sch.org>
To: Jesse Hathaway <jesse@...ki-mvuki.org>
Cc: netdev@...r.kernel.org
Subject: Re: Race condition in route lookup
On Thu, Oct 10, 2019 at 11:31:04AM +0300, Ido Schimmel wrote:
> On Wed, Oct 09, 2019 at 11:00:07AM -0500, Jesse Hathaway wrote:
> > We have been experiencing a route lookup race condition on our internet-facing
> > Linux routers. I have been able to reproduce the issue, but would love more
> > help in isolating the cause.
> >
> > Looking up a route found in the main table returns `*` rather than the directly
> > connected interface about once for every 10-20 million requests. From my
> > reading of the iproute2 source code, an asterisk indicates that the kernel
> > returned an interface index of 0 rather than the index of the directly
> > connected interface.
> >
> > This is reproducible with the following bash snippet on 5.4-rc2:
> >
> > $ cat route-race
> > #!/bin/bash
> >
> > # Generate 50 million individual route gets to feed as batch input to `ip`
> > function ip-cmds() {
> >     route_get='route get 192.168.11.142 from 192.168.180.10 iif vlan180'
> >     for ((i = 0; i < 50000000; i++)); do
> >         printf '%s\n' "${route_get}"
> >     done
> > }
> >
> > ip-cmds | ip -d -o -batch - | grep -E 'dev \*' | uniq -c
> >
> > Example output:
> >
> > $ ./route-race
> > 6 unicast 192.168.11.142 from 192.168.180.10 dev * table main
> > \ cache iif vlan180
> >
> > These routers have multiple routing tables and are ingesting full BGP routing
> > tables from multiple ISPs:
> >
> > $ ip route show table all | wc -l
> > 3105543
> >
> > $ ip route show table main | wc -l
> > 54
> >
> > Please let me know what other information I can provide, thanks in advance,
>
> I think it's working as expected. Here is my theory:
>
> If CPU0 is executing both the route get request and forwarding packets
> through the directly connected interface, then the following can happen:
>
> <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
Sorry, only the output path is per-CPU. See commit d26b3a7c4b3b ("ipv4:
percpu nh_rth_output cache"). I indeed see the issue regardless of the
CPU on which I run the route get request.
> is found. Not yet dumped to user space
>
> <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> cache by bumping 'net->ipv4.rt_genid'
>
> <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> cached dst entry is found to be invalid. Therefore, it is replaced by a
> newer dst entry. dst_dev_put() is called on the old entry, which assigns
> blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> it is not registered.
>
> <CPU0, t3> - After the softirq finishes executing, your route get request
> from t0 resumes and the old dst entry is dumped to user space with an
> ifindex of 0.
>
> I tested this on my system using your script to generate the route get
> requests, pinned to the same CPU that forwards packets through the
> nexthop. To constantly invalidate the cache, I ran another script that
> simply adds and removes IP addresses on an interface.
>
> If I stop the packet forwarding or the script that invalidates the
> cache, then I don't see any '*' answers to my route get requests.
>
> BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> with older kernel versions you'll see 'lo' instead of '*'.