Message-ID: <TY1PR0301MB10074DC6D1F5CE4F4B5AF7B5A0920@TY1PR0301MB1007.apcprd03.prod.outlook.com>
Date: Wed, 16 May 2018 01:51:36 +0000
From: Hirotaka Yamamoto <ymmt@...ozu.com>
To: "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: ECMP routing: problematic selection of outgoing interface
Hi,
I recently built a highly available network in which a Linux host uses
an ECMP route across two isolated L2 switches, as follows:
    Router -- ToR switch 1 ----- Linux
      |       192.168.11.1/24 |    eth0: 192.168.11.2/24
      |                       |    eth1: 192.168.12.2/24
      +------ ToR switch 2 ---+
              192.168.12.1/24
The default route was configured with:
    $ sudo ip route add default \
        nexthop via 192.168.11.1 \
        nexthop via 192.168.12.1
I then found that Linux chooses the wrong outgoing device for some
destination/source address pairs, for example:
    $ ip route get 12.34.56.78 from 192.168.12.2
    12.34.56.78 from 192.168.12.2 via 192.168.11.1 dev eth0 uid 0
    # dev should be "eth1"
As a consequence, programs such as SSH or curl do not work for such
destinations, because the routers drop packets whose source address
does not belong to the subnet they arrive from.
Unbound sockets suffer from this problem as well. My guess is that
Linux chooses a source address first and then picks the outgoing
device independently, which can yield a mismatched pair.
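A minimal sketch of how such a mismatch can be checked mechanically, using the sample "ip route get" output shown earlier. The helper name route_dev is hypothetical, and the expected device is an assumption based on the address assignment in the diagram (192.168.12.2 on eth1):

```shell
# Hypothetical helper: print the device name from one line of
# `ip route get` output, i.e. the word that follows "dev".
route_dev() {
    printf '%s\n' "$1" | sed -n 's/^.* dev \([^ ]*\).*$/\1/p'
}

# Sample line copied from the output quoted in this message.
line='12.34.56.78 from 192.168.12.2 via 192.168.11.1 dev eth0 uid 0'
# Assumption: 192.168.12.2 is assigned to eth1, as in the diagram.
expected=eth1

actual=$(route_dev "$line")
if [ "$actual" != "$expected" ]; then
    echo "mismatch: expected $expected, got $actual"
fi
```

On a live system the sample line would be replaced by the output of "ip -o route get DEST from SRC".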
Although I believe this is a bug in Linux, I found a possibly relevant
comment in the function ip_route_output_key_hash_rcu() in net/ipv4/route.c:
    /* I removed check for oif == dev_out->oif here.
       It was wrong for two reasons:
       1. ip_dev_find(net, saddr) can return wrong iface, if saddr
          is assigned to multiple interfaces.
       2. Moreover, we are allowed to send packets with saddr
          of another iface. --ANK
     */
Given point 2 of that comment, I wonder whether this behavior is intended.
So, my questions are:
1. Is this behavior intended or not?
2. If it is intended, how can I make programs work in this ECMP network?
I have written a simple script to reproduce the problem (attached below).
The script creates a dedicated network namespace "testns" and configures
an ECMP route inside it.
So far, I can reproduce the problem with these Linux versions:
- 4.17-rc5 (Upstream)
- 4.15.0-20-generic (Ubuntu 18.04)
- 4.14.32-coreos (CoreOS)
- 4.13.0-37-generic (Ubuntu 16.04 HWE)
- 4.4.0-116-generic (Ubuntu 16.04)
Note that the problem is not limited to the default route.
Any route configured as ECMP can cause the problem.
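As a sketch of what a non-default ECMP route looks like, the helper below only prints the "ip route add" command for a given prefix and list of gateways; it does not install anything. The function name ecmp_route_cmd and the prefix 10.0.0.0/8 are hypothetical and chosen purely for illustration:

```shell
# Hypothetical helper: print (not run) the command that would install
# an ECMP route for a prefix over the given gateways.
ecmp_route_cmd() {
    prefix=$1
    shift
    cmd="ip route add $prefix"
    for gw in "$@"; do
        cmd="$cmd nexthop via $gw"
    done
    printf '%s\n' "$cmd"
}

# 10.0.0.0/8 is an arbitrary example prefix, not part of the setup above.
ecmp_route_cmd 10.0.0.0/8 192.168.11.1 192.168.12.1
```

Running the printed command requires root; the same "ip route get ... from ..." check can then be applied to a destination inside that prefix.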
- ymmt
#!/bin/sh -e
NS=testns
BR1=testbr1
VETH1=testveth1
BR2=testbr2
VETH2=testveth2
LINKS="$VETH1 $VETH2 $BR1 $BR2"
NET1=192.168.11.xx/24
NET2=192.168.12.xx/24
IPNS="ip netns exec $NS ip"
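# Remove the test links and the namespace, if they exist.
clean() {
    for l in $LINKS; do
        if ip -o link show "$l" >/dev/null 2>&1; then
            ip link del "$l"
        fi
    done
    # -w avoids matching namespaces whose names merely contain $NS.
    if ip netns list | grep -qw "$NS"; then
        ip netns del "$NS"
    fi
}
# Clean up on common signals and on normal exit (0).
trap clean INT QUIT TERM HUP PIPE 0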
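# Replace the "xx" placeholder in a network template with a host number.
make_address() {
    local net addr
    net=$1
    addr=$2
    echo "$net" | sed "s/xx/$addr/"
}
# Strip the prefix length from an address in CIDR notation.
cidr2ip() {
    echo "$1" | cut -d / -f 1
}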
GW1=$(make_address $NET1 1)
GW2=$(make_address $NET2 1)
ADDR1=$(make_address $NET1 2)
ADDR2=$(make_address $NET2 2)
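# Create a bridge on the host and attach one end of a veth pair to it;
# the other end is moved into the namespace and renamed to $dest.
setup_veth() {
    local br veth dest
    br=$1
    veth=$2
    dest=$3
    ip link add "$br" type bridge
    ip link add "$veth" type veth peer name "${veth}_"
    ip link set "$br" up
    ip link set "$veth" master "$br" up
    ip link set "${veth}_" netns "$NS" name "$dest" up
}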
setup() {
    ip netns add "$NS"
    $IPNS link set lo up
    setup_veth "$BR1" "$VETH1" eth0
    setup_veth "$BR2" "$VETH2" eth1
    # The host-side bridges act as the two gateways.
    ip addr add "$GW1" dev "$BR1"
    ip addr add "$GW2" dev "$BR2"
    $IPNS addr add "$ADDR1" dev eth0
    $IPNS addr add "$ADDR2" dev eth1
    # ECMP default route over both gateways.
    $IPNS route add 0.0.0.0/0 nexthop via "$(cidr2ip "$GW1")" nexthop via "$(cidr2ip "$GW2")"
}
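# Report when the route chosen for dest/from does not use the expected dev.
test_route_from() {
    local dest dev from r rdev
    dest=$1
    dev=$2
    from=$3
    r=$($IPNS -o route get "$dest" from "$from")
    # Extract the outgoing device name (ethN) from the route output.
    rdev=$(printf "%s\n" "$r" | sed -nr 's/^.*dev (eth[[:digit:]]+).*/\1/p')
    if [ "$dev" != "$rdev" ]; then
        echo "WRONG dev/from pair: ip -o route get $dest from $from:"
        printf "%s\n" "$r"
    fi
}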
test_route() {
    test_route_from "$1" eth0 "$(cidr2ip "$ADDR1")"
    test_route_from "$1" eth1 "$(cidr2ip "$ADDR2")"
}
run_tests() {
    test_route 12.34.56.78
    test_route 216.58.200.160
    test_route 216.58.200.161
    test_route 216.58.200.162
    test_route 216.58.200.163
    test_route 216.58.200.164
    test_route 52.85.149.10
    test_route 52.85.149.11
    test_route 52.85.149.12
    test_route 52.85.149.13
    test_route 52.85.149.14
}
# main
setup
run_tests
read -p "Press enter to finish" ret