Message-ID: <SJ0PR84MB20881BEC4AC4703A84425045D8D72@SJ0PR84MB2088.NAMPRD84.PROD.OUTLOOK.COM>
Date: Thu, 27 Jun 2024 23:53:21 +0000
From: "Muggeridge, Matt" <matt.muggeridge2@....com>
To: "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Wrong nexthop selection with two default routers where only one is
REACHABLE
Hi,
This appears to be a bug in Linux kernel networking. This was observed on a fresh install of Ubuntu 24.04, with Linux 6.8.0-36-generic.
* PROBLEM
In the network diagram below, I have two default routers (TR1 and TR2). The HUT has two neighbor cache entries: TR1=REACHABLE and TR2=INCOMPLETE. When I ping the host (HUT) from a remote test node (TN2) via TR1, the HUT sends a NS for TR2 when it should have replied directly via TR1. This breaks communication and violates IPv6 Logo compliance.
            TN2
             |
    +--------+--------+
    |                 |
   TR1               TR2
(REACHABLE)      (INCOMPLETE)
    |                 |
    +--------+--------+
             |
            HUT
The RFC for Neighbor Discovery describes the policy for selecting routes from the Default Router List. The relevant bullet is extracted below.
https://datatracker.ietf.org/doc/html/rfc4861#section-6.3.6
| The policy for selecting routers from the Default Router List is as
| follows:
|
| 1) Routers that are reachable or probably reachable (i.e., in any
|    state other than INCOMPLETE) SHOULD be preferred over routers
|    whose reachability is unknown or suspect (i.e., in the
|    INCOMPLETE state, or for which no Neighbor Cache entry exists).
|    Further implementation hints on default router selection when
|    multiple equivalent routers are available are discussed in
|    [LD-SHRE].
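To make the quoted preference rule concrete, here is a minimal Python sketch of it. It is an illustration only and does not reflect how the kernel implements route selection. The `Router` type, the string state names, and `select_default_router` are hypothetical names introduced for this example, and the fallback (take the first configured router when none is known to be reachable) is a simplification rather than the full RFC policy.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# States treated as "unknown or suspect" by rule 1 of RFC 4861 section 6.3.6.
# None models "no Neighbor Cache entry exists".
SUSPECT_STATES = {"INCOMPLETE", None}


@dataclass(frozen=True)
class Router:
    address: str
    nud_state: Optional[str]  # e.g. "REACHABLE", "STALE", "INCOMPLETE", or None


def select_default_router(routers: Sequence[Router]) -> Optional[Router]:
    """Prefer any router whose neighbor state is not INCOMPLETE or missing."""
    preferred = [r for r in routers if r.nud_state not in SUSPECT_STATES]
    if preferred:
        return preferred[0]
    # Simplified fallback (assumption): no router is known to be reachable,
    # so return the first configured one, if any.
    return routers[0] if routers else None


# The situation in this report: TR1 is REACHABLE, TR2 is INCOMPLETE.
tr1 = Router("fe80::200:10ff:fe10:1060", "REACHABLE")
tr2 = Router("fe80::200:10ff:fe10:1061", "INCOMPLETE")
chosen = select_default_router([tr2, tr1])
print(chosen.address)
```

Under this reading of the rule, TR1 is selected regardless of the order in which the two routers are listed.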
* REPRODUCER
This condition is created by configuring two routers under systemd-networkd, either by having each router send an RA, or statically configuring one router at a time. I show the steps for the static configuration below.
Assuming you have an interface named "enp0s9" and you're using systemd-networkd as the network manager:
1. Configure the Host (HUT) with one router (TR1)
$ networkctl cat 10-enp0s9.network
# /etc/systemd/network/10-enp0s9.network
[Match]
Name=enp0s9
[Link]
RequiredForOnline=no
[Network]
Description="Internal Network: Private VM-to-VM IPv6 interface"
DHCP=no
LLDP=no
EmitLLDP=no
# /etc/systemd/network/10-enp0s9.network.d/address.conf
[Network]
Address=2001:2:0:1000:a00:27ff:fe5f:f72d/64
# /etc/systemd/network/10-enp0s9.network.d/route-1060.conf
[Route]
Gateway=fe80::200:10ff:fe10:1060
GatewayOnLink=true
2. Start or reload the configuration
$ sudo networkctl reload
$ sudo networkctl reconfigure enp0s9
$ ip -6 r
2001:2:0:1000::/64 dev enp0s9 proto kernel metric 256 pref medium
fe80::/64 dev enp0s3 proto kernel metric 256 pref medium
fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
default via fe80::200:10ff:fe10:1060 dev enp0s9 proto static metric 1024 onlink pref medium
3. Flush and Monitor the neighbor cache
$ sudo ip -6 neigh flush all; ip -6 -ts monitor neigh
4. From TN1, ping HUT via TR1 - the HUT's NCE is updated to REACHABLE
[2024-06-28T08:13:27.617674] fe80::200:10ff:fe10:1060 dev enp0s9 lladdr 00:00:10:10:10:60 router REACHABLE
NOTE: tcpdump shows the expected protocol exchange.
5. Configure the Host (HUT) with a 2nd router (TR2)
$ cat /etc/systemd/network/10-enp0s9.network.d/route-1061.conf
[Route]
Gateway=fe80::200:10ff:fe10:1061
GatewayOnLink=true
$ sudo networkctl reload
$ sudo networkctl reconfigure enp0s9
$ ip -6 r
2001:2:0:1000::/64 dev enp0s9 proto kernel metric 256 pref medium
fe80::/64 dev enp0s3 proto kernel metric 256 pref medium
fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
default proto static metric 1024 pref medium
     nexthop via fe80::200:10ff:fe10:1061 dev enp0s9 weight 1
     nexthop via fe80::200:10ff:fe10:1060 dev enp0s9 weight 1
6. Start monitoring traffic with tcpdump/Wireshark
7. From TN1, ping HUT via TR1
a. An echo reply is never received
b. The protocol exchange shows the HUT sends a NS for TR2 (which is NOT REACHABLE) when it should have sent an echo-reply via TR1 (which is REACHABLE).
* OBSERVATIONS
1. When NOT using systemd-networkd and each router sends an RA, the kernel behaves correctly.
2. The routing table looks different, depending on whether the kernel adds the route or systemd-networkd adds the route. E.g.
a. Kernel adds two separate "default route" entries (systemd-networkd is stopped)
$ ip -6 route
<deleted lines>
default via fe80::200:10ff:fe10:1060 proto ra metric 1024 expires 39sec hoplimit 64 pref medium
default via fe80::200:10ff:fe10:1061 proto ra metric 1024 expires 44sec hoplimit 64 pref medium
b. Systemd-networkd adds one "default route" with two nexthop options (systemd-networkd is running)
$ ip -6 route
<deleted lines>
default proto ra metric 1024 expires 589sec pref medium
     nexthop via fe80::200:10ff:fe10:1060 dev enp0s9 weight 1
     nexthop via fe80::200:10ff:fe10:1061 dev enp0s9 weight 1
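The two routing table shapes above can be told apart mechanically. The following Python sketch parses the plain-text output of `ip -6 route` and groups gateways per default route entry. It assumes only the line layout shown in this report ("default via <gw> ..." for a single gateway, indented "nexthop via <gw> ..." continuation lines for a multipath route); `default_route_gateways` and `has_multipath_default` are hypothetical helper names, not part of any tool.

```python
from typing import List


def default_route_gateways(ip_route_output: str) -> List[List[str]]:
    """Return one list of gateway addresses per 'default' route entry."""
    routes: List[List[str]] = []
    in_default = False
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "default":
            in_default = True
            # Single-gateway form: "default via <gw> ..."
            if len(fields) > 2 and fields[1] == "via":
                routes.append([fields[2]])
            else:
                # Multipath form: gateways follow on "nexthop" lines.
                routes.append([])
        elif fields[0] == "nexthop" and in_default:
            if len(fields) > 2 and fields[1] == "via":
                routes[-1].append(fields[2])
        else:
            in_default = False
    return routes


def has_multipath_default(ip_route_output: str) -> bool:
    """True if any default route entry carries more than one nexthop."""
    return any(len(gws) > 1 for gws in default_route_gateways(ip_route_output))


networkd_output = """\
default proto ra metric 1024 expires 589sec pref medium
 nexthop via fe80::200:10ff:fe10:1060 dev enp0s9 weight 1
 nexthop via fe80::200:10ff:fe10:1061 dev enp0s9 weight 1
"""
print(default_route_gateways(networkd_output))
print(has_multipath_default(networkd_output))
```

For case (a) the helper yields two entries with one gateway each; for case (b) it yields a single entry with two gateways, which is the configuration in which the problem was observed.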
* TCPDUMP
For completeness, here is the annotated output from tcpdump.
$ tcpdump -r ~/v6LC_2_2_11-bug-report-summary.pcapng -t -n --number -e
reading from file /home/matt/v6LC_2_2_11-bug-report-summary.pcapng, link-type EN10MB (Ethernet), snapshot length 262144
    # Step 4: TN1(1181) pings HUT(f72d) via TR1(1060)
    1  00:00:10:10:10:60 > 08:00:27:5f:f7:2d, ethertype IPv6 (0x86dd), length 70: 2001:2:0:1001:200:10ff:fe10:1181 > 2001:2:0:1000:a00:27ff:fe5f:f72d: ICMP6, echo request, id 0, seq 0, length 16
    2  08:00:27:5f:f7:2d > 33:33:ff:10:10:60, ethertype IPv6 (0x86dd), length 86: 2001:2:0:1000:a00:27ff:fe5f:f72d > ff02::1:ff10:1060: ICMP6, neighbor solicitation, who has fe80::200:10ff:fe10:1060, length 32
    3  00:00:10:10:10:60 > 08:00:27:5f:f7:2d, ethertype IPv6 (0x86dd), length 86: fe80::200:10ff:fe10:1060 > fe80::a00:27ff:fe5f:f72d: ICMP6, neighbor advertisement, tgt is fe80::200:10ff:fe10:1060, length 32
    4  08:00:27:5f:f7:2d > 00:00:10:10:10:60, ethertype IPv6 (0x86dd), length 70: 2001:2:0:1000:a00:27ff:fe5f:f72d > 2001:2:0:1001:200:10ff:fe10:1181: ICMP6, echo reply, id 0, seq 0, length 16
    # HUT has replied to TN1 via TR1. NCE for TR1=REACHABLE
    # Step 5: Now configure TR2
    # Step 7: TN1(1181) pings HUT(f72d) via TR1(1060)
    5  00:00:10:10:10:60 > 08:00:27:5f:f7:2d, ethertype IPv6 (0x86dd), length 70: 2001:2:0:1001:200:10ff:fe10:1181 > 2001:2:0:1000:a00:27ff:fe5f:f72d: ICMP6, echo request, id 0, seq 0, length 16
    # HUT creates an NCE for TR2=INCOMPLETE
    # HUT incorrectly sends NS for TR2(1061) when it should have sent echo-reply via TR1(1060)
    6  08:00:27:5f:f7:2d > 33:33:ff:10:10:61, ethertype IPv6 (0x86dd), length 86: 2001:2:0:1000:a00:27ff:fe5f:f72d > ff02::1:ff10:1061: ICMP6, neighbor solicitation, who has fe80::200:10ff:fe10:1061, length 32
    7  08:00:27:5f:f7:2d > 33:33:ff:10:10:61, ethertype IPv6 (0x86dd), length 86: 2001:2:0:1000:a00:27ff:fe5f:f72d > ff02::1:ff10:1061: ICMP6, neighbor solicitation, who has fe80::200:10ff:fe10:1061, length 32
    8  08:00:27:5f:f7:2d > 33:33:ff:10:10:61, ethertype IPv6 (0x86dd), length 86: 2001:2:0:1000:a00:27ff:fe5f:f72d > ff02::1:ff10:1061: ICMP6, neighbor solicitation, who has fe80::200:10ff:fe10:1061, length 32
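For readers matching the capture to the routers: the destination addresses in packets 6 to 8 are the solicited-node multicast address and the corresponding Ethernet multicast MAC for TR2's link-local address. A small Python sketch of that standard derivation follows (low 24 bits of the target appended to ff02::1:ff00:0/104, then 33:33 followed by the low 32 bits of the group address). The function names are hypothetical and chosen for this example.

```python
import ipaddress


def solicited_node_multicast(target: str) -> ipaddress.IPv6Address:
    """Solicited-node multicast group for an IPv6 unicast address."""
    low24 = int(ipaddress.IPv6Address(target)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


def multicast_mac(group: ipaddress.IPv6Address) -> str:
    """Ethernet destination MAC used for an IPv6 multicast group."""
    low32 = group.packed[-4:]
    return "33:33:" + ":".join(f"{byte:02x}" for byte in low32)


tr2 = "fe80::200:10ff:fe10:1061"
group = solicited_node_multicast(tr2)
print(group, multicast_mac(group))
```

For TR2 this gives ff02::1:ff10:1061 and 33:33:ff:10:10:61, which are the destinations seen in the three neighbor solicitations, confirming that the HUT is trying to resolve TR2 rather than replying via TR1.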
Regards,
Matt.