Message-ID: <SJ0PR84MB2088B1B93C75A4AAC5B90490D8632@SJ0PR84MB2088.NAMPRD84.PROD.OUTLOOK.COM>
Date: Thu, 19 Sep 2024 01:43:03 +0000
From: "Muggeridge, Matt" <matt.muggeridge2@....com>
To: "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: [PATCH net-next] Netlink flag for creating IPv6 Default Routes
From 95c6e5c898d3eef3e6b37e9b6e238bf6b65cc57b Mon Sep 17 00:00:00 2001
From: Matt Muggeridge <Matt.Muggeridge@....com>
Date: Wed, 18 Sep 2024 21:29:31 -0400
Subject: [PATCH net-next] Netlink flag for creating IPv6 Default Routes
For IPv6, a netlink client is currently unable to create default routes
in the same manner as the kernel. This leads to failures when there are
multiple default routers, because their routes are coalesced into a
single ECMP route. When one of the ECMP default routers becomes
UNREACHABLE, it can still be selected as the nexthop.
When the kernel processes the RAs from multiple default routers, it sets
the fib6_flags: RTF_ADDRCONF | RTF_DEFAULT. The RTF_ADDRCONF flag is
checked by rt6_qualify_for_ecmp(), which returns false when ADDRCONF is
set. As such, the kernel creates separate default routes.
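The check described above can be sketched in userspace C as follows. This
is a simplified model, not the kernel source: struct route_sketch and
qualifies_for_ecmp() are hypothetical names, and the flag values are
assumed to mirror include/uapi/linux/ipv6_route.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror include/uapi/linux/ipv6_route.h. */
#define RTF_DEFAULT  0x00010000u /* default route learned via ND */
#define RTF_ADDRCONF 0x00040000u /* addrconf route, learned from an RA */

/* Hypothetical, simplified stand-in for the kernel's struct fib6_info. */
struct route_sketch {
	uint32_t flags;       /* models fib6_flags */
	bool     has_gateway; /* models a nexthop that has a gateway */
};

/*
 * Models the RTF_ADDRCONF test: a route carrying RTF_ADDRCONF never
 * qualifies for ECMP, so RA-learned default routes stay separate.
 */
static bool qualifies_for_ecmp(const struct route_sketch *rt)
{
	assert(rt != NULL);
	return !(rt->flags & RTF_ADDRCONF) && rt->has_gateway;
}
```

In this model, a route with RTF_ADDRCONF | RTF_DEFAULT is rejected for
ECMP, while an otherwise identical route without RTF_ADDRCONF is accepted.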
E.g. compare the routing tables when RAs are processed by the kernel
versus a netlink client (systemd-networkd in my case).
1) RA Processed by kernel (accept_ra = 2)
$ ip -6 route
2001:2:0:1000::/64 dev enp0s9 proto kernel metric 256 expires 65531sec pref medium
fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
default via fe80::200:10ff:fe10:1060 dev enp0s9 proto ra metric 1024 expires 595sec hoplimit 64 pref medium
default via fe80::200:10ff:fe10:1061 dev enp0s9 proto ra metric 1024 expires 596sec hoplimit 64 pref medium
2) RA Processed by netlink client (accept_ra = 0)
$ ip -6 route
2001:2:0:1000::/64 dev enp0s9 proto ra metric 1024 expires 65531sec pref medium
fe80::/64 dev enp0s3 proto kernel metric 256 pref medium
fe80::/64 dev enp0s9 proto kernel metric 256 pref medium
default proto ra metric 1024 expires 595sec pref medium
	nexthop via fe80::200:10ff:fe10:1060 dev enp0s9 weight 1
	nexthop via fe80::200:10ff:fe10:1061 dev enp0s9 weight 1
IPv6 netlink clients need a mechanism to identify a route as coming from
an RA, i.e. a netlink client needs a way to set the kernel flags:
    RTF_ADDRCONF | RTF_DEFAULT
This is needed when there are multiple default routers that each send
an RA. Setting the RTF_ADDRCONF flag ensures their fib entries do not
qualify for ECMP routes, see rt6_qualify_for_ecmp().
To achieve this, introduce a user-level flag RTM_F_RA_ROUTER that a
netlink client can pass to the kernel.
A Netlink user-level network manager, such as systemd-networkd, may set
the RTM_F_RA_ROUTER flag in the Netlink RTM_NEWROUTE rtmsg. When set,
the kernel sets RTF_RA_ROUTER in the fib6_config fc_flags. This causes a
default route to be created in the same way as if the kernel processed
the RA, via rt6_add_dflt_router().
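As a rough illustration of the client side, the fixed part of such an
RTM_NEWROUTE request might be filled in as below. This is a sketch under
assumptions: struct ra_route_req and init_ra_default_route() are
hypothetical names, RTM_F_RA_ROUTER exists only with this patch applied
(hence the fallback define), and the RTA_GATEWAY and RTA_OIF attributes
a real client would append are omitted.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Proposed by this patch; not present in unpatched uapi headers. */
#ifndef RTM_F_RA_ROUTER
#define RTM_F_RA_ROUTER 0x10000
#endif

/* Hypothetical request layout: netlink header followed by the rtmsg. */
struct ra_route_req {
	struct nlmsghdr nlh;
	struct rtmsg    rtm;
};

/*
 * Fill in the fixed part of an RTM_NEWROUTE request for an IPv6 default
 * route learned from an RA. The caller still has to append the gateway
 * and output-interface attributes before sending the message.
 */
static void init_ra_default_route(struct ra_route_req *req)
{
	assert(req != NULL);
	memset(req, 0, sizeof(*req));

	req->nlh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct rtmsg));
	req->nlh.nlmsg_type  = RTM_NEWROUTE;
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;

	req->rtm.rtm_family   = AF_INET6;
	req->rtm.rtm_dst_len  = 0;               /* ::/0, the default route */
	req->rtm.rtm_table    = RT_TABLE_MAIN;
	req->rtm.rtm_protocol = RTPROT_RA;
	req->rtm.rtm_scope    = RT_SCOPE_UNIVERSE;
	req->rtm.rtm_type     = RTN_UNICAST;
	req->rtm.rtm_flags    = RTM_F_RA_ROUTER; /* ask for RA-router handling */
}
```

On a kernel without this patch the extra bit is not acted upon, so a
client would presumably need to detect support before relying on it.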
This is needed by user-level network managers, like systemd-networkd,
that prefer to do the RA processing themselves, i.e. they disable the
kernel's RA processing by setting net.ipv6.conf.<intf>.accept_ra=0.
Without this flag, when there are multiple default routers, the kernel
coalesces multiple default routes into an ECMP route. The ECMP route
ignores per-route REACHABILITY information. If one of the default
routers is unresponsive, with a Neighbor Cache entry of INCOMPLETE, then
it can still be selected as the nexthop for outgoing packets. This
results in an inability to communicate with remote hosts, even though
one of the default routers remains REACHABLE. This violates RFC4861
6.3.6 bullet 1.
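The preference that RFC4861 asks for can be modelled with a small
selection routine. This is an illustrative sketch only, not kernel code:
enum nc_state and select_default_router() are hypothetical, and the
fallback to the first router when none is reachable is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of Neighbor Cache states; NC_NONE means no entry. */
enum nc_state {
	NC_NONE,
	NC_INCOMPLETE,
	NC_REACHABLE,
	NC_STALE,
	NC_DELAY,
	NC_PROBE
};

/* "Reachable or probably reachable": any state other than INCOMPLETE,
 * provided a Neighbor Cache entry exists at all. */
static int probably_reachable(enum nc_state s)
{
	return s != NC_NONE && s != NC_INCOMPLETE;
}

/*
 * Return the index of the first router that is reachable or probably
 * reachable. If none is, fall back to index 0 (a simplification).
 */
static size_t select_default_router(const enum nc_state *routers, size_t n)
{
	assert(routers != NULL && n > 0);
	for (size_t i = 0; i < n; i++) {
		if (probably_reachable(routers[i]))
			return i;
	}
	return 0;
}
```

With routers in states { NC_INCOMPLETE, NC_REACHABLE }, this picks the
second router, whereas an ECMP choice that ignores reachability may
still pick the first.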
Extract from RFC4861 6.3.6 bullet 1:
     1) Routers that are reachable or probably reachable (i.e., in any
        state other than INCOMPLETE) SHOULD be preferred over routers
        whose reachability is unknown or suspect (i.e., in the
        INCOMPLETE state, or for which no Neighbor Cache entry exists).
        Further implementation hints on default router selection when
        multiple equivalent routers are available are discussed in
This fixes the IPv6 Logo conformance test v6LC_2_2_11, and others that
test with multiple default routers. Also see systemd issue #33470:
https://github.com/systemd/systemd/issues/33470.
---
 include/uapi/linux/rtnetlink.h | 1 +
 net/ipv6/route.c               | 3 +++
 2 files changed, 4 insertions(+)
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 3b687d20c9ed..9d80926316b3 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -336,6 +336,7 @@ enum rt_scope_t {
 #define RTM_F_FIB_MATCH	        0x2000	/* return full fib lookup match */
 #define RTM_F_OFFLOAD		0x4000	/* route is offloaded */
 #define RTM_F_TRAP		0x8000	/* route is trapping packets */
+#define RTM_F_RA_ROUTER		0x10000	/* route is a default route from RA */
 #define RTM_F_OFFLOAD_FAILED	0x20000000 /* route offload failed, this value
 					    * is chosen to avoid conflicts with
 					    * other flags defined in
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b4251915585f..5b0c16422720 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5055,6 +5055,9 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (rtm->rtm_flags & RTM_F_CLONED)
 		cfg->fc_flags |= RTF_CACHE;
 
+	if (rtm->rtm_flags & RTM_F_RA_ROUTER)
+		cfg->fc_flags |= RTF_RA_ROUTER;
+
 	cfg->fc_flags |= (rtm->rtm_flags & RTNH_F_ONLINK);
 
 	if (tb[RTA_NH_ID]) {
-- 
2.35.3