netdev - Re: [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes when nexthop link is down

Message-ID: <20150610190402.GS588@gospo.home.greyhouse.net>
Date:	Wed, 10 Jun 2015 15:04:03 -0400
From:	Andy Gospodarek <gospo@...ulusnetworks.com>
To:	Alexander Duyck <alexander.h.duyck@...hat.com>
Cc:	netdev@...r.kernel.org, davem@...emloft.net,
	ddutt@...ulusnetworks.com, sfeldma@...il.com,
	alexander.duyck@...il.com, hannes@...essinduktion.org,
	stephen@...workplumber.org
Subject: Re: [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes
 when nexthop link is down

On Wed, Jun 10, 2015 at 09:17:19AM -0700, Alexander Duyck wrote:
> 
> 
> On 06/09/2015 11:47 PM, Andy Gospodarek wrote:
> >This feature is only enabled with the new per-interface or ipv4 global
> >sysctls called 'ignore_routes_with_linkdown'.
> >
> >net.ipv4.conf.all.ignore_routes_with_linkdown = 0
> >net.ipv4.conf.default.ignore_routes_with_linkdown = 0
> >net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
> >...
> >
> >When the above sysctls are set, the kernel will report to userspace that a route is
> >dead and will no longer resolve to this nexthop when performing a fib
> >lookup.  This will signal to userspace that the route will not be
> >selected.  The signalling of a RTNH_F_DEAD is only passed to userspace
> >if the sysctl is enabled and link is down.  This was done as without it the
> >netlink listeners would have no idea whether or not a nexthop would be
> >selected.   The kernel only sets RTNH_F_DEAD internally if the interface has
> >IFF_UP cleared.
> >
> >With the new sysctl set, the following behavior can be observed
> >(interface p8p1 is link-down):
> >
> ># ip route show
> >default via 10.0.5.2 dev p9p1
> >10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
> >70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
> >80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
> >90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
> >90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
> ># ip route get 90.0.0.1
> >90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
> >     cache
> ># ip route get 80.0.0.1
> >local 80.0.0.1 dev lo  src 80.0.0.1
> >     cache <local>
> ># ip route get 80.0.0.2
> >80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
> >     cache
> >
> >While the route does remain in the table (so it can be modified if
> >needed rather than being wiped away as it would be if IFF_UP was
> >cleared), the proper next-hop is chosen automatically when the link is
> >down.  Now interface p8p1 is linked-up:
> >
> ># ip route show
> >default via 10.0.5.2 dev p9p1
> >10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
> >70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
> >80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
> >90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
> >90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
> >192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
> ># ip route get 90.0.0.1
> >90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
> >     cache
> ># ip route get 80.0.0.1
> >local 80.0.0.1 dev lo  src 80.0.0.1
> >     cache <local>
> ># ip route get 80.0.0.2
> >80.0.0.2 dev p8p1  src 80.0.0.1
> >     cache
> >
> >and the output changes to what one would expect.
> >
> >If the sysctl is not set, the following output would be expected when
> >p8p1 is down:
> >
> ># ip route show
> >default via 10.0.5.2 dev p9p1
> >10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
> >70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
> >80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
> >90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
> >90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
> >
> >Since the dead flag does not appear, there should be no expectation that
> >the kernel would skip using this route due to link being down.
> >
> >v2: Split kernel changes into 2 patches, this actually makes a
> >behavioral change if the sysctl is set.  Also took suggestion from Alex
> >to simplify code by only checking sysctl during fib lookup and
> >suggestion from Scott to add a per-interface sysctl.
> >
> >Signed-off-by: Andy Gospodarek <gospo@...ulusnetworks.com>
> >Signed-off-by: Dinesh Dutt <ddutt@...ulusnetworks.com>
> >---
> >  include/linux/inetdevice.h        |  3 +++
> >  include/net/fib_rules.h           |  3 ++-
> >  include/net/ip_fib.h              | 17 ++++++++++-------
> >  include/uapi/linux/ip.h           |  1 +
> >  include/uapi/linux/sysctl.h       |  1 +
> >  kernel/sysctl_binary.c            |  1 +
> >  net/ipv4/devinet.c                |  2 ++
> >  net/ipv4/fib_frontend.c           |  6 +++---
> >  net/ipv4/fib_rules.c              |  5 +++--
> >  net/ipv4/fib_semantics.c          | 28 ++++++++++++++++++++++------
> >  net/ipv4/fib_trie.c               |  7 +++++++
> >  net/ipv4/netfilter/ipt_rpfilter.c |  2 +-
> >  net/ipv4/route.c                  | 10 +++++-----
> >  13 files changed, 61 insertions(+), 25 deletions(-)
[...]
> >diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> >index d1de1b7..854d790 100644
> >--- a/include/net/ip_fib.h
> >+++ b/include/net/ip_fib.h
> >@@ -266,11 +267,13 @@ static inline int fib_lookup(struct net *net, struct flowi4 *flp,
> >
> >  	for (err = 0; !err; err = -ENETUNREACH) {
> >  		tb = rcu_dereference_rtnl(net->ipv4.fib_main);
> >-		if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
> >+		if (tb && !fib_table_lookup(tb, flp, res,
> >+					    flags | FIB_LOOKUP_NOREF))
> >  			break;
> >
> >  		tb = rcu_dereference_rtnl(net->ipv4.fib_default);
> >-		if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
> >+		if (tb && !fib_table_lookup(tb, flp, res,
> >+					    flags | FIB_LOOKUP_NOREF))
> >  			break;
> >  	}
> >
> 
> Instead of 3 lines w/ flags | FIB_LOOKUP_NOREF you could probably just do a
> flags |= FIB_LOOKUP_NOREF once and save yourself some trouble.
Sure.  But I get credit for fewer lines that way.  ;-)
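
For reference, a minimal userspace sketch of the "flags |= FIB_LOOKUP_NOREF
once" idea.  Everything prefixed fake_ is a hypothetical stand-in for this
sketch and not kernel code, and the flag values are chosen for illustration
only:

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in flag values for this sketch; the kernel defines its own. */
#define FIB_LOOKUP_NOREF            0x1u
#define FIB_LOOKUP_IGNORE_LINKSTATE 0x2u

/* Hypothetical table that records the flags it was last queried with. */
struct fake_table {
	int has_route;           /* non-zero if a lookup should succeed */
	unsigned int seen_flags; /* flags passed by the most recent lookup */
};

/* Follows the fib_table_lookup() convention of returning 0 on success. */
static int fake_table_lookup(struct fake_table *tb, unsigned int flags)
{
	tb->seen_flags = flags;
	return tb->has_route ? 0 : -1;
}

/* Try the main table, then the default one, ORing in NOREF only once. */
static int fake_fib_lookup(struct fake_table *main_tb,
			   struct fake_table *default_tb,
			   unsigned int flags)
{
	flags |= FIB_LOOKUP_NOREF;

	if (main_tb && !fake_table_lookup(main_tb, flags))
		return 0;
	if (default_tb && !fake_table_lookup(default_tb, flags))
		return 0;

	return -ENETUNREACH;
}
```

Each table sees the caller's flags plus NOREF without repeating the OR at
every call site.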

[...]
> >@@ -319,7 +319,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
> >  	fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0;
> >
> >  	net = dev_net(dev);
> >-	if (fib_lookup(net, &fl4, &res))
> >+	if (fib_lookup(net, &fl4, &res, 0))
> >  		goto last_resort;
> >  	if (res.type != RTN_UNICAST &&
> >  	    (res.type != RTN_LOCAL || !IN_DEV_ACCEPT_LOCAL(idev)))
> >@@ -354,7 +354,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
> >  	fl4.flowi4_oif = dev->ifindex;
> >
> >  	ret = 0;
> >-	if (fib_lookup(net, &fl4, &res) == 0) {
> >+	if (fib_lookup(net, &fl4, &res, 0) == 0) {
> >  		if (res.type == RTN_UNICAST)
> >  			ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
> >  	}
> 
> The code for validating a source could probably ignore the LINKDOWN message.
> Otherwise we run the risk of a link flapping and confusing the source since
> the link is down but any Rx packets in the rings are being flushed.
Excellent point.  After thinking about this a bit, I think you are
correct that we would want to consider a dead link or an alive link as a
valid interface for receiving traffic.  Flag added for v3.
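
A rough userspace model of the intended difference between a forwarding
lookup and a source-validation lookup.  It assumes the v3 flag is the
FIB_LOOKUP_IGNORE_LINKSTATE already tested in the fib_trie.c hunk;
struct fake_nh, pick_nexthop() and the flag values are hypothetical and
only serve this sketch:

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag values for this sketch; the kernel defines its own. */
#define RTNH_F_LINKDOWN             16u
#define FIB_LOOKUP_IGNORE_LINKSTATE 2u

/* Hypothetical, stripped-down nexthop: flags plus an output ifindex. */
struct fake_nh {
	unsigned int nh_flags;
	int oif;
};

/*
 * Return the oif of the first usable nexthop, or -1 if none is usable.
 * sysctl_on models ignore_routes_with_linkdown being set.
 */
static int pick_nexthop(const struct fake_nh *nhs, size_t n,
			bool sysctl_on, unsigned int fib_flags)
{
	for (size_t i = 0; i < n; i++) {
		/* Skip link-down nexthops unless the caller opted out. */
		if (sysctl_on && (nhs[i].nh_flags & RTNH_F_LINKDOWN) &&
		    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
			continue;
		return nhs[i].oif;
	}
	return -1;
}
```

With a link-down nexthop listed first, a plain lookup falls through to the
next one, while a lookup passing FIB_LOOKUP_IGNORE_LINKSTATE (as source
validation would) still matches the link-down nexthop.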

[...]
> >@@ -1057,11 +1062,16 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
> >  			goto nla_put_failure;
> >
> >  		for_nexthops(fi) {
> >+			struct in_device *in_dev = __in_dev_get_rcu(nh->nh_dev);
> >  			rtnh = nla_reserve_nohdr(skb, sizeof(*rtnh));
> >  			if (!rtnh)
> >  				goto nla_put_failure;
> >
> >-			rtnh->rtnh_flags = nh->nh_flags & 0xFF;
> >+			if (in_dev && IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> >+			    nh->nh_flags & RTNH_F_LINKDOWN)
> >+				rtnh->rtnh_flags = (nh->nh_flags | RTNH_F_DEAD) & 0xFF;
> >+			else
> >+				rtnh->rtnh_flags = nh->nh_flags & 0xFF;
> >  			rtnh->rtnh_hops = nh->nh_weight - 1;
> >  			rtnh->rtnh_ifindex = nh->nh_oif;
> >
> 
> Why not just split this if into two separate statements? One taking care of
> the first setting of rtnh_flags and then a second one ORing in the
> RTNH_F_DEAD.
If that seems easier to maintain, I can do that for v3.
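
One possible shape of that split, written as a standalone helper so it can
be compiled outside the kernel.  dump_rtnh_flags() and its ignore_linkdown
parameter are hypothetical names; ignore_linkdown stands in for
"in_dev && IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev)", and the flag values
are stand-ins for this sketch:

```c
#include <stdbool.h>

/* Stand-in flag values for this sketch; the kernel defines its own. */
#define RTNH_F_DEAD     1u
#define RTNH_F_LINKDOWN 16u

/* Compute the flags byte reported to userspace for one nexthop. */
static unsigned char dump_rtnh_flags(unsigned int nh_flags,
				     bool ignore_linkdown)
{
	/* First statement: the unconditional assignment. */
	unsigned char flags = nh_flags & 0xFF;

	/* Second statement: OR in DEAD only for the sysctl + linkdown case. */
	if (ignore_linkdown && (nh_flags & RTNH_F_LINKDOWN))
		flags |= RTNH_F_DEAD;

	return flags;
}
```

The result is the same as the if/else in the hunk above, with the mask
applied in a single place.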

[...]
> >diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> >index 3c699c4..f75ca20 100644
> >--- a/net/ipv4/fib_trie.c
> >+++ b/net/ipv4/fib_trie.c
> >@@ -1407,11 +1407,18 @@ found:
> >  		}
> >  		if (fi->fib_flags & RTNH_F_DEAD)
> >  			continue;
> >+
> >  		for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
> >  			const struct fib_nh *nh = &fi->fib_nh[nhsel];
> >+			struct in_device *in_dev = __in_dev_get_rcu(nh->nh_dev);
> >
> >  			if (nh->nh_flags & RTNH_F_DEAD)
> >  				continue;
> >+			if (in_dev &&
> >+			    IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> >+			    nh->nh_flags & RTNH_F_LINKDOWN &&
> >+			    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
> >+				continue;
> >  			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
> >  				continue;
> >
> 
> The order of checks should be:
> 	1.  (nh->nh_flags & RTNH_F_LINKDOWN)
> 	2.  !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE)
This one is not needed, as we will not have this flag set anywhere, but 1,
3, and 4 in that order seems cleaner.

> 	3.  in_dev
> 	4. IGNORE_ROUTES_WITH_LINKDOWN
> 
> That way we don't waste time checking the in_dev if the link isn't reported
> as being down.  Also I would probably move the whole block inside an if
> statement based off of the first 2 checks since nothing else is making use
> of in_dev.
This seems like a nice optimization.  I'll do it here and above outside
the nh loop.
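
A sketch of the reordered check as suggested in the review, with the cheap
flag tests first and the in_dev access only inside the block.  The fake_
structs and skip_nexthop_for_linkdown() are hypothetical stand-ins (the
boolean field models IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN), and the flag
values are for illustration only:

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag values for this sketch; the kernel defines its own. */
#define RTNH_F_LINKDOWN             16u
#define FIB_LOOKUP_IGNORE_LINKSTATE 2u

/* Hypothetical per-interface config holding the sysctl value. */
struct fake_in_device {
	bool ignore_routes_with_linkdown;
};

/* Hypothetical nexthop with a pointer to its (possibly absent) in_dev. */
struct fake_nh {
	unsigned int nh_flags;
	const struct fake_in_device *in_dev;
};

/* Return true if this nexthop should be skipped because its link is down. */
static bool skip_nexthop_for_linkdown(const struct fake_nh *nh,
				      unsigned int fib_flags)
{
	/* Checks 1 and 2: plain flag tests, no in_dev access needed. */
	if ((nh->nh_flags & RTNH_F_LINKDOWN) &&
	    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE)) {
		/* Checks 3 and 4: only reached when the link is down. */
		const struct fake_in_device *in_dev = nh->in_dev;

		if (in_dev && in_dev->ignore_routes_with_linkdown)
			return true;
	}
	return false;
}
```

In the common case of a link that is up, the function returns after the
first flag test and never touches in_dev.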

> 
> >diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c
> >index 4bfaedf..250c633 100644
> >--- a/net/ipv4/netfilter/ipt_rpfilter.c
> >+++ b/net/ipv4/netfilter/ipt_rpfilter.c
> >@@ -40,7 +40,7 @@ static bool rpfilter_lookup_reverse(struct flowi4 *fl4,
> >  	struct net *net = dev_net(dev);
> >  	int ret __maybe_unused;
> >
> >-	if (fib_lookup(net, fl4, &res))
> >+	if (fib_lookup(net, fl4, &res, 0))
> >  		return false;
> >
> >  	if (res.type != RTN_UNICAST) {
> 
> Any rpfilter stuff can probably ignore the linkdown check since it is
> possible that a driver could be flushing data just after a link went down.
Agreed based on thoughts from __fib_validate_source.

Thanks for this review, too.
