[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.LFD.2.11.1604072157230.2154@ja.home.ssi.bg>
Date: Thu, 7 Apr 2016 21:58:55 +0300 (EEST)
From: Julian Anastasov <ja@....bg>
To: David Ahern <dsa@...ulusnetworks.com>
cc: netdev@...r.kernel.org
Subject: Re: [PATCH v6 net-next] net: ipv4: Consider failed nexthops in
multipath routes
Hello,
On Thu, 7 Apr 2016, David Ahern wrote:
> Multipath route lookups should consider knowledge about next hops and not
> select a hop that is known to be failed.
>
> Example:
>
> [h2] [h3] 15.0.0.5
> | |
> 3| 3|
> [SP1] [SP2]--+
> 1 2 1 2
> | | /-------------+ |
> | \ / |
> | X |
> | / \ |
> | / \---------------\ |
> 1 2 1 2
> 12.0.0.2 [TOR1] 3-----------------3 [TOR2] 12.0.0.3
> 4 4
> \ /
> \ /
> \ /
> -------| |-----/
> 1 2
> [TOR3]
> 3|
> |
> [h1] 12.0.0.1
>
> host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
>
> root@h1:~# ip ro ls
> ...
> 12.0.0.0/24 dev swp1 proto kernel scope link src 12.0.0.1
> 15.0.0.0/16
> nexthop via 12.0.0.2 dev swp1 weight 1
> nexthop via 12.0.0.3 dev swp1 weight 1
> ...
>
> If the link between tor3 and tor1 is down and the link between tor1
> and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
> in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
> ssh 15.0.0.5 gets the other. Connections that attempt to use the
> 12.0.0.2 nexthop fail since that neighbor is not reachable:
>
> root@h1:~# ip neigh show
> ...
> 12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
> 12.0.0.2 dev swp1 FAILED
> ...
>
> The failed path can be avoided by considering known neighbor information
> when selecting next hops. If the neighbor lookup fails we have no
> knowledge about the nexthop, so give it a shot. If there is an entry
> then only select the nexthop if the state is sane. This is similar to
> what fib_detect_death does.
>
> To maintain backward compatibility use of the neighbor information is
> based on a new sysctl, fib_multipath_use_neigh.
>
> Signed-off-by: David Ahern <dsa@...ulusnetworks.com>
Reviewed-by: Julian Anastasov <ja@....bg>
> ---
> v6
> - changed __neigh_lookup_noref to __ipv4_neigh_lookup_noref per Dave's
> comment
>
> v5
> - returned comma that got lost in the ether and removed resetting of
> nhsel at end of loop - again comments from Julian
>
> v4
> - remove NULL initializer and logic for fallback per Julian's comment
>
> v3
> - Julian comments: changed use of dead in documentation to failed,
> init state to NUD_REACHABLE which simplifies fib_good_nh, use of
> nh_dev for neighbor lookup, fallback to first entry which is what
> current logic does
>
> v2
> - use rcu locking to avoid refcnts per Eric's suggestion
> - only consider neighbor info for nh_scope == RT_SCOPE_LINK per Julian's
> comment
> - drop the 'state == NUD_REACHABLE' from the state check since it is
> part of NUD_VALID (comment from Julian)
> - wrapped the use of the neigh in a sysctl
>
> Documentation/networking/ip-sysctl.txt | 10 ++++++++++
> include/net/netns/ipv4.h | 3 +++
> net/ipv4/fib_semantics.c | 34 +++++++++++++++++++++++++++++-----
> net/ipv4/sysctl_net_ipv4.c | 11 +++++++++++
> 4 files changed, 53 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index b183e2b606c8..6c7f365b1515 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -63,6 +63,16 @@ fwmark_reflect - BOOLEAN
> fwmark of the packet they are replying to.
> Default: 0
>
> +fib_multipath_use_neigh - BOOLEAN
> + Use status of existing neighbor entry when determining nexthop for
> + multipath routes. If disabled, neighbor information is not used and
> + packets could be directed to a failed nexthop. Only valid for kernels
> + built with CONFIG_IP_ROUTE_MULTIPATH enabled.
> + Default: 0 (disabled)
> + Possible values:
> + 0 - disabled
> + 1 - enabled
> +
> route/max_size - INTEGER
> Maximum number of routes allowed in the kernel. Increase
> this when using large numbers of interfaces and/or routes.
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index a69cde3ce460..d061ffeb1e71 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -133,6 +133,9 @@ struct netns_ipv4 {
> struct fib_rules_ops *mr_rules_ops;
> #endif
> #endif
> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> + int sysctl_fib_multipath_use_neigh;
> +#endif
> atomic_t rt_genid;
> };
> #endif
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index d97268e8ff10..ab64d9f2eef9 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -1559,21 +1559,45 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
> }
>
> #ifdef CONFIG_IP_ROUTE_MULTIPATH
> +static bool fib_good_nh(const struct fib_nh *nh)
> +{
> + int state = NUD_REACHABLE;
> +
> + if (nh->nh_scope == RT_SCOPE_LINK) {
> + struct neighbour *n;
> +
> + rcu_read_lock_bh();
> +
> + n = __ipv4_neigh_lookup_noref(nh->nh_dev, nh->nh_gw);
> + if (n)
> + state = n->nud_state;
> +
> + rcu_read_unlock_bh();
> + }
> +
> + return !!(state & NUD_VALID);
> +}
>
> void fib_select_multipath(struct fib_result *res, int hash)
> {
> struct fib_info *fi = res->fi;
> + struct net *net = fi->fib_net;
> + bool first = false;
>
> for_nexthops(fi) {
> if (hash > atomic_read(&nh->nh_upper_bound))
> continue;
>
> - res->nh_sel = nhsel;
> - return;
> + if (!net->ipv4.sysctl_fib_multipath_use_neigh ||
> + fib_good_nh(nh)) {
> + res->nh_sel = nhsel;
> + return;
> + }
> + if (!first) {
> + res->nh_sel = nhsel;
> + first = true;
> + }
> } endfor_nexthops(fi);
> -
> - /* Race condition: route has just become dead. */
> - res->nh_sel = 0;
> }
> #endif
>
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 1e1fe6086dd9..bb0419582b8d 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -960,6 +960,17 @@ static struct ctl_table ipv4_net_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec,
> },
> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> + {
> + .procname = "fib_multipath_use_neigh",
> + .data = &init_net.ipv4.sysctl_fib_multipath_use_neigh,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &zero,
> + .extra2 = &one,
> + },
> +#endif
> { }
> };
>
> --
> 2.1.4
Regards
--
Julian Anastasov <ja@....bg>
Powered by blists - more mailing lists