[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <56F49CD3.1060406@cumulusnetworks.com>
Date: Thu, 24 Mar 2016 20:05:07 -0600
From: David Ahern <dsa@...ulusnetworks.com>
To: Julian Anastasov <ja@....bg>
Cc: netdev@...r.kernel.org
Subject: Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable
nexthops
On 3/24/16 4:33 PM, Julian Anastasov wrote:
> But for multipath routes we can also consider the
> nexthops as "alternatives", so it depends on how one uses
> the multipath mechanism. The ability to fallback to
> another nexthop assumes one connection is allowed to
> move from one ISP to another. What if the second ISP
> decides to reject the connection? What we have is a
> broken connection just because the retransmits
> were diverted to wrong place in the hurry. So, the
> nexthops can be compatible or incompatible. For your
> setup they are, for others they are not.
I am not sure I completely understand your point. Are you saying that
within a single multipath route some connections to nexthops are allowed
and others are not?
So to put that paragraph into an example
15.0.0.0/16
nexthop via 12.0.0.2 dev swp1 weight 1
nexthop via 12.0.0.3 dev swp1 weight 1
Hosts from 15.0/16 could have TCP connections use 12.0.0.2, but not
12.0.0.3 because 12.0.0.3 could be a different ISP and not allow TCP
connections from this address space?
>
> For multipath setups I recall for the following
> possible cases:
>
> 1. Every packet from connection hits multipath route
> and what we want is the connection to use only one
> nexthop. If nexthop fails then the connection fails.
> Latest changes for the multipath algorithm chose this
> option: use hash to map traffic to nexthop and any
> fallbacks to another nexthop are not allowed because
> we should follow the hash mapping. Works when nexthops
> use different ISPs.
>
> 2. Only the first packet hits multipath route. With
> the help from CONNMARK all next packets from connection
> use the same nexthop by using another unicast route.
> The goal here is the multipath route to be used just to
> balance connections. This works best when the nexthop
> selection was random and not a hash based. This was
> how the previous algorithm worked. Fallbacks are
> desired because it is fine to select alive nexthop
> for the first packet, but wrong for the next packets.
>
> My preference was for the random selection,
> I don't know why we restricted the new algorithm. May be
> because in the common case when just a single default
> multipath route is created we want to use all nexthops
> in a ideal world where all nexthops are alive.
>
> So, if the kernel used a random selection
> your fallback algorithm should help. But it is fragile
> for the simple setup with single default multipath route.
> May be what we miss is the ability to choose between
> random and hash-based selection. Then your patch may be
> useful but only for setup 2 (multipath route hit only by
> first packet). So, your patch may come with a sysctl var
> that explains your current patch logic: "avoid failed nexthops,
> never probe them, wait their failed entry to be expired by
> neigh_periodic_work and just then we can use the nexthop
> by creating new entry to probe the GW". Who will trigger
> probes often enough to maintain fresh state?
First packet out does the probe -- neigh lookup fails, kernel has no
knowledge so can't reject the nexthop based on neighbor information.
After that if it has information that says that a nexthop is dead, why
would it continue to try to probe? Any traffic that selects that nh is
dead. That to me defies the basis of having multiple paths.
>
> Also, one may argue that such decisions should
> be done in user space. It is common the direct routers
> to answer ARP even while their uplink is down. In
> the common case, one may need to ping 2-3 indirect
> gateways to decide if a path is alive and then to recreate
> the default multipath route.
>
> More comments below...
>
>> Signed-off-by: David Ahern <dsa@...ulusnetworks.com>
>> ---
>> net/ipv4/fib_semantics.c | 34 ++++++++++++++++++++++++++++++++--
>> 1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>> index d97268e8ff10..28fc6700c2b1 100644
>> --- a/net/ipv4/fib_semantics.c
>> +++ b/net/ipv4/fib_semantics.c
>> @@ -1563,13 +1563,43 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>> void fib_select_multipath(struct fib_result *res, int hash)
>> {
>> struct fib_info *fi = res->fi;
>> + struct neighbour *n;
>> + int state;
>>
>> for_nexthops(fi) {
>> if (hash > atomic_read(&nh->nh_upper_bound))
>> continue;
>>
>> - res->nh_sel = nhsel;
>> - return;
>> + state = NUD_NONE;
>> + n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
>
> Sometimes nh_gw is a local IP, you can call
> neigh_lookup (or a lockless RCU-BH variant) only for the
> nh_scope == RT_SCOPE_LINK case, just like in fib_select_default.
got it, will change.
>
>> + if (n) {
>> + state = n->nud_state;
>> + neigh_release(n);
>> + }
>> + if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
>
> NUD_REACHABLE is part of NUD_VALID, so this is shorter:
>
> if (!n || (state & NUD_VALID)) {
sure.
>
>> + res->nh_sel = nhsel;
>> + return;
>> + }
>> + } endfor_nexthops(fi);
>> +
>> + /* try the nexthops again, but covering the entries
>> + * skipped by the hash
>> + */
>> + fi = res->fi;
>> + for_nexthops(fi) {
>> + if (hash <= atomic_read(&nh->nh_upper_bound))
>
> This is dangerous, we can try a RTNH_F_DEAD entry
> with nh_upper_bound = -1. This can work with 2 checks:
In v2 of my patch I dropped this change as it technically is not needed
for the case I am trying to solve.
>
> if (hash <= atomic_read(&nh->nh_upper_bound))
> break;
> if (atomic_read(&nh->nh_upper_bound) < 0)
> continue;
>
>> + continue;
>> +
>> + state = NUD_NONE;
>> + n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
>> + if (n) {
>> + state = n->nud_state;
>> + neigh_release(n);
>> + }
>> + if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
>> + res->nh_sel = nhsel;
>> + return;
>> + }
>> } endfor_nexthops(fi);
>
> If all are failed why not use the nexthop that was
> selected by the hash? Even if it is failed, new ARP probe
> can succeed.
sure.
Thanks for the comments.
David
Powered by blists - more mailing lists