Message-ID: <67f3e1cd271f1_38ecd3294ae@willemb.c.googlers.com.notmuch>
Date: Mon, 07 Apr 2025 10:31:41 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Ido Schimmel <idosch@...dia.com>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: netdev@...r.kernel.org,
davem@...emloft.net,
pabeni@...hat.com,
edumazet@...gle.com,
dsahern@...nel.org,
horms@...nel.org,
gnault@...hat.com,
stfomichev@...il.com
Subject: Re: [PATCH net 1/2] ipv6: Start path selection from the first nexthop
Ido Schimmel wrote:
> On Sun, Apr 06, 2025 at 02:30:19PM -0400, Willem de Bruijn wrote:
> > Ido Schimmel wrote:
> > > Hi Willem,
> > >
> > > Thanks for taking a look
> > >
> > > On Fri, Apr 04, 2025 at 10:40:32AM -0400, Willem de Bruijn wrote:
> > > > Ido Schimmel wrote:
> > > > > Cited commit transitioned IPv6 path selection to use hash-threshold
> > > > > instead of modulo-N. With hash-threshold, each nexthop is assigned a
> > > > > region boundary in the multipath hash function's output space and a
> > > > > nexthop is chosen if the calculated hash is smaller than the nexthop's
> > > > > region boundary.
> > > > >
> > > > > Hash-threshold does not work correctly if path selection does not start
> > > > > with the first nexthop. For example, if fib6_select_path() is always
> > > > > passed the last nexthop in the group, then it will always be chosen
> > > > > because its region boundary covers the entire hash function's output
> > > > > space.
> > > > >
> > > > > Fix this by starting the selection process from the first nexthop and do
> > > > > not consider nexthops for which rt6_score_route() provided a negative
> > > > > score.
> > > > >
> > > > > Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N")
> > > > > Reported-by: Stanislav Fomichev <stfomichev@...il.com>
> > > > > Closes: https://lore.kernel.org/netdev/Z9RIyKZDNoka53EO@mini-arch/
> > > > > Signed-off-by: Ido Schimmel <idosch@...dia.com>
> > > > > ---
> > > > > net/ipv6/route.c | 38 +++++++++++++++++++++++++++++++++++---
> > > > > 1 file changed, 35 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > > > > index c3406a0d45bd..864f0002034b 100644
> > > > > --- a/net/ipv6/route.c
> > > > > +++ b/net/ipv6/route.c
> > > > > @@ -412,11 +412,35 @@ static bool rt6_check_expired(const struct rt6_info *rt)
> > > > > return false;
> > > > > }
> > > > >
> > > > > +static struct fib6_info *
> > > > > +rt6_multipath_first_sibling_rcu(const struct fib6_info *rt)
> > > > > +{
> > > > > +	struct fib6_info *iter;
> > > > > +	struct fib6_node *fn;
> > > > > +
> > > > > +	fn = rcu_dereference(rt->fib6_node);
> > > > > +	if (!fn)
> > > > > +		goto out;
> > > > > +	iter = rcu_dereference(fn->leaf);
> > > > > +	if (!iter)
> > > > > +		goto out;
> > > > > +
> > > > > +	while (iter) {
> > > > > +		if (iter->fib6_metric == rt->fib6_metric &&
> > > > > +		    rt6_qualify_for_ecmp(iter))
> > > > > +			return iter;
> > > > > +		iter = rcu_dereference(iter->fib6_next);
> > > > > +	}
> > > > > +
> > > > > +out:
> > > > > +	return NULL;
> > > > > +}
> > > >
> > > > The rcu counterpart to rt6_multipath_first_sibling, which is used when
> > > > computing the ranges in rt6_multipath_rebalance.
> > >
> > > Right
> > >
> > > >
> > > > > +
> > > > > void fib6_select_path(const struct net *net, struct fib6_result *res,
> > > > > struct flowi6 *fl6, int oif, bool have_oif_match,
> > > > > const struct sk_buff *skb, int strict)
> > > > > {
> > > > > - struct fib6_info *match = res->f6i;
> > > > > + struct fib6_info *first, *match = res->f6i;
> > > > > struct fib6_info *sibling;
> > > > >
> > > > > if (!match->nh && (!match->fib6_nsiblings || have_oif_match))
> > > > > @@ -440,10 +464,18 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
> > > > > return;
> > > > > }
> > > > >
> > > > > - if (fl6->mp_hash <= atomic_read(&match->fib6_nh->fib_nh_upper_bound))
> > > > > + first = rt6_multipath_first_sibling_rcu(match);
> > > > > + if (!first)
> > > > > goto out;
> > > > >
> > > > > - list_for_each_entry_rcu(sibling, &match->fib6_siblings,
> > > > > + if (fl6->mp_hash <= atomic_read(&first->fib6_nh->fib_nh_upper_bound) &&
> > > > > + rt6_score_route(first->fib6_nh, first->fib6_flags, oif,
> > > > > + strict) >= 0) {
> > > >
> > > > Does this fix address two issues in one patch: start from the first
> > > > sibling, and check validity of the sibling?
> > >
> > > The loop below will only choose a nexthop ('match = sibling') if its
> > > score is not negative. The purpose of the check here is to do the same
> > > for the first nexthop. That is, only choose a nexthop when calculated
> > > hash is smaller than the nexthop's region boundary and the nexthop has a
> > > non-negative score.
> > >
> > > This was not done before for 'match' because the caller already chose
> > > 'match' based on its score.
> > >
> > > > The behavior on negative score for the first_sibling appears
> > > > different from that on subsequent siblings in the for_each below:
> > > > in that case the loop breaks, while for the first it skips?
> > > >
> > > > 	if (fl6->mp_hash > nh_upper_bound)
> > > > 		continue;
> > > > 	if (rt6_score_route(nh, sibling->fib6_flags, oif, strict) < 0)
> > > > 		break;
> > > > 	match = sibling;
> > > > 	break;
> > > >
> > > > Am I reading that correctly and is that intentional?
> > >
> > > Hmm, I see. I think it makes sense to have the same behavior for all
> > > nexthops. That is, if nexthop fits in terms of hash but has a negative
> > > score, then fall back to 'match'. How about the following diff?
> >
> > That unifies the behavior.
> >
> > Is match guaranteed to be an acceptable path, i.e., having a positive
> > score?
>
> It can be negative (-1) if there isn't a neighbour associated with the
> nexthop which isn't necessarily a bad sign. Even if this is the case,
> it's the nexthop the kernel chose after evaluating the others.
>
> > Else just the first valid sibling after the matching, but invalid,
> > sibling, may be the most robust solution.
>
> AFAICT, the kernel has been falling back to 'match' upon a negative
> sibling score since 2013, so my preference would be to keep this
> behavior.
Good point.
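
For readers following the thread, the behavior agreed on above can be
sketched in plain C outside the kernel. This is only an illustration under
stated assumptions: struct nh_sketch and select_path_sketch() are
hypothetical names, the integer fields stand in for fib_nh_upper_bound and
the result of rt6_score_route(), and no RCU or fib6 data structures are
modeled. It is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a nexthop; not the kernel's struct fib6_nh. */
struct nh_sketch {
	int upper_bound;	/* inclusive upper edge of this nexthop's hash region */
	int score;		/* negative means the nexthop should not be picked */
};

/*
 * Walk the group starting from the first nexthop. The first nexthop whose
 * region contains the hash is picked if its score is non-negative;
 * otherwise the caller's original choice ('match') is kept.
 */
static size_t select_path_sketch(const struct nh_sketch *nhs, size_t n,
				 int hash, size_t match)
{
	size_t i;

	assert(n > 0 && match < n);

	for (i = 0; i < n; i++) {
		if (hash > nhs[i].upper_bound)
			continue;	/* hash lies beyond this region */
		if (nhs[i].score >= 0)
			match = i;	/* region fits and nexthop is usable */
		break;			/* negative score: fall back to 'match' */
	}
	return match;
}
```

With three nexthops bounded at 10, 20 and 30, a hash of 5 selects index 0
regardless of the 'match' passed in, while a hash of 15 keeps 'match' if
the second nexthop has a negative score. Starting the walk at the last
nexthop instead would always select it, because its bound covers the whole
output space, which is the problem described in the commit message.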