[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ee82bed1-7033-406d-4738-cab6d0ec8fb6@gmail.com>
Date: Sat, 22 Jan 2022 09:55:18 -0700
From: David Ahern <dsahern@...il.com>
To: Tomas Hlavacek <tmshlvck@...il.com>,
"David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
David Ahern <dsahern@...nel.org>
Cc: netdev@...r.kernel.org, Stefano Brivio <sbrivio@...hat.com>,
Ido Schimmel <idosch@...sch.org>
Subject: Re: [RFC PATCH] ipv4: fix fnhe dump record multiplication
[ cc Stefano and Ido ]
On 1/20/22 4:50 PM, Tomas Hlavacek wrote:
> Fix the multiplication of the FNHE records during dump: Check in
> fnhe_dump_bucket() that the dumped record destination address falls
> within the key (prefix, prefixlen) of the FIB leaf that is being dumped.
>
> FNHE table records can be dumped multiple times to netlink on RTM_GETROUTE
> command with NLM_F_DUMP flag - either to "ip route show cache" or to any
> routing daemon. The multiplication is substantial under specific
> conditions - it can produce over 120M netlink messages in one dump.
> It happens if there is one shared struct fib_nh linked through
> struct fib_info (->fib_nh) from many leafs in FIB over struct fib_alias.
>
> This situation can be triggered by importing a full BGP table over GRE
> tunnel. In this case there are ~800k routes that translates to ~120k leafs
> in FIB that all ulimately links the same next-hop (the other end of
> the GRE tunnel). The GRE tunnel creates one FNHE record for each
> destination IP that is routed to the tunnel because of PMTU. In my case
> I had around 1000 PMTU records after a few minutes in a lab connected to
> the public internet so the FNHE dump produced 120M records that easily
> stalled BIRD routing daemon as described here:
> http://trubka.network.cz/pipermail/bird-users/2022-January/015897.html
> (There is a work-around already committed to BIRD that removes unnecessary
> dumps of FNHE.)
>
> Signed-off-by: Tomas Hlavacek <tmshlvck@...il.com>
> ---
> include/net/route.h | 3 ++-
> net/ipv4/fib_trie.c | 3 ++-
> net/ipv4/route.c | 25 ++++++++++++++++++++++---
> 3 files changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/include/net/route.h b/include/net/route.h
> index 2e6c0e153e3a..066eab9c5d99 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -244,7 +244,8 @@ void rt_del_uncached_list(struct rtable *rt);
>
> int fib_dump_info_fnhe(struct sk_buff *skb, struct netlink_callback *cb,
> u32 table_id, struct fib_info *fi,
> - int *fa_index, int fa_start, unsigned int flags);
> + int *fa_index, int fa_start, unsigned int flags,
> + __be32 prefix, unsigned char prefixlen);
>
> static inline void ip_rt_put(struct rtable *rt)
> {
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 8060524f4256..7a42db70f46d 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -2313,7 +2313,8 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
>
> if (filter->dump_exceptions) {
> err = fib_dump_info_fnhe(skb, cb, tb->tb_id, fi,
> - &i_fa, s_fa, flags);
> + &i_fa, s_fa, flags, xkey,
> + (KEYLENGTH - fa->fa_slen));
> if (err < 0)
> goto stop;
> }
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 0b4103b1e622..bc882c85228d 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -3049,10 +3049,25 @@ static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
> return -EMSGSIZE;
> }
>
> +static int fnhe_daddr_check(__be32 daddr, struct net *net, u32 table_id,
> + __be32 prefix, unsigned char prefixlen)
> +{
> + struct flowi4 fl4 = { .daddr = daddr };
> + struct fib_table *tb = fib_get_table(net, table_id);
> + struct fib_result res;
> + int err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
I get the fundamental problem you want to solve. I think a fib lookup on
each nexthop exception for each leaf is a heavyweight solution this is
going to add up to significant overhead.
The fundamental problem you are trying to solve is to not walk the
exceptions for a fib_info more than once. A fib_info can be used with
many fib_entries so perhaps the solution is to walk the fib_info structs
that exist in fib_info_hash outside of the trie walk.
> +
> + if (!err && res.prefix == prefix && res.prefixlen == prefixlen)
> + return 1;
> +
> + return 0;
> +}
> +
> static int fnhe_dump_bucket(struct net *net, struct sk_buff *skb,
> struct netlink_callback *cb, u32 table_id,
> struct fnhe_hash_bucket *bucket, int genid,
> - int *fa_index, int fa_start, unsigned int flags)
> + int *fa_index, int fa_start, unsigned int flags,
> + __be32 prefix, unsigned char prefixlen)
> {
> int i;
>
> @@ -3067,6 +3082,9 @@ static int fnhe_dump_bucket(struct net *net, struct sk_buff *skb,
> if (*fa_index < fa_start)
> goto next;
>
> + if (!fnhe_daddr_check(fnhe->fnhe_daddr, net, table_id, prefix, prefixlen))
> + goto next;
> +
> if (fnhe->fnhe_genid != genid)
> goto next;
>
> @@ -3096,7 +3114,8 @@ static int fnhe_dump_bucket(struct net *net, struct sk_buff *skb,
>
> int fib_dump_info_fnhe(struct sk_buff *skb, struct netlink_callback *cb,
> u32 table_id, struct fib_info *fi,
> - int *fa_index, int fa_start, unsigned int flags)
> + int *fa_index, int fa_start, unsigned int flags,
> + __be32 prefix, unsigned char prefixlen)
> {
> struct net *net = sock_net(cb->skb->sk);
> int nhsel, genid = fnhe_genid(net);
> @@ -3115,7 +3134,7 @@ int fib_dump_info_fnhe(struct sk_buff *skb, struct netlink_callback *cb,
> if (bucket)
> err = fnhe_dump_bucket(net, skb, cb, table_id, bucket,
> genid, fa_index, fa_start,
> - flags);
> + flags, prefix, prefixlen);
> rcu_read_unlock();
> if (err)
> return err;
Powered by blists - more mailing lists