netdev - Re: [PATCH net 2/3] net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YZp/MvIX/YCHJY9K@shredder>
Date:   Sun, 21 Nov 2021 19:17:38 +0200
From:   Ido Schimmel <idosch@...sch.org>
To:     Nikolay Aleksandrov <razor@...ckwall.org>
Cc:     netdev@...r.kernel.org, davem@...emloft.net, kuba@...nel.org,
        dsahern@...il.com, Nikolay Aleksandrov <nikolay@...dia.com>
Subject: Re: [PATCH net 2/3] net: nexthop: release IPv6 per-cpu dsts when
 replacing a nexthop group

On Sun, Nov 21, 2021 at 05:24:52PM +0200, Nikolay Aleksandrov wrote:
> From: Nikolay Aleksandrov <nikolay@...dia.com>
> 
> When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
> the removed nexthop entries after an RCU grace period because they
> contain references to the nexthop's net device and to the fib6 info.
> With specific series of events[1] we can reach net device refcount
> imbalance which is unrecoverable.
> 
> [1]
>  $ ip nexthop list
>   id 200 via 2002:db8::2 dev bridge.10 scope link onlink
>   id 201 via 2002:db8::3 dev bridge scope link onlink
>   id 203 group 201/200
>  $ ip -6 route
>   2001:db8::10 nhid 203 metric 1024 pref medium
>      nexthop via 2002:db8::3 dev bridge weight 1 onlink
>      nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
> 
> Create rt6_info through one of the multipath legs, e.g.:
>  $ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
>  (pkt_inj is just a custom packet generator, nothing special)
> 
> Then remove that leg from the group by replace (let's assume it is id
> 200 in this case):
>  $ ip nexthop replace id 203 group 201
> 
> Now remove the IPv6 route:
>  $ ip -6 route del 2001:db8::10/128
> 
> The route won't be really deleted due to the stale rt6_info holding 1
> refcnt in nexthop id 200.
> At this point we have the following reference count dependency:
>  (deleted) IPv6 route holds 1 reference over nhid 203
>  nh 203 holds 1 ref over id 201
>  nh 200 holds 1 ref over the net device and the route due to the stale
>  rt6_info
> 
> Now to create circular dependency between nh 200 and the IPv6 route, and
> also to get a reference over nh 200, restore nhid 200 in the group:
>  $ ip nexthop replace id 203 group 201/200
> 
> And now we have a permanent circular dependncy because nhid 203 holds a
> reference over nh 200 and 201, but the route holds a ref over nh 203 and
> is deleted.
> 
> To trigger the bug just delete the group (nhid 203):
>  $ ip nexthop del id 203
> 
> It won't really be deleted due to the IPv6 route dependency, and now we
> have 2 unlinked and deleted objects that reference each other: the group
> and the IPv6 route. Since the group drops the reference it holds over its
> entries at free time (i.e. its own refcount needs to drop to 0) that will
> never happen and we get a permanent ref on them, since one of the entries
> holds a reference over the IPv6 route it will also never be released.
> 
> At this point the dependencies are:
>  (deleted, only unlinked) IPv6 route holds reference over group nh 203
>  (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
>  nh 200 holds 1 ref over the net device and the route due to the stale
>  rt6_info
> 
> This is the last point where it can be fixed by running traffic through
> nh 200, and specifically through the same CPU so the rt6_info (dst) will
> get released due to the IPv6 genid, that in turn will free the IPv6
> route, which in turn will free the ref count over the group nh 203.
> 
> If nh 200 is deleted at this point, it will never be released due to the
> ref from the unlinked group 203, it will only be unlinked:
>  $ ip nexthop del id 200
>  $ ip nexthop
>  $
> 
> Now we can never release that stale rt6_info, we have IPv6 route with ref
> over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
> rt6_info (dst) with ref over the net device and the IPv6 route. All of
> these objects are only unlinked, and cannot be released, thus they can't
> release their ref counts.
> 
>  Message from syslogd@dev at Nov 19 14:04:10 ...
>   kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
>  Message from syslogd@dev at Nov 19 14:04:20 ...
>   kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
> 
> Fixes: 7bf4796dd099 ("nexthops: add support for replace")
> Signed-off-by: Nikolay Aleksandrov <nikolay@...dia.com>
> ---
>  net/ipv4/nexthop.c | 25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> index 9e8100728d46..a69a9e76f99f 100644
> --- a/net/ipv4/nexthop.c
> +++ b/net/ipv4/nexthop.c
> @@ -1899,15 +1899,36 @@ static void remove_nexthop(struct net *net, struct nexthop *nh,
>  /* if any FIB entries reference this nexthop, any dst entries
>   * need to be regenerated
>   */
> -static void nh_rt_cache_flush(struct net *net, struct nexthop *nh)
> +static void nh_rt_cache_flush(struct net *net, struct nexthop *nh,
> +			      struct nexthop *replaced_nh)
>  {
>  	struct fib6_info *f6i;
> +	struct nh_group *nhg;
> +	int i;
>  
>  	if (!list_empty(&nh->fi_list))
>  		rt_cache_flush(net);
>  
>  	list_for_each_entry(f6i, &nh->f6i_list, nh_list)
>  		ipv6_stub->fib6_update_sernum(net, f6i);
> +
> +	/* if an IPv6 group was replaced, we have to release all old
> +	 * dsts to make sure all refcounts are released
> +	 */

This problem is specific to IPv6 because IPv4 dst entries do not hold
references on routes / FIB info thereby avoiding the circular dependency
described in the commit message?

> +	if (!replaced_nh->is_group)
> +		return;

Does it also make sense to skip the part below if we don't have any IPv6
routes using the nexthop?

> +
> +	/* new dsts must use only the new nexthop group */
> +	synchronize_net();
> +
> +	nhg = rtnl_dereference(replaced_nh->nh_grp);

In replace_nexthop_grp() that precedes this function we assign the new
nh_group to the nexthop shell used by the routes:

rcu_assign_pointer(old->nh_grp, newg);

And the old one that you want to purge is stored in
'replaced_nh->nh_grp':

rcu_assign_pointer(new->nh_grp, oldg);

The need for synchronize_net() above stems from the fact that some CPUs
might still be using 'oldg' via 'old->nh_grp'?

If so, we already have one synchronize_net() in replace_nexthop_grp()
for resilient groups. See:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=563f23b002534176f49524b5ca0e1d94d8906c40

Can we avoid two synchronize_net() per resilient group by removing the
one added here and instead do:

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index a69a9e76f99f..a47ce43ab1ff 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -2002,9 +2002,10 @@ static int replace_nexthop_grp(struct net *net, struct nexthop *old,
 
        rcu_assign_pointer(old->nh_grp, newg);
 
+       /* Make sure concurrent readers are not using 'oldg' anymore. */
+       synchronize_net();
+
        if (newg->resilient) {
-               /* Make sure concurrent readers are not using 'oldg' anymore. */
-               synchronize_net();
                rcu_assign_pointer(oldg->res_table, tmp_table);
                rcu_assign_pointer(oldg->spare->res_table, tmp_table);
        }

> +	for (i = 0; i < nhg->num_nh; i++) {
> +		struct nh_grp_entry *nhge = &nhg->nh_entries[i];
> +		struct nh_info *nhi = rtnl_dereference(nhge->nh->nh_info);
> +
> +		if (nhi->family == AF_INET6)
> +			ipv6_stub->fib6_nh_release_dsts(&nhi->fib6_nh);
> +	}
>  }
>  
>  static int replace_nexthop_grp(struct net *net, struct nexthop *old,
> @@ -2247,7 +2268,7 @@ static int replace_nexthop(struct net *net, struct nexthop *old,
>  		err = replace_nexthop_single(net, old, new, extack);
>  
>  	if (!err) {
> -		nh_rt_cache_flush(net, old);
> +		nh_rt_cache_flush(net, old, new);
>  
>  		__remove_nexthop(net, new, NULL);
>  		nexthop_put(new);
> -- 
> 2.31.1
>