[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iJQDWEyTC5Xc77PEXyxbbvKjm=exb5jKB0-O3ZzZ=W1Hg@mail.gmail.com>
Date: Mon, 30 Sep 2024 20:27:50 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Jiri Wiesner <jwiesner@...e.de>
Cc: netdev@...r.kernel.org, David Ahern <dsahern@...nel.org>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
"David S. Miller" <davem@...emloft.net>
Subject: Re: [RFC PATCH] ipv6: route: release reference of dsts cached in sockets
On Mon, Sep 30, 2024 at 8:09 PM Jiri Wiesner <jwiesner@...e.de> wrote:
>
> An unbalanced refcount was reported for the loopback device of a net
> namespace being destroyed:
> unregister_netdevice: waiting for lo to become free. Usage count = 2
>
> Analysis revealed that the IPv6 net device corresponding to the loopback
> device did not get destroyed (in6_dev_finish_destroy() was not called).
> The IPv6 loopback device had an unbalaced refcount:
> net dev 73da100 lo refcount 1
> Operation Count Balancing Operation Count
> hold __ipv6_dev_mc_inc 2 ma_put 2
> addrconf_ifdown 1 0 unbalanced
> hold fib6_nh_init 2 fib6_nh_init 2
> put inet6_ifa_finish_destroy 1 ipv6_add_addr 1
> ip6_dst_destroy 90 0 unbalanced
> ip6_dst_ifdown 90 0 unbalanced
> hold ip6_route_dev_notify 6 ip6_route_dev_notify 6
> hold ipv6_add_addr 1 inet6_ifa_finish_destroy 1
> put ma_put 2 __ipv6_dev_mc_inc 2
> hold ndisc_netdev_event 2 ndisc_netdev_event 2
> rt6_disable_ip 1 0 unbalanced
>
> The refcount of addrconf_ifdown() balances the refcount increment in
> ipv6_add_dev(), which had no corresponding trace entry. The
> rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the
> looback device, and the ip6_dst_destroy() entries were put operations. One
> refcount decrement in ip6_dst_destroy() was not executed. At this point, a
> hash was implemented in the debug kernel to hold the changes of the
> refcount of dst objects per namespace. The trace for the dst object that
> did not decrement the IPv6 refcount of loopback follows:
>
> Function Parent Op Net Device Dst Refcount Diff
> ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1
> ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84
> dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1
> dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20
> dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29
> dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34
> dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1
>
> The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely
> indicates that the net device of the dst object was changed to
> blackhole_dev and the IPv6 refcount was transferred onto the loopback
> device. There was no ip6_dst_destroy() trace entry, which means the dst
> object was not destroyed. There were 86 hold operations but only 85 put
> operations so the dst object was not destroyed because the refcount of the
> dst object was unbalanced.
>
> The problem is that the refcount sums are ambiguous. The most probable
> explanation is this: The dst object was a route for an IPv6 TCP connection
> that kept timing out. Sometimes, the process closed the socket, which
> corresponds to the refcount decrements of the
> dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit
> timer reset the dst of the sockets, which corresponds to the
> dst_release()/tcp_retransmit_timer() entries. I am unsure about the
> dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets
> skb->_skb_refdst with SKB_DST_NOREF.
>
> The feature that sets the above trace apart from all the other dst traces
> is the execution of ip6_negative_advice() for a cached and also expired
> dst object in the exception table. The cached and expired dst object has
> its refcount set to at least 2 before executing rt6_remove_exception_rt()
> found in ip6_negative_advice(). One decrement happens in
> rt6_remove_exception() after the dst object has been removed from the
> exception table. The other decrement happens in sk_dst_reset() but that
> one is counteracted by a dst_hold() intentionally placed just before the
> sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that
> keeps a reference to a dst in its sk_dst_cache member increments the
> refcount of the dst by 1. This is apparent in the following code paths:
> * When ip6_route_output_flags() finds a dst that is then stored in
> sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket().
> * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache
> Provided the dst is not kept in the sk_dst_cache member of another socket,
> there is no other object tied to the dst (the socket lost its reference
> and the dst is no longer in the exception table) and the dst becomes a
> leaked object after ip6_negative_advice() finishes. This leak then
> precludes the net namespace from being destroyed.
>
> The patch that introduced the dst_hold() in ip6_negative_advice() was
> 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22
> only refactored the code with regards to the dst refcount so the issue was
> present even before 92f1655aa2b22.
>
> Signed-off-by: Jiri Wiesner <jwiesner@...e.de>
> ---
> At the moment, I am sending this as an RFC because I am not able to
> reproduce the issue in-house. The customer that encountered the issue is
> currently running tests. For the customer's testing, I fixed the issue
> with a kprobe module that calls dst_release() right after
> rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not
> quite the same as the change proposed below.
>
> net/ipv6/route.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index b4251915585f..b70267c8d251 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk,
> if (rt->rt6i_flags & RTF_CACHE) {
> rcu_read_lock();
> if (rt6_check_expired(rt)) {
> - /* counteract the dst_release() in sk_dst_reset() */
> - dst_hold(dst);
> sk_dst_reset(sk);
> -
> rt6_remove_exception_rt(rt);
> }
> rcu_read_unlock();
> --
Interesting, what kernel version is your customer using ?
I think that with recent kernels (after 5.18), we do not see issues
because of the use of blackhole_netdev
instead of loopback.
Powered by blists - more mailing lists