lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240930180916.GA24637@incl>
Date: Mon, 30 Sep 2024 20:09:16 +0200
From: Jiri Wiesner <jwiesner@...e.de>
To: netdev@...r.kernel.org
Cc: Eric Dumazet <edumazet@...gle.com>, David Ahern <dsahern@...nel.org>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	"David S. Miller" <davem@...emloft.net>
Subject: [RFC PATCH] ipv6: route: release reference of dsts cached in sockets

An unbalanced refcount was reported for the loopback device of a net
namespace being destroyed:
unregister_netdevice: waiting for lo to become free. Usage count = 2

Analysis revealed that the IPv6 net device corresponding to the loopback
device did not get destroyed (in6_dev_finish_destroy() was not called).
The IPv6 loopback device had an unbalaced refcount:
net dev 73da100 lo refcount 1
Operation                     Count Balancing Operation     Count
hold  __ipv6_dev_mc_inc           2 ma_put                      2
      addrconf_ifdown             1                             0 unbalanced
hold  fib6_nh_init                2 fib6_nh_init                2
put   inet6_ifa_finish_destroy    1 ipv6_add_addr               1
      ip6_dst_destroy            90                             0 unbalanced
      ip6_dst_ifdown             90                             0 unbalanced
hold  ip6_route_dev_notify        6 ip6_route_dev_notify        6
hold  ipv6_add_addr               1 inet6_ifa_finish_destroy    1
put   ma_put                      2 __ipv6_dev_mc_inc           2
hold  ndisc_netdev_event          2 ndisc_netdev_event          2
      rt6_disable_ip              1                             0 unbalanced

The refcount of addrconf_ifdown() balances the refcount increment in
ipv6_add_dev(), which had no corresponding trace entry. The
rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the
looback device, and the ip6_dst_destroy() entries were put operations. One
refcount decrement in ip6_dst_destroy() was not executed. At this point, a
hash was implemented in the debug kernel to hold the changes of the
refcount of dst objects per namespace. The trace for the dst object that
did not decrement the IPv6 refcount of loopback follows:

Function        Parent       Op  Net            Device Dst              Refcount Diff
ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1
ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84
dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1
dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20
dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29
dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34
dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1

The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely
indicates that the net device of the dst object was changed to
blackhole_dev and the IPv6 refcount was transferred onto the loopback
device. There was no ip6_dst_destroy() trace entry, which means the dst
object was not destroyed. There were 86 hold operations but only 85 put
operations so the dst object was not destroyed because the refcount of the
dst object was unbalanced.

The problem is that the refcount sums are ambiguous. The most probable
explanation is this: The dst object was a route for an IPv6 TCP connection
that kept timing out. Sometimes, the process closed the socket, which
corresponds to the refcount decrements of the
dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit
timer reset the dst of the sockets, which corresponds to the
dst_release()/tcp_retransmit_timer() entries. I am unsure about the
dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets
skb->_skb_refdst with SKB_DST_NOREF.

The feature that sets the above trace apart from all the other dst traces
is the execution of ip6_negative_advice() for a cached and also expired
dst object in the exception table. The cached and expired dst object has
its refcount set to at least 2 before executing rt6_remove_exception_rt()
found in ip6_negative_advice(). One decrement happens in
rt6_remove_exception() after the dst object has been removed from the
exception table. The other decrement happens in sk_dst_reset() but that
one is counteracted by a dst_hold() intentionally placed just before the
sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that
keeps a reference to a dst in its sk_dst_cache member increments the
refcount of the dst by 1. This is apparent in the following code paths:
* When ip6_route_output_flags() finds a dst that is then stored in
  sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket().
* When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache
Provided the dst is not kept in the sk_dst_cache member of another socket,
there is no other object tied to the dst (the socket lost its reference
and the dst is no longer in the exception table) and the dst becomes a
leaked object after ip6_negative_advice() finishes. This leak then
precludes the net namespace from being destroyed.

The patch that introduced the dst_hold() in ip6_negative_advice() was
92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22
only refactored the code with regards to the dst refcount so the issue was
present even before 92f1655aa2b22.

Signed-off-by: Jiri Wiesner <jwiesner@...e.de>
---
At the moment, I am sending this as an RFC because I am not able to 
reproduce the issue in-house. The customer that encountered the issue is 
currently running tests. For the customer's testing, I fixed the issue 
with a kprobe module that calls dst_release() right after 
rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not 
quite the same as the change proposed below.

 net/ipv6/route.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b4251915585f..b70267c8d251 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk,
 	if (rt->rt6i_flags & RTF_CACHE) {
 		rcu_read_lock();
 		if (rt6_check_expired(rt)) {
-			/* counteract the dst_release() in sk_dst_reset() */
-			dst_hold(dst);
 			sk_dst_reset(sk);
-
 			rt6_remove_exception_rt(rt);
 		}
 		rcu_read_unlock();
-- 
2.35.3


-- 
Jiri Wiesner
SUSE Labs

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ