lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241112085655.GA19776@incl>
Date: Tue, 12 Nov 2024 09:56:55 +0100
From: Jiri Wiesner <jwiesner@...e.de>
To: Xin Long <lucien.xin@...il.com>
Cc: Eric Dumazet <edumazet@...gle.com>, netdev@...r.kernel.org,
	David Ahern <dsahern@...nel.org>, Jakub Kicinski <kuba@...nel.org>,
	Paolo Abeni <pabeni@...hat.com>,
	"David S. Miller" <davem@...emloft.net>
Subject: Re: [RFC PATCH] ipv6: route: release reference of dsts cached in
 sockets

On Sun, Oct 06, 2024 at 02:25:25PM -0400, Xin Long wrote:
> We recently also encountered this
>
>   'unregister_netdevice: waiting for lo to become free. Usage count = X'
>
> problem on our customer env after backporting
>
>   Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1]
>
> The commit looks correct to me, so I guess it may uncover some existing
> issues.
>
> As it took a very long time to get reproduced on our customer env, which
> made it impossible to debug. Also the issue existed even after
> disabling IPv6.
>
> It seems much easier to reproduce it on your customer env. So I'm wondering
>
> - Was the testing on your customer env related to IPv6 ?
> - Does the issue still exist after reverting the commit [1] ?

The customer tried reproducing the issue with 92f1655aa2b22 ("net: fix __dst_negative_advice() race") reverted and the issue appeared again. My next step was capturing more points within the stacktraces when dst refcounts are changed (I do not have full stack traces - the trace entries I store in the hash contain only the instruction pointer and the parent function). This is the trace for the leaked dst object:
> dst ff1c4157413bf900
> alloc: 1 destroy:  obj diff: 1
> hold ops: 5 put ops: 4 refcnt diff: 1
> Function                            Parent                           Op  Net            Device Dst              Number of Calls
> inet6_csk_route_socket+0x1c2/0x2d0: inet6_csk_update_pmtu+0x58/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> __ip6_rt_update_pmtu+0x183/0x3c0: inet6_csk_update_pmtu+0x4b/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> ip6_dst_lookup_flow+0x4f/0x1d0: inet6_csk_route_socket+0x198/0x2d0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> ip6_dst_ifdown+0x5/0x250: dst_dev_put+0x2d/0xd0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> tcp_v6_connect+0x37e/0x790: __inet_stream_connect+0x2c8/0x3a0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> ip6_dst_lookup_flow+0x4f/0x1d0: tcp_v6_connect+0x320/0x790: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> __dev_queue_xmit+0x1af/0xd20: ip6_finish_output2+0x1f1/0x6e0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675
> inet6_csk_xmit+0xa1/0x150: __tcp_transmit_skb+0x5f8/0xd40: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675
> Function                            Parent                           Op      Net            Device Dst              Refcount Diff
> ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x215/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> ip6_negative_advice+0x66/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> dst_alloc+0x5e/0x180: ip6_dst_alloc+0x27/0x60: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x10f/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> dst_release+0x32/0x140: ip6_negative_advice+0x137/0x2d0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> dst_release+0x32/0x140: inet_sock_destruct+0x146/0x1c0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> dst_release+0x32/0x140: rt6_remove_exception.part.53+0x7f/0xe0: dst_put ffffffff90ee2a80 blackhole_dev ff1c4157413bf900 1
> Function                 Parent                      Op     Net              Dst              Socket           Number of Calls
> sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1
> sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418e49bac280 1
> sk_setup_caps+0x5/0x200: inet6_csk_route_socket+0x23c/0x2d0: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418ea04f7200 1

The refcount of the dst was changed only when a dst_hold or dst_put operation was logged. The hold in dst_alloc() is balanced by the dst_release() in rt6_remove_exception(). The three holds in ip6_route_output_flags() belonged to the references held by sockets which were balanced by the dst_release() in inet_sock_destruct(). But that leaves one hold operation outstanding and unbalanced. There is a put operation for expired cached dst objects in ip6_negative_advice() but that one is intentionally balanced by a hold in the same function.

Accoding to the sk_dst trace entries, three sockets held a reference to the leaked dst. All of these sockets were eventually distroyed so there was no socket leak causing the dst leak, which rules out one of the possbile hypotheses. The dst was stored in the socket object while a connect() sycall was being executed for two of the sockets. An IMCPv6 packet carrying a Packet Too Big message was received by the third socket, which lead to the execution of routines changing the MTU for the dst and storing a reference to the dst in the socket object. As for races, sockets are locked in inet_stream_connect, tcp_write_timer() and tcp_v6_err(), and the mutual exclusion prevents races when reference to dst objects are set and reset for sockets. The put operation executed from dst_release() in inet_sock_destruct() is not part of a critical section protected by mutual exclusion but it is guaranteed to run only after the refcount of the socket has reached zero. All the other code paths in the trace increment the refcount of the socket while they manipulate dsts (the connect() syscall uses the refcount incremented by the socket() syscall).

This is a trace for the socket for which ip6_negative_advice() was executed for an expired cached dst:
> socket ff1c4158c07aaf80
> alloc: 1 destroy: 1 obj diff: 0
> hold ops: 9 put ops: 9 refcnt diff: 0
> ip6_negative_advice+0xcd/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> tcp_retransmit_timer+0x50d/0xac0: tcp_write_timer_handler+0xba/0x1f0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 2
> ip6_negative_advice+0x169/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 3
> sk_setup_caps+0x5/0x200: inet6_sk_rebuild_header+0x1f9/0x2a0: sk_dst ff1c4151bcac1080 ff1c4153a762d500 ff1c4158c07aaf80 5
> sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1
> sk_free+0x5/0x100: tcp_close+0x100/0x120: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> __sk_free+0x5/0x190: sk_free+0xe3/0x100: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> sk_destruct+0x5/0x150: __sk_free+0x83/0x190: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> sk_alloc+0x195/0x2a0: inet6_create+0xd5/0x450: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> sk_free+0x5/0x100: ip6_rcv_core.isra.25+0x27c/0x440: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 7
> sk_reset_timer+0x7b/0x130: tcp_connect+0x83f/0xe00: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> __tcp_close+0x15d/0x4c0: tcp_close+0x35/0x120: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> sock_init_data_uid+0x107/0x2d0: inet6_create+0xec/0x450: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> sk_reset_timer+0x7b/0x130: tcp_retransmit_timer+0x4e3/0xac0: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 6
> tcp_close+0x5/0x120: inet_release+0x3c/0x80: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> inet_csk_destroy_sock+0x90/0x1e0: __tcp_close+0x2ca/0x4c0: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> tcp_write_timer+0x61/0x260: call_timer_fn+0x27/0x130: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 7

Besides dst ff1c4157413bf900, this sockets held a reference to another dst - ff1c4153a762d500. Regardless of the order in which the dst references were held, the extra dst_hold() in ip6_negative_advice() would preserve the refcount increment owned by the socket while also resetting the sk_dst_cache member of the socket to NULL, rendering it impossible for the socket to decrement the refcount in the future. We see that the socket was destroyed but the sk_dst_cache member was most probably NULL at the time because all the dst references were reset in tcp_retransmit_timer() and ip6_negative_advice().

Based on this data, I went back to recheck the kprobe module that was meant to implement a dst_release() in ip6_negative_advice(). I found a mistake in the module that resulted in executing:
> static void ip6_negative_advice(struct sock *sk, struct dst_entry *dst)
> {
>         struct rt6_info *rt = (struct rt6_info *) dst;
>         if (rt->rt6i_flags & RTF_CACHE) {
>                 rcu_read_lock();
>                 if (rt6_check_expired(rt)) {
>                         /* counteract the dst_release() in sk_dst_reset() */
>                         dst_hold(dst);
>                         sk_dst_reset(sk);
>                         rt6_remove_exception_rt(rt);
>                 }
> --->		  kprobe_calls_dst_release;
>                 rcu_read_unlock();
>                 return;
>         }
>         sk_dst_reset(sk);
> }
instead of this:
> static void ip6_negative_advice(struct sock *sk, struct dst_entry *dst)
> {
>         struct rt6_info *rt = (struct rt6_info *) dst;
>         if (rt->rt6i_flags & RTF_CACHE) {
>                 rcu_read_lock();
>                 if (rt6_check_expired(rt)) {
>                         /* counteract the dst_release() in sk_dst_reset() */
>                         dst_hold(dst);
> --->			  kprobe_calls_dst_release;
>                         sk_dst_reset(sk);
>                         rt6_remove_exception_rt(rt);
>                 }
>                 rcu_read_unlock();
>                 return;
>         }
>         sk_dst_reset(sk);
> }
which clearly shows why the test with the module resulted in negative overflow of dst refcounts. I fixed the module and further testing showed the issue is no longer reproducible.

I have managed to put together a minimal set of steps needed to reproduce the issue:
ip link add veth1 mtu 65535 type veth peer veth0 mtu 65535
ip netns add ns0
ip link set veth1 netns ns0
ip addr add fd00::1/24 dev veth0
ip -n ns0 addr add fd00::2/24 dev veth1
ip link set up dev veth0
ip -n ns0 link set up dev lo
ip -n ns0 link set up dev veth1
ip -n ns0 route add default via fd00::1 dev veth1

ip link add veth3 mtu 65535 type veth peer veth2 mtu 65535
ip netns add ns2
ip link set veth3 netns ns2
ip addr add fd02::1/24 dev veth2
ip -n ns2 addr add fd02::2/24 dev veth3
ip link set up dev veth2
ip -n ns2 link set up dev lo
ip -n ns2 link set up dev veth3
ip -n ns2 route add default via fd02::1 dev veth3

ip netns exec ns0 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires"
#ip netns exec ns2 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires"
ip netns exec ns0 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval"
#ip netns exec ns2 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval"
sleep 30

ip6tables -F
ip6tables -A FORWARD -i veth0 -d fd02::/24 -j ACCEPT
ip6tables -A FORWARD -i veth2 -d fd00::/24 -j ACCEPT
echo 1 > /proc/sys/net/ipv6/conf/all/forwarding
ip6tables -L -v

(ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$!
sleep 1
dd if=/dev/zero bs=1M count=100 | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$!
sleep 1
kill $clnt $serv
wait $serv

(ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$!
sleep 1
dd if=/dev/zero bs=1M | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$!
sleep 1
ip link set veth2 mtu 2000
sleep 1
ip6tables -D FORWARD -i veth2 -d fd00::/24 -j ACCEPT
ip6tables -A FORWARD -i veth2 -d fd00::/24 -j DROP

sleep 10
kill $clnt $serv
wait $serv

ip -n ns0 link set down dev lo
ip -n ns0 link set down dev veth1
ip -n ns0 link delete dev veth1
ip netns delete ns0

ip -n ns2 link set down dev lo
ip -n ns2 link set down dev veth3
ip -n ns2 link delete dev veth3
ip netns delete ns2

I was able to reproduce the dst leak under 6.12-rc7. I will submit a fix shortly.
-- 
Jiri Wiesner
SUSE Labs

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ