netdev - Re: [RFC PATCH] ipv6: route: release reference of dsts cached in sockets

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADvbK_dz9ewsEmGa63DgMOwRwFyE-evALq61CUYi54-K_WTvog@mail.gmail.com>
Date: Tue, 12 Nov 2024 16:51:27 -0500
From: Xin Long <lucien.xin@...il.com>
To: Jiri Wiesner <jwiesner@...e.de>
Cc: Eric Dumazet <edumazet@...gle.com>, netdev@...r.kernel.org, 
	David Ahern <dsahern@...nel.org>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	"David S. Miller" <davem@...emloft.net>
Subject: Re: [RFC PATCH] ipv6: route: release reference of dsts cached in sockets

On Tue, Nov 12, 2024 at 3:57 AM Jiri Wiesner <jwiesner@...e.de> wrote:
>
> On Sun, Oct 06, 2024 at 02:25:25PM -0400, Xin Long wrote:
> > We recently also encountered this
> >
> >   'unregister_netdevice: waiting for lo to become free. Usage count = X'
> >
> > problem on our customer env after backporting
> >
> >   Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1]
> >
> > The commit looks correct to me, so I guess it may uncover some existing
> > issues.
> >
> > As it took a very long time to get reproduced on our customer env, which
> > made it impossible to debug. Also the issue existed even after
> > disabling IPv6.
> >
> > It seems much easier to reproduce it on your customer env. So I'm wondering
> >
> > - Was the testing on your customer env related to IPv6 ?
> > - Does the issue still exist after reverting the commit [1] ?
>
> The customer tried reproducing the issue with 92f1655aa2b22 ("net: fix __dst_negative_advice() race") reverted and the issue appeared again. My next step was capturing more points within the stacktraces when dst refcounts are changed (I do not have full stack traces - the trace entries I store in the hash contain only the instruction pointer and the parent function). This is the trace for the leaked dst object:
> > dst ff1c4157413bf900
> > alloc: 1 destroy:  obj diff: 1
> > hold ops: 5 put ops: 4 refcnt diff: 1
> > Function                            Parent                           Op  Net            Device Dst              Number of Calls
> > inet6_csk_route_socket+0x1c2/0x2d0: inet6_csk_update_pmtu+0x58/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > __ip6_rt_update_pmtu+0x183/0x3c0: inet6_csk_update_pmtu+0x4b/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > ip6_dst_lookup_flow+0x4f/0x1d0: inet6_csk_route_socket+0x198/0x2d0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > ip6_dst_ifdown+0x5/0x250: dst_dev_put+0x2d/0xd0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > tcp_v6_connect+0x37e/0x790: __inet_stream_connect+0x2c8/0x3a0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > ip6_dst_lookup_flow+0x4f/0x1d0: tcp_v6_connect+0x320/0x790: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > __dev_queue_xmit+0x1af/0xd20: ip6_finish_output2+0x1f1/0x6e0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675
> > inet6_csk_xmit+0xa1/0x150: __tcp_transmit_skb+0x5f8/0xd40: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675
> > Function                            Parent                           Op      Net            Device Dst              Refcount Diff
> > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x215/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > ip6_negative_advice+0x66/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > dst_alloc+0x5e/0x180: ip6_dst_alloc+0x27/0x60: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x10f/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > dst_release+0x32/0x140: ip6_negative_advice+0x137/0x2d0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > dst_release+0x32/0x140: inet_sock_destruct+0x146/0x1c0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > dst_release+0x32/0x140: rt6_remove_exception.part.53+0x7f/0xe0: dst_put ffffffff90ee2a80 blackhole_dev ff1c4157413bf900 1
> > Function                 Parent                      Op     Net              Dst              Socket           Number of Calls
> > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1
> > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418e49bac280 1
> > sk_setup_caps+0x5/0x200: inet6_csk_route_socket+0x23c/0x2d0: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418ea04f7200 1
>
> The refcount of the dst was changed only when a dst_hold or dst_put operation was logged. The hold in dst_alloc() is balanced by the dst_release() in rt6_remove_exception(). The three holds in ip6_route_output_flags() belonged to the references held by sockets which were balanced by the dst_release() in inet_sock_destruct(). But that leaves one hold operation outstanding and unbalanced. There is a put operation for expired cached dst objects in ip6_negative_advice() but that one is intentionally balanced by a hold in the same function.
>
This makes sense to me.

The code prior to 92f1655aa2b22 ("net: fix __dst_negative_advice()
race") had no longer been able to release the cached dst for the
reference held by socket in ip6_negative_advice() since
rt6_remove_exception_rt() is called there.

Hi, David Ahern,

Can you confirm this?

Thanks.

> Accoding to the sk_dst trace entries, three sockets held a reference to the leaked dst. All of these sockets were eventually distroyed so there was no socket leak causing the dst leak, which rules out one of the possbile hypotheses. The dst was stored in the socket object while a connect() sycall was being executed for two of the sockets. An IMCPv6 packet carrying a Packet Too Big message was received by the third socket, which lead to the execution of routines changing the MTU for the dst and storing a reference to the dst in the socket object. As for races, sockets are locked in inet_stream_connect, tcp_write_timer() and tcp_v6_err(), and the mutual exclusion prevents races when reference to dst objects are set and reset for sockets. The put operation executed from dst_release() in inet_sock_destruct() is not part of a critical section protected by mutual exclusion but it is guaranteed to run only after the refcount of the socket has reached zero. All the other code paths in the trace increment the refcount of the socket while they manipulate dsts (the connect() syscall uses the refcount incremented by the socket() syscall).
>
> This is a trace for the socket for which ip6_negative_advice() was executed for an expired cached dst:
> > socket ff1c4158c07aaf80
> > alloc: 1 destroy: 1 obj diff: 0
> > hold ops: 9 put ops: 9 refcnt diff: 0
> > ip6_negative_advice+0xcd/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > tcp_retransmit_timer+0x50d/0xac0: tcp_write_timer_handler+0xba/0x1f0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 2
> > ip6_negative_advice+0x169/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: sk_dst ff1c4151bcac1080 0 ff1c4158c07aaf80 3
> > sk_setup_caps+0x5/0x200: inet6_sk_rebuild_header+0x1f9/0x2a0: sk_dst ff1c4151bcac1080 ff1c4153a762d500 ff1c4158c07aaf80 5
> > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1
> > sk_free+0x5/0x100: tcp_close+0x100/0x120: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > __sk_free+0x5/0x190: sk_free+0xe3/0x100: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > sk_destruct+0x5/0x150: __sk_free+0x83/0x190: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > sk_alloc+0x195/0x2a0: inet6_create+0xd5/0x450: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > sk_free+0x5/0x100: ip6_rcv_core.isra.25+0x27c/0x440: sk ff1c4151bcac1080 0 ff1c4158c07aaf80 7
> > sk_reset_timer+0x7b/0x130: tcp_connect+0x83f/0xe00: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > __tcp_close+0x15d/0x4c0: tcp_close+0x35/0x120: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > sock_init_data_uid+0x107/0x2d0: inet6_create+0xec/0x450: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > sk_reset_timer+0x7b/0x130: tcp_retransmit_timer+0x4e3/0xac0: sk_hold ff1c4151bcac1080 0 ff1c4158c07aaf80 6
> > tcp_close+0x5/0x120: inet_release+0x3c/0x80: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > inet_csk_destroy_sock+0x90/0x1e0: __tcp_close+0x2ca/0x4c0: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 1
> > tcp_write_timer+0x61/0x260: call_timer_fn+0x27/0x130: sk_put ff1c4151bcac1080 0 ff1c4158c07aaf80 7
>
> Besides dst ff1c4157413bf900, this sockets held a reference to another dst - ff1c4153a762d500. Regardless of the order in which the dst references were held, the extra dst_hold() in ip6_negative_advice() would preserve the refcount increment owned by the socket while also resetting the sk_dst_cache member of the socket to NULL, rendering it impossible for the socket to decrement the refcount in the future. We see that the socket was destroyed but the sk_dst_cache member was most probably NULL at the time because all the dst references were reset in tcp_retransmit_timer() and ip6_negative_advice().
>
> Based on this data, I went back to recheck the kprobe module that was meant to implement a dst_release() in ip6_negative_advice(). I found a mistake in the module that resulted in executing:
> > static void ip6_negative_advice(struct sock *sk, struct dst_entry *dst)
> > {
> >         struct rt6_info *rt = (struct rt6_info *) dst;
> >         if (rt->rt6i_flags & RTF_CACHE) {
> >                 rcu_read_lock();
> >                 if (rt6_check_expired(rt)) {
> >                         /* counteract the dst_release() in sk_dst_reset() */
> >                         dst_hold(dst);
> >                         sk_dst_reset(sk);
> >                         rt6_remove_exception_rt(rt);
> >                 }
> > --->            kprobe_calls_dst_release;
> >                 rcu_read_unlock();
> >                 return;
> >         }
> >         sk_dst_reset(sk);
> > }
> instead of this:
> > static void ip6_negative_advice(struct sock *sk, struct dst_entry *dst)
> > {
> >         struct rt6_info *rt = (struct rt6_info *) dst;
> >         if (rt->rt6i_flags & RTF_CACHE) {
> >                 rcu_read_lock();
> >                 if (rt6_check_expired(rt)) {
> >                         /* counteract the dst_release() in sk_dst_reset() */
> >                         dst_hold(dst);
> > --->                    kprobe_calls_dst_release;
> >                         sk_dst_reset(sk);
> >                         rt6_remove_exception_rt(rt);
> >                 }
> >                 rcu_read_unlock();
> >                 return;
> >         }
> >         sk_dst_reset(sk);
> > }
> which clearly shows why the test with the module resulted in negative overflow of dst refcounts. I fixed the module and further testing showed the issue is no longer reproducible.
>
> I have managed to put together a minimal set of steps needed to reproduce the issue:
> ip link add veth1 mtu 65535 type veth peer veth0 mtu 65535
> ip netns add ns0
> ip link set veth1 netns ns0
> ip addr add fd00::1/24 dev veth0
> ip -n ns0 addr add fd00::2/24 dev veth1
> ip link set up dev veth0
> ip -n ns0 link set up dev lo
> ip -n ns0 link set up dev veth1
> ip -n ns0 route add default via fd00::1 dev veth1
>
> ip link add veth3 mtu 65535 type veth peer veth2 mtu 65535
> ip netns add ns2
> ip link set veth3 netns ns2
> ip addr add fd02::1/24 dev veth2
> ip -n ns2 addr add fd02::2/24 dev veth3
> ip link set up dev veth2
> ip -n ns2 link set up dev lo
> ip -n ns2 link set up dev veth3
> ip -n ns2 route add default via fd02::1 dev veth3
>
> ip netns exec ns0 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires"
> #ip netns exec ns2 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires"
> ip netns exec ns0 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval"
> #ip netns exec ns2 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval"
> sleep 30
>
> ip6tables -F
> ip6tables -A FORWARD -i veth0 -d fd02::/24 -j ACCEPT
> ip6tables -A FORWARD -i veth2 -d fd00::/24 -j ACCEPT
> echo 1 > /proc/sys/net/ipv6/conf/all/forwarding
> ip6tables -L -v
>
> (ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$!
> sleep 1
> dd if=/dev/zero bs=1M count=100 | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$!
> sleep 1
> kill $clnt $serv
> wait $serv
>
> (ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$!
> sleep 1
> dd if=/dev/zero bs=1M | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$!
> sleep 1
> ip link set veth2 mtu 2000
> sleep 1
> ip6tables -D FORWARD -i veth2 -d fd00::/24 -j ACCEPT
> ip6tables -A FORWARD -i veth2 -d fd00::/24 -j DROP
>
> sleep 10
> kill $clnt $serv
> wait $serv
>
> ip -n ns0 link set down dev lo
> ip -n ns0 link set down dev veth1
> ip -n ns0 link delete dev veth1
> ip netns delete ns0
>
> ip -n ns2 link set down dev lo
> ip -n ns2 link set down dev veth3
> ip -n ns2 link delete dev veth3
> ip netns delete ns2
>
> I was able to reproduce the dst leak under 6.12-rc7. I will submit a fix shortly.
> --
> Jiri Wiesner
> SUSE Labs