netdev - Re: [RFC PATCH] ipv6: route: release reference of dsts cached in sockets

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241113105912.GB19776@incl>
Date: Wed, 13 Nov 2024 11:59:12 +0100
From: Jiri Wiesner <jwiesner@...e.de>
To: Xin Long <lucien.xin@...il.com>
Cc: Eric Dumazet <edumazet@...gle.com>, netdev@...r.kernel.org,
	David Ahern <dsahern@...nel.org>, Jakub Kicinski <kuba@...nel.org>,
	Paolo Abeni <pabeni@...hat.com>,
	"David S. Miller" <davem@...emloft.net>
Subject: Re: [RFC PATCH] ipv6: route: release reference of dsts cached in
 sockets

On Tue, Nov 12, 2024 at 04:51:27PM -0500, Xin Long wrote:
> On Tue, Nov 12, 2024 at 3:57 AM Jiri Wiesner <jwiesner@...e.de> wrote:
> >
> > On Sun, Oct 06, 2024 at 02:25:25PM -0400, Xin Long wrote:
> > > We recently also encountered this
> > >
> > >   'unregister_netdevice: waiting for lo to become free. Usage count = X'
> > >
> > > problem on our customer env after backporting
> > >
> > >   Commit 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). [1]
> > >
> > > The commit looks correct to me, so I guess it may uncover some existing
> > > issues.
> > >
> > > As it took a very long time to get reproduced on our customer env, which
> > > made it impossible to debug. Also the issue existed even after
> > > disabling IPv6.
> > >
> > > It seems much easier to reproduce it on your customer env. So I'm wondering
> > >
> > > - Was the testing on your customer env related to IPv6 ?
> > > - Does the issue still exist after reverting the commit [1] ?
> >
> > The customer tried reproducing the issue with 92f1655aa2b22 ("net: fix __dst_negative_advice() race") reverted and the issue appeared again. My next step was capturing more points within the stacktraces when dst refcounts are changed (I do not have full stack traces - the trace entries I store in the hash contain only the instruction pointer and the parent function). This is the trace for the leaked dst object:
> > > dst ff1c4157413bf900
> > > alloc: 1 destroy:  obj diff: 1
> > > hold ops: 5 put ops: 4 refcnt diff: 1
> > > Function                            Parent                           Op  Net            Device Dst              Number of Calls
> > > inet6_csk_route_socket+0x1c2/0x2d0: inet6_csk_update_pmtu+0x58/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > __ip6_rt_update_pmtu+0x183/0x3c0: inet6_csk_update_pmtu+0x4b/0x90: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > ip6_dst_lookup_flow+0x4f/0x1d0: inet6_csk_route_socket+0x198/0x2d0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > ip6_dst_ifdown+0x5/0x250: dst_dev_put+0x2d/0xd0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > tcp_v6_connect+0x37e/0x790: __inet_stream_connect+0x2c8/0x3a0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > > ip6_dst_lookup_flow+0x4f/0x1d0: tcp_v6_connect+0x320/0x790: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > > __dev_queue_xmit+0x1af/0xd20: ip6_finish_output2+0x1f1/0x6e0: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675
> > > inet6_csk_xmit+0xa1/0x150: __tcp_transmit_skb+0x5f8/0xd40: dst ff1c4151bcac1080 eth0 ff1c4157413bf900 675
> > > Function                            Parent                           Op      Net            Device Dst              Refcount Diff
> > > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x215/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > ip6_negative_advice+0x66/0x2d0: tcp_retransmit_timer+0x5ed/0xac0: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > dst_alloc+0x5e/0x180: ip6_dst_alloc+0x27/0x60: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > ip6_route_output_flags+0x76/0x230: ip6_dst_lookup_tail+0x10f/0x250: dst_hold ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > > dst_release+0x32/0x140: ip6_negative_advice+0x137/0x2d0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 1
> > > dst_release+0x32/0x140: inet_sock_destruct+0x146/0x1c0: dst_put ff1c4151bcac1080 eth0 ff1c4157413bf900 2
> > > dst_release+0x32/0x140: rt6_remove_exception.part.53+0x7f/0xe0: dst_put ffffffff90ee2a80 blackhole_dev ff1c4157413bf900 1
> > > Function                 Parent                      Op     Net              Dst              Socket           Number of Calls
> > > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c4158c07aaf80 1
> > > sk_setup_caps+0x5/0x200: tcp_v6_connect+0x428/0x790: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418e49bac280 1
> > > sk_setup_caps+0x5/0x200: inet6_csk_route_socket+0x23c/0x2d0: sk_dst ff1c4151bcac1080 ff1c4157413bf900 ff1c418ea04f7200 1
> >
> > The refcount of the dst was changed only when a dst_hold or dst_put operation was logged. The hold in dst_alloc() is balanced by the dst_release() in rt6_remove_exception(). The three holds in ip6_route_output_flags() belonged to the references held by sockets which were balanced by the dst_release() in inet_sock_destruct(). But that leaves one hold operation outstanding and unbalanced. There is a put operation for expired cached dst objects in ip6_negative_advice() but that one is intentionally balanced by a hold in the same function.
> >
> This makes sense to me.
> 
> The code prior to 92f1655aa2b22 ("net: fix __dst_negative_advice()
> race") had no longer been able to release the cached dst for the
> reference held by socket in ip6_negative_advice() since
> rt6_remove_exception_rt() is called there.
> 
> Hi, David Ahern,
> 
> Can you confirm this?

I have submitted "[PATCH net] net/ipv6: release expired exception dst cached in socket" which should have clearer explanations not burdened by the debugging process. I initially debugged the issue on a customer's server without having any idea how to reproduce it. I then reconstructed the steps to reproduce the issue from the tracing data, which allowed me to check for the issue in the latest upstream kernel.
-- 
Jiri Wiesner
SUSE Labs