[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241001152609.GA24007@incl>
Date: Tue, 1 Oct 2024 17:26:09 +0200
From: Jiri Wiesner <jwiesner@...e.de>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, David Ahern <dsahern@...nel.org>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
"David S. Miller" <davem@...emloft.net>
Subject: Re: [RFC PATCH] ipv6: route: release reference of dsts cached in
sockets
On Mon, Sep 30, 2024 at 08:27:50PM +0200, Eric Dumazet wrote:
> On Mon, Sep 30, 2024 at 8:09 PM Jiri Wiesner <jwiesner@...e.de> wrote:
> >
> > An unbalanced refcount was reported for the loopback device of a net
> > namespace being destroyed:
> > unregister_netdevice: waiting for lo to become free. Usage count = 2
> >
> > Analysis revealed that the IPv6 net device corresponding to the loopback
> > device did not get destroyed (in6_dev_finish_destroy() was not called).
> > The IPv6 loopback device had an unbalaced refcount:
> > net dev 73da100 lo refcount 1
> > Operation Count Balancing Operation Count
> > hold __ipv6_dev_mc_inc 2 ma_put 2
> > addrconf_ifdown 1 0 unbalanced
> > hold fib6_nh_init 2 fib6_nh_init 2
> > put inet6_ifa_finish_destroy 1 ipv6_add_addr 1
> > ip6_dst_destroy 90 0 unbalanced
> > ip6_dst_ifdown 90 0 unbalanced
> > hold ip6_route_dev_notify 6 ip6_route_dev_notify 6
> > hold ipv6_add_addr 1 inet6_ifa_finish_destroy 1
> > put ma_put 2 __ipv6_dev_mc_inc 2
> > hold ndisc_netdev_event 2 ndisc_netdev_event 2
> > rt6_disable_ip 1 0 unbalanced
> >
> > The refcount of addrconf_ifdown() balances the refcount increment in
> > ipv6_add_dev(), which had no corresponding trace entry. The
> > rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the
> > looback device, and the ip6_dst_destroy() entries were put operations. One
> > refcount decrement in ip6_dst_destroy() was not executed. At this point, a
> > hash was implemented in the debug kernel to hold the changes of the
> > refcount of dst objects per namespace. The trace for the dst object that
> > did not decrement the IPv6 refcount of loopback follows:
> >
> > Function Parent Op Net Device Dst Refcount Diff
> > ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84
> > dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20
> > dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29
> > dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34
> > dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1
> >
> > The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely
> > indicates that the net device of the dst object was changed to
> > blackhole_dev and the IPv6 refcount was transferred onto the loopback
> > device. There was no ip6_dst_destroy() trace entry, which means the dst
> > object was not destroyed. There were 86 hold operations but only 85 put
> > operations so the dst object was not destroyed because the refcount of the
> > dst object was unbalanced.
> >
> > The problem is that the refcount sums are ambiguous. The most probable
> > explanation is this: The dst object was a route for an IPv6 TCP connection
> > that kept timing out. Sometimes, the process closed the socket, which
> > corresponds to the refcount decrements of the
> > dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit
> > timer reset the dst of the sockets, which corresponds to the
> > dst_release()/tcp_retransmit_timer() entries. I am unsure about the
> > dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets
> > skb->_skb_refdst with SKB_DST_NOREF.
> >
> > The feature that sets the above trace apart from all the other dst traces
> > is the execution of ip6_negative_advice() for a cached and also expired
> > dst object in the exception table. The cached and expired dst object has
> > its refcount set to at least 2 before executing rt6_remove_exception_rt()
> > found in ip6_negative_advice(). One decrement happens in
> > rt6_remove_exception() after the dst object has been removed from the
> > exception table. The other decrement happens in sk_dst_reset() but that
> > one is counteracted by a dst_hold() intentionally placed just before the
> > sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that
> > keeps a reference to a dst in its sk_dst_cache member increments the
> > refcount of the dst by 1. This is apparent in the following code paths:
> > * When ip6_route_output_flags() finds a dst that is then stored in
> > sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket().
> > * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache
> > Provided the dst is not kept in the sk_dst_cache member of another socket,
> > there is no other object tied to the dst (the socket lost its reference
> > and the dst is no longer in the exception table) and the dst becomes a
> > leaked object after ip6_negative_advice() finishes. This leak then
> > precludes the net namespace from being destroyed.
> >
> > The patch that introduced the dst_hold() in ip6_negative_advice() was
> > 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22
> > only refactored the code with regards to the dst refcount so the issue was
> > present even before 92f1655aa2b22.
> >
> > Signed-off-by: Jiri Wiesner <jwiesner@...e.de>
> > ---
> > At the moment, I am sending this as an RFC because I am not able to
> > reproduce the issue in-house. The customer that encountered the issue is
> > currently running tests. For the customer's testing, I fixed the issue
> > with a kprobe module that calls dst_release() right after
> > rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not
> > quite the same as the change proposed below.
> >
> > net/ipv6/route.c | 3 ---
> > 1 file changed, 3 deletions(-)
> >
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index b4251915585f..b70267c8d251 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk,
> > if (rt->rt6i_flags & RTF_CACHE) {
> > rcu_read_lock();
> > if (rt6_check_expired(rt)) {
> > - /* counteract the dst_release() in sk_dst_reset() */
> > - dst_hold(dst);
> > sk_dst_reset(sk);
> > -
> > rt6_remove_exception_rt(rt);
> > }
> > rcu_read_unlock();
> > --
>
> Interesting, what kernel version is your customer using ?
It is a distribution kernel (SLES 15 SP5) that is based on 5.14. The newest SLES service pack, SP6, has a 6.4-based kernel. But the customer has yet to move to that one.
> I think that with recent kernels (after 5.18), we do not see issues
> because of the use of blackhole_netdev
> instead of loopback.
Thank you very much for mentioning that. I will be backporting these patches to SUSE's 5.14-based kernels:
> 9cc341286e99 ("dn_route: set rt neigh to blackhole_netdev instead of loopback_dev in ifdown")
> 4d33ab08c0af ("xfrm: set dst dev to blackhole_netdev instead of loopback_dev in ifdown")
> dd263a8cb194 ("ipv6: blackhole_netdev needs snmp6 counters")
> e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev")
These patches will not resolve the dst leak but being able to destroy net namespaces at the expense of sustaining infrequently occurring dst leaks is a favourable trade.
The testing results (debugging SLES15 SP5 kernel plus the kprobe module calling dst_release() in probe_ip6_negative_advice()) came back and the kernel log indicates negative overflow of the dst refcount. The first machine:
> [ 9366.700330] dst_release underflow
> [ 9366.700346] WARNING: CPU: 42 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0
> [ 9366.701492] Call Trace:
> [ 9366.701497] <IRQ>
> [ 9366.701506] probe_ip6_negative_advice+0x12/0x30 [probe 0c57153a505ed7cc20e2c5db9522a1a524ec4f56]
> [ 9366.701514] opt_pre_handler+0x3d/0x70
> [ 9366.701528] optimized_callback+0x64/0x90
> [ 9366.701538] 0xffffffffc0476032
> [ 9366.701548] ? ip6_negative_advice+0xdf/0x180
> [ 9366.701558] ? tcp_retransmit_timer+0x658/0x930
> [ 9366.701595] ? tcp_write_timer_handler+0xba/0x1f0
> [ 9366.701601] ? tcp_write_timer+0x211/0x260
> [ 9366.701624] ? call_timer_fn+0x27/0x130
> [ 9366.701631] ? run_timer_softirq+0x443/0x480
> [ 9366.701645] ? __do_softirq+0xd2/0x2c0
> [ 9366.701654] ? irq_exit_rcu+0xa4/0xc0
> [ 9366.701661] ? sysvec_apic_timer_interrupt+0x50/0x90
> [ 9366.701667] </IRQ>
> [ 9366.701744] dst_release: dst:00000000ac3b9120 refcnt:-1
> [ 9366.701757] dst_release: dst:00000000ac3b9120 refcnt:-2
> [ 9382.832164] dst_release: dst:00000000ac3b9120 refcnt:-3
> [ 9659.561679] dst_release: dst:00000000494633c6 refcnt:-1
Obviously, the refcount of the cached and expired dst, dst:00000000ac3b9120, was 1 before probe_ip6_negative_advice() was called from tcp_write_timeout()/tcp_retransmit_timer(). So, my supposition that cached dsts being passed to probe_ip6_negative_advice() have a refcount of at least 2 was wrong.
The second machine:
[11479.272704] dst_release underflow
[11479.272718] WARNING: CPU: 52 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0
[11479.272924] Call Trace:
[11479.272928] <IRQ>
[11479.272932] tcp_retransmit_timer+0x47c/0x930
[11479.272953] tcp_write_timer_handler+0xba/0x1f0
[11479.272956] tcp_write_timer+0x211/0x260
[11479.272962] call_timer_fn+0x27/0x130
[11479.272966] run_timer_softirq+0x443/0x480
[11479.272973] __do_softirq+0xd2/0x2c0
[11479.272980] irq_exit_rcu+0xa4/0xc0
[11479.272985] sysvec_apic_timer_interrupt+0x50/0x90
[11479.272989] </IRQ>
[11479.273044] dst_release: dst:00000000ff7e6845 refcnt:-1
[11681.380331] dst_release: dst:00000000893d8a2d refcnt:-1
In this case, the refcount of dst:00000000ff7e6845 was have been 2 before probe_ip6_negative_advice() so the dst_release() called from the kprobe handler decremented the refcount to 0. No warning was printed. There probably was another socket having the dst stored in sk_dst_cache, to which the refcount of 1 belonged. When the second __sk_dst_reset() was called in tcp_retransmit_timer(), which was the tcp_retransmit_timer+0x47c site, the refcount dropped below 0.
I am afraid this patch is misguided. I would still like to find the source of the dst leak but I am also running out of time which the customer is willing to invest into investigating this issue.
--
Jiri Wiesner
SUSE Labs
Powered by blists - more mailing lists