netdev - Re: [RFC PATCH] ipv6: route: release reference of dsts cached in sockets

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADvbK_cmi_ppJyPwmh77dHgkm=Lh52vtEWddwSAFNhZpmmev6Q@mail.gmail.com>
Date: Wed, 2 Oct 2024 16:27:55 -0400
From: Xin Long <lucien.xin@...il.com>
To: Jiri Wiesner <jwiesner@...e.de>
Cc: Eric Dumazet <edumazet@...gle.com>, netdev@...r.kernel.org, 
	David Ahern <dsahern@...nel.org>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	"David S. Miller" <davem@...emloft.net>
Subject: Re: [RFC PATCH] ipv6: route: release reference of dsts cached in sockets

On Tue, Oct 1, 2024 at 11:26 AM Jiri Wiesner <jwiesner@...e.de> wrote:
>
> On Mon, Sep 30, 2024 at 08:27:50PM +0200, Eric Dumazet wrote:
> > On Mon, Sep 30, 2024 at 8:09 PM Jiri Wiesner <jwiesner@...e.de> wrote:
> > >
> > > An unbalanced refcount was reported for the loopback device of a net
> > > namespace being destroyed:
> > > unregister_netdevice: waiting for lo to become free. Usage count = 2
> > >
> > > Analysis revealed that the IPv6 net device corresponding to the loopback
> > > device did not get destroyed (in6_dev_finish_destroy() was not called).
> > > The IPv6 loopback device had an unbalaced refcount:
> > > net dev 73da100 lo refcount 1
> > > Operation                     Count Balancing Operation     Count
> > > hold  __ipv6_dev_mc_inc           2 ma_put                      2
> > >       addrconf_ifdown             1                             0 unbalanced
> > > hold  fib6_nh_init                2 fib6_nh_init                2
> > > put   inet6_ifa_finish_destroy    1 ipv6_add_addr               1
> > >       ip6_dst_destroy            90                             0 unbalanced
> > >       ip6_dst_ifdown             90                             0 unbalanced
> > > hold  ip6_route_dev_notify        6 ip6_route_dev_notify        6
> > > hold  ipv6_add_addr               1 inet6_ifa_finish_destroy    1
> > > put   ma_put                      2 __ipv6_dev_mc_inc           2
> > > hold  ndisc_netdev_event          2 ndisc_netdev_event          2
> > >       rt6_disable_ip              1                             0 unbalanced
> > >
> > > The refcount of addrconf_ifdown() balances the refcount increment in
> > > ipv6_add_dev(), which had no corresponding trace entry. The
> > > rt6_disable_ip() and ip6_dst_ifdown() entries were hold operations on the
> > > looback device, and the ip6_dst_destroy() entries were put operations. One
> > > refcount decrement in ip6_dst_destroy() was not executed. At this point, a
> > > hash was implemented in the debug kernel to hold the changes of the
> > > refcount of dst objects per namespace. The trace for the dst object that
> > > did not decrement the IPv6 refcount of loopback follows:
> > >
> > > Function        Parent       Op  Net            Device Dst              Refcount Diff
> > > ip6_dst_ifdown: dst_dev_put: dst ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > > ip6_negative_advice: tcp_retransmit_timer: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > > dst_alloc: ip6_dst_alloc: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > > ip6_route_output_flags: ip6_dst_lookup_tail: dst_hold ff404b2f073da100 eth0 ff404af71ffc9c00 84
> > > dst_release: ip6_negative_advice: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 1
> > > dst_release: tcp_retransmit_timer: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 20
> > > dst_release: inet_sock_destruct: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 29
> > > dst_release: __dev_queue_xmit: dst_put ff404b2f073da100 eth0 ff404af71ffc9c00 34
> > > dst_release: rt6_remove_exception: dst_put ffffffff9c8e2a80 blackhole_dev ff404af71ffc9c00 1
> > >
> > > The ip6_dst_ifdown() trace entry was neither a hold nor a put - it merely
> > > indicates that the net device of the dst object was changed to
> > > blackhole_dev and the IPv6 refcount was transferred onto the loopback
> > > device. There was no ip6_dst_destroy() trace entry, which means the dst
> > > object was not destroyed. There were 86 hold operations but only 85 put
> > > operations so the dst object was not destroyed because the refcount of the
> > > dst object was unbalanced.
> > >
> > > The problem is that the refcount sums are ambiguous. The most probable
> > > explanation is this: The dst object was a route for an IPv6 TCP connection
> > > that kept timing out. Sometimes, the process closed the socket, which
> > > corresponds to the refcount decrements of the
> > > dst_release()/inet_sock_destruct() entries. Sometimes, the TCP retransmit
> > > timer reset the dst of the sockets, which corresponds to the
> > > dst_release()/tcp_retransmit_timer() entries. I am unsure about the
> > > dst_release()/__dev_queue_xmit() entries because inet6_csk_xmit() sets
> > > skb->_skb_refdst with SKB_DST_NOREF.
> > >
> > > The feature that sets the above trace apart from all the other dst traces
> > > is the execution of ip6_negative_advice() for a cached and also expired
> > > dst object in the exception table. The cached and expired dst object has
> > > its refcount set to at least 2 before executing rt6_remove_exception_rt()
> > > found in ip6_negative_advice(). One decrement happens in
> > > rt6_remove_exception() after the dst object has been removed from the
> > > exception table. The other decrement happens in sk_dst_reset() but that
> > > one is counteracted by a dst_hold() intentionally placed just before the
> > > sk_dst_reset() in ip6_negative_advice(). The probem is that a socket that
> > > keeps a reference to a dst in its sk_dst_cache member increments the
> > > refcount of the dst by 1. This is apparent in the following code paths:
> > > * When ip6_route_output_flags() finds a dst that is then stored in
> > >   sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket().
> > > * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache
> > > Provided the dst is not kept in the sk_dst_cache member of another socket,
> > > there is no other object tied to the dst (the socket lost its reference
> > > and the dst is no longer in the exception table) and the dst becomes a
> > > leaked object after ip6_negative_advice() finishes. This leak then
> > > precludes the net namespace from being destroyed.
> > >
> > > The patch that introduced the dst_hold() in ip6_negative_advice() was
> > > 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22
> > > only refactored the code with regards to the dst refcount so the issue was
> > > present even before 92f1655aa2b22.
> > >
> > > Signed-off-by: Jiri Wiesner <jwiesner@...e.de>
> > > ---
> > > At the moment, I am sending this as an RFC because I am not able to
> > > reproduce the issue in-house. The customer that encountered the issue is
> > > currently running tests. For the customer's testing, I fixed the issue
> > > with a kprobe module that calls dst_release() right after
> > > rt6_remove_exception_rt() returns in ip6_negative_advice(), which is not
> > > quite the same as the change proposed below.
> > >
> > >  net/ipv6/route.c | 3 ---
> > >  1 file changed, 3 deletions(-)
> > >
> > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > > index b4251915585f..b70267c8d251 100644
> > > --- a/net/ipv6/route.c
> > > +++ b/net/ipv6/route.c
> > > @@ -2780,10 +2780,7 @@ static void ip6_negative_advice(struct sock *sk,
> > >         if (rt->rt6i_flags & RTF_CACHE) {
> > >                 rcu_read_lock();
> > >                 if (rt6_check_expired(rt)) {
> > > -                       /* counteract the dst_release() in sk_dst_reset() */
> > > -                       dst_hold(dst);
> > >                         sk_dst_reset(sk);
> > > -
> > >                         rt6_remove_exception_rt(rt);
> > >                 }
> > >                 rcu_read_unlock();
> > > --
> >
> > Interesting, what kernel version is your customer using ?
>
> It is a distribution kernel (SLES 15 SP5) that is based on 5.14. The newest SLES service pack, SP6, has a 6.4-based kernel. But the customer has yet to move to that one.
>
> > I think that with recent kernels (after 5.18), we do not see issues
> > because of the use of blackhole_netdev
> > instead of loopback.
>
> Thank you very much for mentioning that. I will be backporting these patches to SUSE's 5.14-based kernels:
> > 9cc341286e99 ("dn_route: set rt neigh to blackhole_netdev instead of loopback_dev in ifdown")
> > 4d33ab08c0af ("xfrm: set dst dev to blackhole_netdev instead of loopback_dev in ifdown")
> > dd263a8cb194 ("ipv6: blackhole_netdev needs snmp6 counters")
> > e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev")
> These patches will not resolve the dst leak but being able to destroy net namespaces at the expense of sustaining infrequently occurring dst leaks is a favourable trade.
>
> The testing results (debugging SLES15 SP5 kernel plus the kprobe module calling dst_release() in probe_ip6_negative_advice()) came back and the kernel log indicates negative overflow of the dst refcount. The first machine:
> > [ 9366.700330] dst_release underflow
> > [ 9366.700346] WARNING: CPU: 42 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0
> > [ 9366.701492] Call Trace:
> > [ 9366.701497]  <IRQ>
> > [ 9366.701506]  probe_ip6_negative_advice+0x12/0x30 [probe 0c57153a505ed7cc20e2c5db9522a1a524ec4f56]
> > [ 9366.701514]  opt_pre_handler+0x3d/0x70
> > [ 9366.701528]  optimized_callback+0x64/0x90
> > [ 9366.701538]  0xffffffffc0476032
> > [ 9366.701548]  ? ip6_negative_advice+0xdf/0x180
> > [ 9366.701558]  ? tcp_retransmit_timer+0x658/0x930
> > [ 9366.701595]  ? tcp_write_timer_handler+0xba/0x1f0
> > [ 9366.701601]  ? tcp_write_timer+0x211/0x260
> > [ 9366.701624]  ? call_timer_fn+0x27/0x130
> > [ 9366.701631]  ? run_timer_softirq+0x443/0x480
> > [ 9366.701645]  ? __do_softirq+0xd2/0x2c0
> > [ 9366.701654]  ? irq_exit_rcu+0xa4/0xc0
> > [ 9366.701661]  ? sysvec_apic_timer_interrupt+0x50/0x90
> > [ 9366.701667]  </IRQ>
> > [ 9366.701744] dst_release: dst:00000000ac3b9120 refcnt:-1
> > [ 9366.701757] dst_release: dst:00000000ac3b9120 refcnt:-2
> > [ 9382.832164] dst_release: dst:00000000ac3b9120 refcnt:-3
> > [ 9659.561679] dst_release: dst:00000000494633c6 refcnt:-1
> Obviously, the refcount of the cached and expired dst, dst:00000000ac3b9120, was 1 before probe_ip6_negative_advice() was called from tcp_write_timeout()/tcp_retransmit_timer(). So, my supposition that cached dsts being passed to probe_ip6_negative_advice() have a refcount of at least 2 was wrong.
>
> The second machine:
> [11479.272704] dst_release underflow
> [11479.272718] WARNING: CPU: 52 PID: 0 at ../net/core/dst.c:176 dst_release+0x19f/0x1b0
> [11479.272924] Call Trace:
> [11479.272928]  <IRQ>
> [11479.272932]  tcp_retransmit_timer+0x47c/0x930
> [11479.272953]  tcp_write_timer_handler+0xba/0x1f0
> [11479.272956]  tcp_write_timer+0x211/0x260
> [11479.272962]  call_timer_fn+0x27/0x130
> [11479.272966]  run_timer_softirq+0x443/0x480
> [11479.272973]  __do_softirq+0xd2/0x2c0
> [11479.272980]  irq_exit_rcu+0xa4/0xc0
> [11479.272985]  sysvec_apic_timer_interrupt+0x50/0x90
> [11479.272989]  </IRQ>
> [11479.273044] dst_release: dst:00000000ff7e6845 refcnt:-1
> [11681.380331] dst_release: dst:00000000893d8a2d refcnt:-1
> In this case, the refcount of dst:00000000ff7e6845 was have been 2 before probe_ip6_negative_advice() so the dst_release() called from the kprobe handler decremented the refcount to 0. No warning was printed. There probably was another socket having the dst stored in sk_dst_cache, to which the refcount of 1 belonged. When the second __sk_dst_reset() was called in tcp_retransmit_timer(), which was the tcp_retransmit_timer+0x47c site, the refcount dropped below 0.
>
> I am afraid this patch is misguided. I would still like to find the source of the dst leak but I am also running out of time which the customer is willing to invest into investigating this issue.
Is your kernel including this commit?

commit 28044fc1d4953b07acec0da4d2fc4784c57ea6fb
Author: Joanne Koong <joannelkoong@...il.com>
Date:   Mon Aug 22 11:10:21 2022 -0700

    net: Add a bhash2 table hashed by port and address

After this commit, it seems in tcp_v6_connect(), the 'goto failure'
may cause a dst leak.:

        dst = ip6_dst_lookup_flow(net, sk, &fl6, final_p);
        ...
        if (!saddr) {
                saddr = &fl6.saddr;

                err = inet_bhash2_update_saddr(sk, saddr, AF_INET6);
                if (err)
                        goto failure; <---
        }
        ...
        ip6_dst_store(sk, dst, NULL, NULL);