[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20241118190546.GC19776@incl>
Date: Mon, 18 Nov 2024 20:05:46 +0100
From: Jiri Wiesner <jwiesner@...e.de>
To: Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>,
David Ahern <dsahern@...nel.org>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Xin Long <lucien.xin@...il.com>,
yousaf.kaukab@...e.com, andreas.taschner@...e.com
Subject: Re: [PATCH net] net/ipv6: release expired exception dst cached in
socket
On Thu, Nov 14, 2024 at 12:04:36PM +0100, Paolo Abeni wrote:
> On 11/13/24 11:56, Jiri Wiesner wrote:
> > Dst objects get leaked in ip6_negative_advice() when this function is
> > executed for an expired IPv6 route located in the exception table. There
> > are several conditions that must be fulfilled for the leak to occur:
> > * an ICMPv6 packet indicating a change of the MTU for the path is received
> > * a TCP connection that uses the exception dst for routing packets must
> > start timing out so that TCP begins retransmissions
> > * after the exception dst expires, the FIB6 garbage collector must not run
> > before TCP executes ip6_negative_advice() for the expired exception dst
> >
> > The following steps reproduce the issue:
> >
> > ip link add veth1 mtu 65535 type veth peer veth0 mtu 65535
> > ip netns add ns0
> > ip link set veth1 netns ns0
> > ip addr add fd00::1/24 dev veth0
> > ip -n ns0 addr add fd00::2/24 dev veth1
> > ip link set up dev veth0
> > ip -n ns0 link set up dev lo
> > ip -n ns0 link set up dev veth1
> > ip -n ns0 route add default via fd00::1 dev veth1
> >
> > ip link add veth3 mtu 65535 type veth peer veth2 mtu 65535
> > ip netns add ns2
> > ip link set veth3 netns ns2
> > ip addr add fd02::1/24 dev veth2
> > ip -n ns2 addr add fd02::2/24 dev veth3
> > ip link set up dev veth2
> > ip -n ns2 link set up dev lo
> > ip -n ns2 link set up dev veth3
> > ip -n ns2 route add default via fd02::1 dev veth3
> >
> > ip netns exec ns0 bash -c "echo 6 > /proc/sys/net/ipv6/route/mtu_expires"
> > ip netns exec ns0 bash -c "echo 900 > /proc/sys/net/ipv6/route/gc_interval"
> > sleep 30
> >
> > ip6tables -A FORWARD -i veth0 -d fd02::/24 -j ACCEPT
> > ip6tables -A FORWARD -i veth2 -d fd00::/24 -j ACCEPT
> > echo 1 > /proc/sys/net/ipv6/conf/all/forwarding
> >
> > (ip netns exec ns2 netcat -6 -l -s fd02::2 -p 1234 &> /dev/null) & serv=$!
> > sleep 1
> > dd if=/dev/zero bs=1M | ip netns exec ns0 netcat -6 fd02::2 1234 & clnt=$!
> > sleep 1
> > ip link set veth2 mtu 2000
> > sleep 1
> > ip6tables -D FORWARD -i veth2 -d fd00::/24 -j ACCEPT
> > ip6tables -A FORWARD -i veth2 -d fd00::/24 -j DROP
> >
> > sleep 10
> > kill $clnt $serv
> > wait $serv
> >
> > ip6tables -D FORWARD -i veth0 -d fd02::/24 -j ACCEPT
> > ip6tables -D FORWARD -i veth2 -d fd00::/24 -j DROP
> >
> > ip -n ns0 link set down dev lo
> > ip -n ns0 link set down dev veth1
> > ip -n ns0 link delete dev veth1
> > ip netns delete ns0
> >
> > ip -n ns2 link set down dev lo
> > ip -n ns2 link set down dev veth3
> > ip -n ns2 link delete dev veth3
> > ip netns delete ns2
> >
> > This trace has been created with kprobes under kernel 6.12-rc7. Upon
> > receiving an ICMPv6 packet indicating an MTU change, exception dst
> > 0xff3027eec766c100 is created and inserted into the IPv6 exception table:
> > 3651.126884: rt6_insert_exception: (rt6_insert_exception+0x0/0x2b0) dst=0xff3027eec766c100 rcuref=0
> > 3651.126889: <stack trace>
> > => rt6_insert_exception+0x5/0x2b0
> > => __ip6_rt_update_pmtu+0x1ef/0x380
> > => inet6_csk_update_pmtu+0x4b/0x90
> > => tcp_v6_mtu_reduced.part.22+0x34/0xc0
> > The exception dst is used to route packets:
> > 3651.126902: inet6_csk_route_socket: (inet6_csk_update_pmtu+0x58/0x90 <- inet6_csk_route_socket) dst=0xff3027eec766c100
> > At this point, the connection has been severed and TCP starts retransmissions:
> > 3652.349466: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80
> > 3652.349497: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100
> > 3652.769469: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80
> > 3652.769495: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100
> > 3653.596135: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80
> > 3653.596156: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100
> > 3655.249465: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80
> > 3655.249490: inet6_csk_route_socket: (inet6_csk_xmit+0x56/0x130 <- inet6_csk_route_socket) dst=0xff3027eec766c100
> > 3658.689463: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80
> > When ip6_negative_advice() is executed the refcount is 2 - the increment
> > made by dst_init() and the increment made by the socket:
> > 3658.689475: ip6_negative_advice: (ip6_negative_advice+0x0/0xa0) net=0xff3027ef4a0f3780 sk=0xff3027eeeb1f3d80 dst=0xff3027eec766c100 rcuref=1
> > This is the result of dst_hold() and sk_dst_reset() in ip6_negative_advice():
> > 3658.689477: dst_release: (dst_release+0x0/0x80) dst=0xff3027eec766c100 rcuref=2
> > 3658.689498: rt6_remove_exception_rt: (rt6_remove_exception_rt+0x0/0xa0) dst=0xff3027eec766c100 rcuref=1
> > 3658.689501: rt6_remove_exception: (rt6_remove_exception.part.58+0x0/0xe0) dst=0xff3027eec766c100 rcuref=1
> > The refcount of dst 0xff3027eec766c100 is decremented by 1 as a result of
> > removing the exception from the exception table with
> > rt6_remove_exception_rt():
> > 3658.689505: dst_release: (dst_release+0x0/0x80) dst=0xff3027eec766c100 rcuref=1
> > The retransmissions continue without the exception dst being used for
> > routing packets:
> > 3662.352796: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef00cab780 sk=0xff3027eeeb1f0000
> > 3662.769470: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef00cab780 sk=0xff3027eeeb1f0000
> > 3663.596132: tcp_retransmit_timer: (tcp_retransmit_timer+0x0/0xba0) net=0xff3027ef00cab780 sk=0xff3027eeeb1f0000
> >
> > The ip6_dst_destroy() function was instrumented but there was no entry for
> > dst 0xff3027eec766c100 in the trace even long after the ns0 and ns2 net
> > namespaces had been destroyed. The refcount made by the socket was never
> > released. The refcount of the dst is decremented in sk_dst_reset() but
> > that decrement is counteracted by a dst_hold() intentionally placed just
> > before the sk_dst_reset() in ip6_negative_advice(). The probem is that
> > sockets that keep a reference to a dst in the sk_dst_cache member
> > increment the refcount of the dst by 1. This is apparent in the following
> > code paths:
> > * When ip6_route_output_flags() finds a dst that is then stored in
> > sk->sk_dst_cache by ip6_dst_store() called from inet6_csk_route_socket()
> > * When inet_sock_destruct() calls dst_release() for sk->sk_dst_cache
> > Provided the dst is not kept in the sk_dst_cache member of another socket,
> > there is no other object tied to the dst (the socket lost its reference
> > and the dst is no longer in the exception table) and the dst becomes a
> > leaked object after ip6_negative_advice() has finished.
> >
> > As a result of this dst leak, an unbalanced refcount is reported for the
> > loopback device of a net namespace being destroyed under kernels that do
> > not contain e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev"):
> > unregister_netdevice: waiting for lo to become free. Usage count = 2
> >
> > Fix the dst leak by removing the dst_hold() in ip6_negative_advice(). The
> > patch that introduced the dst_hold() in ip6_negative_advice() was
> > 92f1655aa2b22 ("net: fix __dst_negative_advice() race"). But 92f1655aa2b22
> > merely refactored the code with regards to the dst refcount so the issue
> > was present even before 92f1655aa2b22. The bug was introduced in
> > 54c1a859efd9f ("ipv6: Don't drop cache route entry unless timer actually
> > expired.") where the expired cached route is deleted, the sk_dst_cache
> > member of the socket is set to NULL by calling dst_negative_advice() but
> > the refcount belonging to the socket is left unbalanced.
> >
> > The IPv4 version - ipv4_negative_advice() - is not affected by this bug. A
> > nexthop exception is created and the dst associated with the socket fails
> > a check after the ICMPv6 packet indicating an MTU change has been
> > received. When the TCP connection times out ipv4_negative_advice() merely
> > resets the sk_dst_cache of the socket while decrementing the refcount of
> > the exception dst. Then, the expired nexthop exception is deleted along
> > with its routes in find_exception() while TCP tries to retransmit the same
> > packet again.
> >
> > Fixes: 92f1655aa2b22 ("net: fix __dst_negative_advice() race")
> > Fixes: 54c1a859efd9f ("ipv6: Don't drop cache route entry unless timer actually expired.")
> >
> > Signed-off-by: Jiri Wiesner <jwiesner@...e.de>
>
> The patch makes sense to me, but the changelog is very hard to follow,
I see. Thanks for the feedback.
> please:
> - move the testing code to a new, specific self-tests (possibly tuning
> the timeouts to a [much] shorter runtime)
I have been looking into turning the steps to reproduce the issue in a selftest script. I found out it is serious business and not something done in an hour, which made me question the need for creating a test for this issue. AFAIK, it is not common practice to supply a test script for every fix merged into the kernel. I could reduce the runtime to roughly 15 seconds but I think those 15 seconds would be wasted for testing an issue that may never come back unless someone decides to redesign sk_dst_reset() or rt6_remove_exception() in a rather disruptive way that would result in an unbalanced dst refcount again. I would much rather just drop the steps to reproduce the issue from the changelog. The commands themselves do not make the issue obvious unless someone runs them and uses a tracing approach to log what happens to dsts.
> - clearly separate the probe logs from the describing commenting text
> - try to be slightly a bit less verbose
>
> Additionally please solve a formal problem above: there should be no
> white lines in the tag area between the Fixes and SoB tags.
I will send another patch implementing these suggestions.
J.
Powered by blists - more mailing lists