netdev - Re: IPv6 neighbor discovery issues on 4.18 (and now 4.19)

Message-ID: <de54e925-9536-f2cc-7b89-7205b3fb2c18@gameservers.com>
Date:   Fri, 11 Jan 2019 13:19:54 -0500
From:   Brian Rak <brak@...eservers.com>
To:     netdev@...r.kernel.org
Subject: Re: IPv6 neighbor discovery issues on 4.18 (and now 4.19)


On 1/9/2019 4:33 PM, Brian Rak wrote:
>
> On 8/31/2018 10:49 AM, Brian Rak wrote:
>> We've upgraded a few machines to a 4.18.3 kernel and we're running 
>> into weird IPv6 neighbor discovery issues.  Basically, the machines 
>> stop responding to inbound IPv6 neighbor solicitation requests, which 
>> very quickly breaks all IPv6 connectivity.
>>
>> It seems like the routing table gets confused:
>>
>> # ip -6 route get fe80::4e16:fc00:c7a0:7800 dev br0
>> RTNETLINK answers: Network is unreachable
>> # ping6 fe80::4e16:fc00:c7a0:7800 -I br0
>> connect: Network is unreachable
>> yet
>>
>> # ip -6 route | grep fe80 | grep br0
>> fe80::/64 dev br0 proto kernel metric 256 pref medium
>>
>> fe80::4e16:fc00:c7a0:7800 is the link-local IP of the server's 
>> default gateway.
>>
>> In this case, br0 has a single adapter attached to it.
>>
>> I haven't been able to come up with any sort of reproduction steps 
>> here, this seems to happen after a few days of uptime in our 
>> environment.  The last known good release we have here is 4.17.13.
>>
>> Any suggestions for troubleshooting this?  Sometimes we see machines 
>> fix themselves, but we haven't been able to figure out what's 
>> happening that helps.
>>
> So, we're still seeing this on 4.19.13.  I've been investigating this 
> a little further and have discovered a few more things:
>
> The server also fails to respond to IPv6 neighbor discovery requests:
>
> 16:12:10.181769 IP6 fe80::629c:9fff:fe22:4b80 > ff02::1:ff00:33: ICMP6, neighbor solicitation, who has 2001:x::33, length 32
>
> But this IP is configured properly:
>
> # ip -6 addr show dev br0
> 7: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
>     inet6 2001:x::33/64 scope global
>        valid_lft forever preferred_lft forever
>     inet6 fe80::ec4:7aff:fe88:c48c/64 scope link
>        valid_lft forever preferred_lft forever
>
> I found some instructions that suggest using `perf` to determine where 
> packets are getting dropped, so I ran `perf record -g -a -e 
> skb:kfree_skb` followed by `perf script`, which showed me this 
> seemingly relevant drop (among a bunch of others):
>
> swapper     0 [037] 161501.062542: skb:kfree_skb: skbaddr=0xffff968771988600 protocol=34525 location=0xffffffff94796c6a
>         ffffffff9468d50b kfree_skb+0x7b ([kernel.kallsyms])
>         ffffffff94796c6a ndisc_send_skb+0x2fa ([kernel.kallsyms])
>         ffffffff947975b4 ndisc_send_na+0x184 ([kernel.kallsyms])
>         ffffffff94798143 ndisc_recv_ns+0x2f3 ([kernel.kallsyms])
>         ffffffff94799b46 ndisc_rcv+0xe6 ([kernel.kallsyms])
>         ffffffff947a1fa8 icmpv6_rcv+0x428 ([kernel.kallsyms])
>         ffffffff9477bcd3 ip6_input_finish+0xf3 ([kernel.kallsyms])
>         ffffffff9477c11f ip6_input+0x3f ([kernel.kallsyms])
>         ffffffff9477c787 ip6_mc_input+0x97 ([kernel.kallsyms])
>         ffffffff9477c0cc ip6_rcv_finish+0x7c ([kernel.kallsyms])
>         ffffffff947d9fd2 ip_sabotage_in+0x42 ([kernel.kallsyms])
>         ffffffff946f3822 nf_hook_slow+0x42 ([kernel.kallsyms])
>         ffffffff9477c569 ipv6_rcv+0xc9 ([kernel.kallsyms])
>         ffffffff946a5de7 __netif_receive_skb_one_core+0x57 ([kernel.kallsyms])
>         ffffffff946a5e48 __netif_receive_skb+0x18 ([kernel.kallsyms])
>         ffffffff946a5145 netif_receive_skb_internal+0x45 ([kernel.kallsyms])
>         ffffffff946a520c netif_receive_skb+0x1c ([kernel.kallsyms])
>         ffffffff947c7d03 br_netif_receive_skb+0x43 ([kernel.kallsyms])
>         ffffffff947c7ded br_pass_frame_up+0xcd ([kernel.kallsyms])
>         ffffffff947c80ca br_handle_frame_finish+0x24a ([kernel.kallsyms])
>         ffffffff947dae0f br_nf_hook_thresh+0xdf ([kernel.kallsyms])
>         ffffffff947dbf19 br_nf_pre_routing_finish_ipv6+0x109 ([kernel.kallsyms])
>         ffffffff947dc39a br_nf_pre_routing_ipv6+0xfa ([kernel.kallsyms])
>         ffffffff947dbbe9 br_nf_pre_routing+0x1c9 ([kernel.kallsyms])
>         ffffffff946f3822 nf_hook_slow+0x42 ([kernel.kallsyms])
>         ffffffff947c850f br_handle_frame+0x1ef ([kernel.kallsyms])
>         ffffffff946a5471 __netif_receive_skb_core+0x211 ([kernel.kallsyms])
>         ffffffff946a5dcb __netif_receive_skb_one_core+0x3b ([kernel.kallsyms])
>         ffffffff946a5e48 __netif_receive_skb+0x18 ([kernel.kallsyms])
>         ffffffff946a5145 netif_receive_skb_internal+0x45 ([kernel.kallsyms])
>         ffffffff946a6fb0 napi_gro_receive+0xd0 ([kernel.kallsyms])
>         ffffffffc05c319f ixgbe_clean_rx_irq+0x46f ([kernel.kallsyms])
>         ffffffffc05c4610 ixgbe_poll+0x280 ([kernel.kallsyms])
>         ffffffff946a6729 net_rx_action+0x289 ([kernel.kallsyms])
>         ffffffff94c000d1 __softirqentry_text_start+0xd1 ([kernel.kallsyms])
>         ffffffff94075108 irq_exit+0xe8 ([kernel.kallsyms])
>         ffffffff94a01a69 do_IRQ+0x59 ([kernel.kallsyms])
>         ffffffff94a0098f ret_from_intr+0x0 ([kernel.kallsyms])
>         ffffffff9464e01d cpuidle_enter_state+0xbd ([kernel.kallsyms])
>         ffffffff9464e287 cpuidle_enter+0x17 ([kernel.kallsyms])
>         ffffffff940a3cd3 call_cpuidle+0x23 ([kernel.kallsyms])
>         ffffffff940a3f78 do_idle+0x1c8 ([kernel.kallsyms])
>         ffffffff940a4203 cpu_startup_entry+0x73 ([kernel.kallsyms])
>         ffffffff9403fade start_secondary+0x1ae ([kernel.kallsyms])
>         ffffffff940000d4 secondary_startup_64+0xa4 ([kernel.kallsyms])
>
> However, I can't seem to determine why this is failing.  It seems like 
> the only way to hit kfree_skb within ndisc_send_skb would be if 
> icmp6_dst_alloc fails?


So, I applied a dumb patch to log failures:

diff -baur linux-4.19.13/net/ipv6/ndisc.c linux-4.19.13-dirty/net/ipv6/ndisc.c
--- linux-4.19.13/net/ipv6/ndisc.c    2018-12-29 07:37:59.000000000 -0500
+++ linux-4.19.13-dirty/net/ipv6/ndisc.c    2019-01-09 16:37:59.140042846 -0500
@@ -470,6 +470,7 @@
          icmpv6_flow_init(sk, &fl6, type, saddr, daddr, oif);
          dst = icmp6_dst_alloc(skb->dev, &fl6);
          if (IS_ERR(dst)) {
+            net_warn_ratelimited("Dropping ndisc response due to icmp6_dst_alloc failure: %ld\n", PTR_ERR(dst));
              kfree_skb(skb);
              return;
          }

Which ends up producing a bunch of this:

[73531.594663] ICMPv6: Dropping ndisc response due to icmp6_dst_alloc failure: -12
[73532.361678] ICMPv6: Dropping ndisc response due to icmp6_dst_alloc failure: -12
[73533.319860] ICMPv6: Dropping ndisc response due to icmp6_dst_alloc failure: -12
[73534.089759] ICMPv6: Dropping ndisc response due to icmp6_dst_alloc failure: -12

That seems to be ENOMEM, which suggests that dst_alloc is failing 
somehow (as ip6_dst_alloc looks to be a simple wrapper around dst_alloc).
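
As a quick sanity check on the -12 reading, here is a minimal Python sketch that maps a negative kernel return value to its errno name. It assumes it is run on a Linux host, where the userspace errno numbering matches the kernel's for these basic codes; `describe_errno` is just a helper name made up for this example.

```python
import errno
import os


def describe_errno(ret: int) -> str:
    """Map a negative kernel-style return value (e.g. -12) to its errno name.

    Assumption: run on Linux, where errno 12 is ENOMEM in userspace too.
    """
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


# The value logged by the patched ndisc_send_skb() above.
print(describe_errno(-12))
```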

If I look at `trace-cmd record -p function -l ip6_dst_gc`, I see that 
this function is getting called about once a second.

I have net.ipv6.route.max_size=4096, and the machine only has 376 routes 
(counted with `ip -6 route | wc -l`).  However, raising this sysctl to 
65k seems to instantly fix IPv6 (I'm not sure yet whether this is a 
permanent fix).

Does this indicate that the machine is leaking IPv6 dst_entry objects? 
How would I determine what is leaking?

This is from shortly after raising the max_size:

# cat /proc/net/rt6_stats
02b9 015f 13e597 04ab 0000 1031 0b3c
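
The values in that file are hexadecimal, so here is a rough Python sketch for decoding the line above. The field names and their order are an assumption on my part, taken from my reading of rt6_stats_seq_show() in net/ipv6/route.c for this kernel series; they are not documented in this thread and should be checked against the running kernel's source. `parse_rt6_stats` and `OLD_MAX_SIZE` are names made up for the example.

```python
from typing import Dict

# ASSUMED field order (from reading rt6_stats_seq_show() in net/ipv6/route.c);
# verify against the source of the kernel actually running.
RT6_STATS_FIELDS = (
    "fib_nodes",
    "fib_route_nodes",
    "fib_rt_alloc",
    "fib_rt_entries",
    "fib_rt_cache",
    "dst_entries",
    "fib_discarded_routes",
)


def parse_rt6_stats(line: str) -> Dict[str, int]:
    """Decode one line of /proc/net/rt6_stats (space-separated hex values)."""
    values = [int(token, 16) for token in line.split()]
    if len(values) != len(RT6_STATS_FIELDS):
        raise ValueError(f"expected {len(RT6_STATS_FIELDS)} fields, got {len(values)}")
    return dict(zip(RT6_STATS_FIELDS, values))


# The line captured shortly after raising max_size.
stats = parse_rt6_stats("02b9 015f 13e597 04ab 0000 1031 0b3c")
for name, value in stats.items():
    print(f"{name:22} {value}")

# The sysctl value that was in effect before it was raised.
OLD_MAX_SIZE = 4096
print("dst_entries above old max_size:", stats["dst_entries"] > OLD_MAX_SIZE)
```

If that field layout is right, the sixth value (0x1031) decodes to 4145 allocated dst entries, just above the old max_size of 4096, which would be consistent with the dst entry count rather than the number of routes hitting the limit.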