Message-ID: <2b4128d8-65fb-9d01-5158-8a79a7ffc257@gameservers.com>
Date: Wed, 9 Jan 2019 16:33:41 -0500
From: Brian Rak <brak@...eservers.com>
To: netdev@...r.kernel.org
Subject: Re: IPv6 neighbor discovery issues on 4.18 (and now 4.19)
On 8/31/2018 10:49 AM, Brian Rak wrote:
> We've upgraded a few machines to a 4.18.3 kernel and we're running
> into weird IPv6 neighbor discovery issues. Basically, the machines
> stop responding to inbound IPv6 neighbor solicitation requests, which
> very quickly breaks all IPv6 connectivity.
>
> It seems like the routing table gets confused:
>
> # ip -6 route get fe80::4e16:fc00:c7a0:7800 dev br0
> RTNETLINK answers: Network is unreachable
> # ping6 fe80::4e16:fc00:c7a0:7800 -I br0
> connect: Network is unreachable
> Yet:
>
> # ip -6 route | grep fe80 | grep br0
> fe80::/64 dev br0 proto kernel metric 256 pref medium
>
> fe80::4e16:fc00:c7a0:7800 is the link-local IP of the server's default
> gateway.
>
> In this case, br0 has a single adapter attached to it.
>
> I haven't been able to come up with any reproduction steps here; this
> seems to happen after a few days of uptime in our environment. The
> last known good release we have is 4.17.13.
>
> Any suggestions for troubleshooting this? Sometimes we see machines
> fix themselves, but we haven't been able to figure out what triggers
> the recovery.
>
So, we're still seeing this on 4.19.13. I've been investigating a
little further and have found a few more things.

Beyond the routing weirdness, the server fails to respond to incoming
IPv6 neighbor solicitations:
16:12:10.181769 IP6 fe80::629c:9fff:fe22:4b80 > ff02::1:ff00:33: ICMP6,
neighbor solicitation, who has 2001:x::33, length 32
But this IP is configured properly:
# ip -6 addr show dev br0
7: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
inet6 2001:x::33/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::ec4:7aff:fe88:c48c/64 scope link
valid_lft forever preferred_lft forever
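As a sanity check that this solicitation is really aimed at us: the
destination ff02::1:ff00:33 is the solicited-node multicast group for
2001:x::33 (per RFC 4291, ff02::1:ff plus the low 24 bits of the
unicast address). A small stand-alone demo, using 2001:db8::33 as a
stand-in for our redacted address:

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

int main(void)
{
	struct in6_addr uni, sol;
	char buf[INET6_ADDRSTRLEN];

	/* 2001:db8::33 stands in for the real (redacted) 2001:x::33 */
	inet_pton(AF_INET6, "2001:db8::33", &uni);

	/* Start from the solicited-node prefix ff02::1:ff00:0, then
	 * copy in the low 24 bits (bytes 13..15) of the unicast address.
	 */
	inet_pton(AF_INET6, "ff02::1:ff00:0", &sol);
	memcpy(&sol.s6_addr[13], &uni.s6_addr[13], 3);

	printf("%s\n", inet_ntop(AF_INET6, &sol, buf, sizeof(buf)));
	return 0;	/* prints: ff02::1:ff00:33 */
}

So the address and the multicast group match up, and the kernel should
be answering this NS.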
I found some instructions suggesting `perf` can show where packets are
getting dropped, so I tried:

# perf record -g -a -e skb:kfree_skb
# perf script

which turned up this seemingly relevant trace (along with a bunch of
other drops):
swapper 0 [037] 161501.062542: skb:kfree_skb: skbaddr=0xffff968771988600 protocol=34525 location=0xffffffff94796c6a
        ffffffff9468d50b kfree_skb+0x7b ([kernel.kallsyms])
        ffffffff94796c6a ndisc_send_skb+0x2fa ([kernel.kallsyms])
        ffffffff947975b4 ndisc_send_na+0x184 ([kernel.kallsyms])
        ffffffff94798143 ndisc_recv_ns+0x2f3 ([kernel.kallsyms])
        ffffffff94799b46 ndisc_rcv+0xe6 ([kernel.kallsyms])
        ffffffff947a1fa8 icmpv6_rcv+0x428 ([kernel.kallsyms])
        ffffffff9477bcd3 ip6_input_finish+0xf3 ([kernel.kallsyms])
        ffffffff9477c11f ip6_input+0x3f ([kernel.kallsyms])
        ffffffff9477c787 ip6_mc_input+0x97 ([kernel.kallsyms])
        ffffffff9477c0cc ip6_rcv_finish+0x7c ([kernel.kallsyms])
        ffffffff947d9fd2 ip_sabotage_in+0x42 ([kernel.kallsyms])
        ffffffff946f3822 nf_hook_slow+0x42 ([kernel.kallsyms])
        ffffffff9477c569 ipv6_rcv+0xc9 ([kernel.kallsyms])
        ffffffff946a5de7 __netif_receive_skb_one_core+0x57 ([kernel.kallsyms])
        ffffffff946a5e48 __netif_receive_skb+0x18 ([kernel.kallsyms])
        ffffffff946a5145 netif_receive_skb_internal+0x45 ([kernel.kallsyms])
        ffffffff946a520c netif_receive_skb+0x1c ([kernel.kallsyms])
        ffffffff947c7d03 br_netif_receive_skb+0x43 ([kernel.kallsyms])
        ffffffff947c7ded br_pass_frame_up+0xcd ([kernel.kallsyms])
        ffffffff947c80ca br_handle_frame_finish+0x24a ([kernel.kallsyms])
        ffffffff947dae0f br_nf_hook_thresh+0xdf ([kernel.kallsyms])
        ffffffff947dbf19 br_nf_pre_routing_finish_ipv6+0x109 ([kernel.kallsyms])
        ffffffff947dc39a br_nf_pre_routing_ipv6+0xfa ([kernel.kallsyms])
        ffffffff947dbbe9 br_nf_pre_routing+0x1c9 ([kernel.kallsyms])
        ffffffff946f3822 nf_hook_slow+0x42 ([kernel.kallsyms])
        ffffffff947c850f br_handle_frame+0x1ef ([kernel.kallsyms])
        ffffffff946a5471 __netif_receive_skb_core+0x211 ([kernel.kallsyms])
        ffffffff946a5dcb __netif_receive_skb_one_core+0x3b ([kernel.kallsyms])
        ffffffff946a5e48 __netif_receive_skb+0x18 ([kernel.kallsyms])
        ffffffff946a5145 netif_receive_skb_internal+0x45 ([kernel.kallsyms])
        ffffffff946a6fb0 napi_gro_receive+0xd0 ([kernel.kallsyms])
        ffffffffc05c319f ixgbe_clean_rx_irq+0x46f ([kernel.kallsyms])
        ffffffffc05c4610 ixgbe_poll+0x280 ([kernel.kallsyms])
        ffffffff946a6729 net_rx_action+0x289 ([kernel.kallsyms])
        ffffffff94c000d1 __softirqentry_text_start+0xd1 ([kernel.kallsyms])
        ffffffff94075108 irq_exit+0xe8 ([kernel.kallsyms])
        ffffffff94a01a69 do_IRQ+0x59 ([kernel.kallsyms])
        ffffffff94a0098f ret_from_intr+0x0 ([kernel.kallsyms])
        ffffffff9464e01d cpuidle_enter_state+0xbd ([kernel.kallsyms])
        ffffffff9464e287 cpuidle_enter+0x17 ([kernel.kallsyms])
        ffffffff940a3cd3 call_cpuidle+0x23 ([kernel.kallsyms])
        ffffffff940a3f78 do_idle+0x1c8 ([kernel.kallsyms])
        ffffffff940a4203 cpu_startup_entry+0x73 ([kernel.kallsyms])
        ffffffff9403fade start_secondary+0x1ae ([kernel.kallsyms])
        ffffffff940000d4 secondary_startup_64+0xa4 ([kernel.kallsyms])
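Two things worth noting about that trace: the location field
(0xffffffff94796c6a) is exactly the address of the ndisc_send_skb+0x2fa
frame, so the kfree_skb call site is inside ndisc_send_skb itself, and
protocol=34525 is just ETH_P_IPV6 in decimal, so these really are the
IPv6 skbs. A trivial check of the protocol value (nothing below comes
from the trace except the constant):

#include <stdio.h>
#include <linux/if_ether.h>

int main(void)
{
	/* skb:kfree_skb prints the skb's L3 protocol in decimal */
	printf("ETH_P_IPV6 = %d (0x%04x)\n", ETH_P_IPV6, ETH_P_IPV6);
	return 0;	/* prints: ETH_P_IPV6 = 34525 (0x86dd) */
}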
However, I can't determine why this is failing. As far as I can tell,
the only way to hit kfree_skb within ndisc_send_skb is if
icmp6_dst_alloc fails. Is that right?
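For reference, here's the error path I believe we're hitting,
paraphrased from memory from 4.19's net/ipv6/ndisc.c (not a verbatim
quote; check your exact tree):

/* Paraphrased excerpt of ndisc_send_skb(), net/ipv6/ndisc.c (4.19-era).
 * The only direct kfree_skb() in this function is on the path where no
 * dst is attached yet and icmp6_dst_alloc() returns an error.
 */
static void ndisc_send_skb(struct sk_buff *skb,
			   const struct in6_addr *daddr,
			   const struct in6_addr *saddr)
{
	struct dst_entry *dst = skb_dst(skb);
	...
	if (!dst) {
		struct flowi6 fl6;
		int oif = skb->dev->ifindex;

		icmpv6_flow_init(sk, &fl6, type, saddr, daddr, oif);
		dst = icmp6_dst_alloc(skb->dev, &fl6);
		if (IS_ERR(dst)) {
			kfree_skb(skb);	/* the drop seen in the trace */
			return;
		}
		skb_dst_set(skb, dst);
	}
	...
}

If I remember right, icmp6_dst_alloc() returns ERR_PTR(-ENOMEM) when
its route allocation fails and otherwise returns whatever the xfrm
lookup hands back, so attaching a kretprobe to icmp6_dst_alloc
(assuming it isn't inlined in your build) should show which error is
coming back.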