Message-ID: <20210714081318.40500a1b@hermes.local>
Date: Wed, 14 Jul 2021 08:13:18 -0700
From: Stephen Hemminger <stephen@...workplumber.org>
To: netdev@...r.kernel.org
Subject: Fw: [Bug 213729] New: PMTUD failure with ECMP.
Begin forwarded message:
Date: Wed, 14 Jul 2021 13:43:51 +0000
From: bugzilla-daemon@...zilla.kernel.org
To: stephen@...workplumber.org
Subject: [Bug 213729] New: PMTUD failure with ECMP.
https://bugzilla.kernel.org/show_bug.cgi?id=213729
Bug ID: 213729
Summary: PMTUD failure with ECMP.
Product: Networking
Version: 2.5
Kernel Version: 5.13.0-rc5
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: IPV4
Assignee: stephen@...workplumber.org
Reporter: skappen@...sta.com
Regression: No
Created attachment 297849
--> https://bugzilla.kernel.org/attachment.cgi?id=297849&action=edit
Ecmp pmtud test setup
PMTUD failure with ECMP.
We have observed failures when PMTUD and ECMP are used together.
Ping fails through either gateway1 or gateway2 when sending packets larger
than 1500 bytes.
The issue has been tested and reproduced on CentOS 8 and mainline kernels.
Kernel versions:
[root@...alhost ~]# uname -a
Linux localhost.localdomain 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33
UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@...alhost skappen]# uname -a
Linux localhost.localdomain 5.13.0-rc5 #2 SMP Thu Jun 10 05:06:28 EDT 2021
x86_64 x86_64 x86_64 GNU/Linux
Static routes with ECMP are configured like this:
[root@...alhost skappen]# ip route
default proto static
nexthop via 192.168.0.11 dev enp0s3 weight 1
nexthop via 192.168.0.12 dev enp0s3 weight 1
192.168.0.0/24 dev enp0s3 proto kernel scope link src 192.168.0.4 metric 100
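For reference, a multipath default route like the one above can be configured
with iproute2 as follows (a minimal sketch; it assumes enp0s3 is up and
192.168.0.4/24 is already assigned, as in this setup):

```shell
# Static ECMP default route with two equal-weight nexthops
ip route add default proto static \
    nexthop via 192.168.0.11 dev enp0s3 weight 1 \
    nexthop via 192.168.0.12 dev enp0s3 weight 1
```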
So the host picks either the first or the second nexthop depending on ECMP's
hashing algorithm.
When pinging the destination with packets larger than 1500 bytes, PMTUD works
through the first gateway.
[root@...alhost skappen]# ping -s1700 10.0.3.17
PING 10.0.3.17 (10.0.3.17) 1700(1728) bytes of data.
From 192.168.0.11 icmp_seq=1 Frag needed and DF set (mtu = 1500)
1708 bytes from 10.0.3.17: icmp_seq=2 ttl=63 time=0.880 ms
1708 bytes from 10.0.3.17: icmp_seq=3 ttl=63 time=1.26 ms
^C
--- 10.0.3.17 ping statistics ---
3 packets transmitted, 2 received, +1 errors, 33.3333% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.880/1.067/1.255/0.190 ms
The MTU also gets cached for this route, as expected for Path MTU Discovery
(RFC 1191):
[root@...alhost skappen]# ip route get 10.0.3.17
10.0.3.17 via 192.168.0.11 dev enp0s3 src 192.168.0.4 uid 0
cache expires 540sec mtu 1500
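For anyone reproducing this, the per-destination cached PMTU exception can be
inspected and cleared between test runs with standard iproute2 commands (note
the flush drops all cached route exceptions, not just this one):

```shell
# Show the resolved route for the destination, including any cached PMTU
ip route get 10.0.3.17
# Remove all cached route exceptions so the next test starts clean
ip route flush cache
```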
[root@...alhost skappen]# tracepath -n 10.0.3.17
1?: [LOCALHOST] pmtu 1500
1: 192.168.0.11 1.475ms
1: 192.168.0.11 0.995ms
2: 192.168.0.11 1.075ms !H
Resume: pmtu 1500
However, when the second nexthop is picked, PMTUD breaks. In this example I
ping a second interface configured on the same destination host, from the same
client, using the same routes and gateways. Based on ECMP's hashing algorithm,
the host picks the second nexthop (192.168.0.12):
[root@...alhost skappen]# ping -s1700 10.0.3.18
PING 10.0.3.18 (10.0.3.18) 1700(1728) bytes of data.
From 192.168.0.12 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 192.168.0.12 icmp_seq=2 Frag needed and DF set (mtu = 1500)
From 192.168.0.12 icmp_seq=3 Frag needed and DF set (mtu = 1500)
^C
--- 10.0.3.18 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2062ms
[root@...alhost skappen]# ip route get 10.0.3.18
10.0.3.18 via 192.168.0.12 dev enp0s3 src 192.168.0.4 uid 0
cache
[root@...alhost skappen]# tracepath -n 10.0.3.18
1?: [LOCALHOST] pmtu 9000
1: 192.168.0.12 3.147ms
1: 192.168.0.12 0.696ms
2: 192.168.0.12 0.648ms pmtu 1500
2: 192.168.0.12 0.761ms !H
Resume: pmtu 1500
The ICMP "frag needed" message reaches the host, but in this case it is
ignored. The MTU for this route does not get cached either.
It looks like the MTU value is not properly updated on the route via the
second nexthop for some reason.
Test Case:
Create 2 networks: Internal, External
Create 4 virtual machines: Client, GW-1, GW-2, Destination
Client
configure 1 NIC to internal with MTU 9000
configure static route with ECMP to GW-1 and GW-2 internal address
GW-1, GW-2
configure 2 NICs
- to internal with MTU 9000
- to external MTU 1500
- enable IP forwarding (net.ipv4.ip_forward=1)
Destination
configure 1 NIC to external with MTU 1500
configure multiple IP addresses (say IP1, IP2, IP3, IP4) on the same
interface, so ECMP's hashing algorithm picks different routes
Test
ping from the client to the destination with packets larger than 1500 bytes
ping the other addresses of the destination so ECMP also uses the other route
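The setup above can be sketched as a single-host network-namespace script
(must run as root; all interface names and the external addresses are
illustrative assumptions, not taken from the original report):

```shell
#!/bin/sh
# Rough single-host reproduction of the topology: client -> {gw1, gw2} -> dst.
# Internal segment uses MTU 9000, external segment MTU 1500.
set -e

for ns in client gw1 gw2 dst; do ip netns add $ns; done
ip link add brint type bridge; ip link set brint up   # internal segment
ip link add brext type bridge; ip link set brext up   # external segment

# client/gw1/gw2 on the internal bridge, MTU 9000
for ns in client gw1 gw2; do
    ip link add i-$ns type veth peer name int netns $ns
    ip link set i-$ns master brint mtu 9000 up
    ip netns exec $ns ip link set int mtu 9000 up
done
# gw1/gw2/dst on the external bridge, default MTU 1500
for ns in gw1 gw2 dst; do
    ip link add e-$ns type veth peer name ext netns $ns
    ip link set e-$ns master brext up
    ip netns exec $ns ip link set ext up
done

ip netns exec client ip addr add 192.168.0.4/24 dev int
ip netns exec gw1 ip addr add 192.168.0.11/24 dev int
ip netns exec gw2 ip addr add 192.168.0.12/24 dev int
ip netns exec gw1 ip addr add 10.0.3.1/24 dev ext
ip netns exec gw2 ip addr add 10.0.3.2/24 dev ext
for i in 17 18 19 20; do ip netns exec dst ip addr add 10.0.3.$i/24 dev ext; done

for gw in gw1 gw2; do ip netns exec $gw sysctl -wq net.ipv4.ip_forward=1; done
ip netns exec dst ip route add 192.168.0.0/24 via 10.0.3.1   # return path

ip netns exec client ip route add default proto static \
    nexthop via 192.168.0.11 dev int weight 1 \
    nexthop via 192.168.0.12 dev int weight 1

# Then, from the client, ping each destination address with DF set:
#   ip netns exec client ping -M do -s 1700 10.0.3.17
```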
Results observed:
Through GW-1 PMTUD works: after the first "frag needed" message the MTU is
lowered on the client side for this destination. Through GW-2 PMTUD does not
work: all responses to ping are ICMP "frag needed" messages, which the kernel
does not obey. In all failure cases the MTU is not cached according to
"ip route get".