Message-ID: <CAJ9FNxvx_74Yc6yEkv9ENGHCTLrCd7ZrrecNKuwURFsA0odbpg@mail.gmail.com>
Date: Tue, 20 Jul 2021 04:59:36 +0530
From: Sam Kappen <skappen@...sta.com>
To: David Ahern <dsahern@...il.com>
Cc: Vadim Fedorenko <vfedorenko@...ek.ru>,
Ido Schimmel <idosch@...sch.org>,
Stephen Hemminger <stephen@...workplumber.org>,
netdev@...r.kernel.org
Subject: Re: Fw: [Bug 213729] New: PMTUD failure with ECMP.

On Wed, Jul 14, 2021 at 11:42 PM David Ahern <dsahern@...il.com> wrote:
>
> On 7/14/21 11:51 AM, Vadim Fedorenko wrote:
> > On 14.07.2021 17:30, Ido Schimmel wrote:
> >> On Wed, Jul 14, 2021 at 05:11:45PM +0100, Vadim Fedorenko wrote:
> >>> On 14.07.2021 16:13, Stephen Hemminger wrote:
> >>>>
> >>>>
> >>>> Begin forwarded message:
> >>>>
> >>>> Date: Wed, 14 Jul 2021 13:43:51 +0000
> >>>> From: bugzilla-daemon@...zilla.kernel.org
> >>>> To: stephen@...workplumber.org
> >>>> Subject: [Bug 213729] New: PMTUD failure with ECMP.
> >>>>
> >>>>
> >>>> https://bugzilla.kernel.org/show_bug.cgi?id=213729
> >>>>
> >>>> Bug ID: 213729
> >>>> Summary: PMTUD failure with ECMP.
> >>>> Product: Networking
> >>>> Version: 2.5
> >>>> Kernel Version: 5.13.0-rc5
> >>>> Hardware: x86-64
> >>>> OS: Linux
> >>>> Tree: Mainline
> >>>> Status: NEW
> >>>> Severity: normal
> >>>> Priority: P1
> >>>> Component: IPV4
> >>>> Assignee: stephen@...workplumber.org
> >>>> Reporter: skappen@...sta.com
> >>>> Regression: No
> >>>>
> >>>> Created attachment 297849
> >>>> -->
> >>>> https://bugzilla.kernel.org/attachment.cgi?id=297849&action=edit
> >>>> Ecmp pmtud test setup
> >>>>
> >>>> PMTUD failure with ECMP.
> >>>>
> >>>> We have observed failures when PMTUD and ECMP are used together.
> >>>> Ping fails through either gateway1 or gateway2 when the packet size
> >>>> is greater than 1500.
> >>>> The issue has been tested and reproduced on CentOS 8 and mainline
> >>>> kernels.
> >>>>
> >>>>
> >>>> Kernel versions:
> >>>> [root@...alhost ~]# uname -a
> >>>> Linux localhost.localdomain 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> >>>>
> >>>> [root@...alhost skappen]# uname -a
> >>>> Linux localhost.localdomain 5.13.0-rc5 #2 SMP Thu Jun 10 05:06:28 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
> >>>>
> >>>>
> >>>> Static routes with ECMP are configured like this:
> >>>>
> >>>> [root@...alhost skappen]#ip route
> >>>> default proto static
> >>>> nexthop via 192.168.0.11 dev enp0s3 weight 1
> >>>> nexthop via 192.168.0.12 dev enp0s3 weight 1
> >>>> 192.168.0.0/24 dev enp0s3 proto kernel scope link src 192.168.0.4 metric 100
> >>>>
> >>>> So the host picks the first or the second nexthop depending on
> >>>> ECMP's hashing algorithm.
> >>>>
> >>>> When pinging the destination with a packet size greater than 1500,
> >>>> it works through the first gateway.
> >>>>
> >>>> [root@...alhost skappen]# ping -s1700 10.0.3.17
> >>>> PING 10.0.3.17 (10.0.3.17) 1700(1728) bytes of data.
> >>>> From 192.168.0.11 icmp_seq=1 Frag needed and DF set (mtu = 1500)
> >>>> 1708 bytes from 10.0.3.17: icmp_seq=2 ttl=63 time=0.880 ms
> >>>> 1708 bytes from 10.0.3.17: icmp_seq=3 ttl=63 time=1.26 ms
> >>>> ^C
> >>>> --- 10.0.3.17 ping statistics ---
> >>>> 3 packets transmitted, 2 received, +1 errors, 33.3333% packet loss, time 2003ms
> >>>> rtt min/avg/max/mdev = 0.880/1.067/1.255/0.190 ms
> >>>>
> >>>> The MTU also gets cached for this route as per rfc6754:
> >>>>
> >>>> [root@...alhost skappen]# ip route get 10.0.3.17
> >>>> 10.0.3.17 via 192.168.0.11 dev enp0s3 src 192.168.0.4 uid 0
> >>>> cache expires 540sec mtu 1500
> >>>>
> >>>> [root@...alhost skappen]# tracepath -n 10.0.3.17
> >>>> 1?: [LOCALHOST] pmtu 1500
> >>>> 1: 192.168.0.11 1.475ms
> >>>> 1: 192.168.0.11 0.995ms
> >>>> 2: 192.168.0.11 1.075ms !H
> >>>> Resume: pmtu 1500
> >>>>
> >>>> However, when the second nexthop is picked, PMTUD breaks. In this
> >>>> example I ping a second interface configured on the same destination
> >>>> from the same host, using the same routes and gateways. Based on
> >>>> ECMP's hashing algorithm this host would pick the second nexthop
> >>>> (.12):
> >>>>
> >>>> [root@...alhost skappen]# ping -s1700 10.0.3.18
> >>>> PING 10.0.3.18 (10.0.3.18) 1700(1728) bytes of data.
> >>>> From 192.168.0.12 icmp_seq=1 Frag needed and DF set (mtu = 1500)
> >>>> From 192.168.0.12 icmp_seq=2 Frag needed and DF set (mtu = 1500)
> >>>> From 192.168.0.12 icmp_seq=3 Frag needed and DF set (mtu = 1500)
> >>>> ^C
> >>>> --- 10.0.3.18 ping statistics ---
> >>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2062ms
> >>>> [root@...alhost skappen]# ip route get 10.0.3.18
> >>>> 10.0.3.18 via 192.168.0.12 dev enp0s3 src 192.168.0.4 uid 0
> >>>> cache
> >>>>
> >>>> [root@...alhost skappen]# tracepath -n 10.0.3.18
> >>>> 1?: [LOCALHOST] pmtu 9000
> >>>> 1: 192.168.0.12 3.147ms
> >>>> 1: 192.168.0.12 0.696ms
> >>>> 2: 192.168.0.12 0.648ms
> >>>> pmtu 1500
> >>>> 2: 192.168.0.12 0.761ms !H
> >>>> Resume: pmtu 1500
> >>>>
> >>>> The ICMP frag needed reaches the host, but in this case it is ignored.
> >>>> The MTU for this route does not get cached either.
> >>>>
> >>>>
> >>>> It looks like the MTU value from the next hop is not properly
> >>>> updated for some reason.
> >>>>
> >>>>
> >>>> Test Case:
> >>>> Create 2 networks: Internal, External
> >>>> Create 4 virtual machines: Client, GW-1, GW-2, Destination
> >>>>
> >>>> Client
> >>>> configure 1 NIC to internal with MTU 9000
> >>>> configure static route with ECMP to GW-1 and GW-2 internal address
> >>>>
> >>>> GW-1, GW-2
> >>>> configure 2 NICs
> >>>> - to internal with MTU 9000
> >>>> - to external MTU 1500
> >>>> - enable ip_forward
> >>>> - enable packet forward
> >>>>
> >>>> Target
> >>>> configure 1 NIC to external with MTU 1500
> >>>> configure multiple IP addresses (say IP1, IP2, IP3, IP4) on the same
> >>>> interface, so ECMP's hashing algorithm would pick different routes
> >>>>
> >>>> Test
> >>>> ping from client to target with a packet size larger than 1500 bytes
> >>>> ping the other addresses of the target so ECMP would use the other
> >>>> route too
> >>>>
> >>>> Results observed:
> >>>> Through GW-1, PMTUD works: after the first frag needed message the
> >>>> MTU is lowered on the client side for this target. Through GW-2,
> >>>> PMTUD does not work: all responses to ping are ICMP frag needed,
> >>>> which are not obeyed by the kernel.
> >>>> In all failure cases the MTU is not cached, as shown by "ip route get".
> >>>>
> >>> Looks like I'm already in the context of PMTU, and I'm also working
> >>> on implementing several new test cases for the pmtu.sh test, so I
> >>> will take care of this one too
> >>
> >> Thanks
> >>
> >> There was a similar report from around a year ago that might give you
> >> more info:
> >>
> >> https://lore.kernel.org/netdev/CANXY5y+iuzMg+4UdkPJW_Efun30KAPL1+h2S7HeSPp4zOrVC7g@mail.gmail.com/
> >>
> >>
> >
> > Thanks Ido, will definitely look at it!
>
> I believe that one is fixed by 2fbc6e89b2f1403189e624cabaf73e189c5e50c6
>
> The root cause of this problem is the ICMP taking a path that the
> original packet did not, i.e., the ICMP is received on device 1 and the
> exception is created on that device but Rx chooses device 2 (a different
> leg in the ECMP).
The actual test was carried out on a 5.13.0-rc5 kernel, and I also tested
a 5.14-rc1 kernel. The issue is still reproducible on both.
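
For anyone re-running the test, here is a minimal sketch of a check for
whether a PMTU exception got cached. It only parses "ip route get" output
of the form shown in the report. The helper name cached_mtu is
hypothetical, and it assumes a learned PMTU shows up as an "mtu N" token
in that output, as it does in the working GW-1 case.

```shell
#!/bin/sh
# Hypothetical helper (not part of iproute2): read `ip route get` output on
# stdin, print the value following the first "mtu" token, and return 0.
# Return 1 and print nothing if no "mtu N" token is present.
cached_mtu() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "mtu" && !found) {
                print $(i + 1)
                found = 1
            }
    }
    END { exit (found ? 0 : 1) }'
}
```

On the client this would be used as "ip route get 10.0.3.18 | cached_mtu".
Fed the GW-2 output quoted above it prints nothing and returns non-zero,
since that output has a bare "cache" line with no mtu value.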