Date:   Wed, 14 Jul 2021 18:51:42 +0100
From:   Vadim Fedorenko <vfedorenko@...ek.ru>
To:     Ido Schimmel <idosch@...sch.org>
Cc:     Stephen Hemminger <stephen@...workplumber.org>,
        netdev@...r.kernel.org
Subject: Re: Fw: [Bug 213729] New: PMTUD failure with ECMP.

On 14.07.2021 17:30, Ido Schimmel wrote:
> On Wed, Jul 14, 2021 at 05:11:45PM +0100, Vadim Fedorenko wrote:
>> On 14.07.2021 16:13, Stephen Hemminger wrote:
>>>
>>>
>>> Begin forwarded message:
>>>
>>> Date: Wed, 14 Jul 2021 13:43:51 +0000
>>> From: bugzilla-daemon@...zilla.kernel.org
>>> To: stephen@...workplumber.org
>>> Subject: [Bug 213729] New: PMTUD failure with ECMP.
>>>
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=213729
>>>
>>>               Bug ID: 213729
>>>              Summary: PMTUD failure with ECMP.
>>>              Product: Networking
>>>              Version: 2.5
>>>       Kernel Version: 5.13.0-rc5
>>>             Hardware: x86-64
>>>                   OS: Linux
>>>                 Tree: Mainline
>>>               Status: NEW
>>>             Severity: normal
>>>             Priority: P1
>>>            Component: IPV4
>>>             Assignee: stephen@...workplumber.org
>>>             Reporter: skappen@...sta.com
>>>           Regression: No
>>>
>>> Created attachment 297849
>>>     --> https://bugzilla.kernel.org/attachment.cgi?id=297849&action=edit
>>> Ecmp pmtud test setup
>>>
>>> PMTUD failure with ECMP.
>>>
>>> We have observed failures when PMTUD and ECMP work together.
>>> Ping fails through one of the gateways (gateway1 or gateway2) when sending
>>> packets larger than 1500 bytes.
>>> The issue has been tested and reproduced on CentOS 8 and mainline kernels.
>>>
>>>
>>> Kernel versions:
>>> [root@...alhost ~]# uname -a
>>> Linux localhost.localdomain 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33
>>> UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> [root@...alhost skappen]# uname -a
>>> Linux localhost.localdomain 5.13.0-rc5 #2 SMP Thu Jun 10 05:06:28 EDT 2021
>>> x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>
>>> Static routes with ECMP are configured like this:
>>>
>>> [root@...alhost skappen]#ip route
>>> default proto static
>>>           nexthop via 192.168.0.11 dev enp0s3 weight 1
>>>           nexthop via 192.168.0.12 dev enp0s3 weight 1
>>> 192.168.0.0/24 dev enp0s3 proto kernel scope link src 192.168.0.4 metric 100
>>>
>>> So the host would pick the first or the second nexthop depending on ECMP's
>>> hashing algorithm.
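>>>
>>> For reference, a multipath route like the one above can be created with
>>> something along these lines (the exact commands used were not captured for
>>> this report, this is just an illustration):
>>>
>>> ip route add default proto static \
>>>         nexthop via 192.168.0.11 dev enp0s3 weight 1 \
>>>         nexthop via 192.168.0.12 dev enp0s3 weight 1
>>>
>>> Which nexthop a given flow hashes onto is controlled by
>>> net.ipv4.fib_multipath_hash_policy (0 = L3 header hash, 1 = L4 header hash).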
>>>
>>> When pinging the destination with packets larger than 1500 bytes, it works
>>> through the first gateway.
>>>
>>> [root@...alhost skappen]# ping -s1700 10.0.3.17
>>> PING 10.0.3.17 (10.0.3.17) 1700(1728) bytes of data.
>>>   From 192.168.0.11 icmp_seq=1 Frag needed and DF set (mtu = 1500)
>>> 1708 bytes from 10.0.3.17: icmp_seq=2 ttl=63 time=0.880 ms
>>> 1708 bytes from 10.0.3.17: icmp_seq=3 ttl=63 time=1.26 ms
>>> ^C
>>> --- 10.0.3.17 ping statistics ---
>>> 3 packets transmitted, 2 received, +1 errors, 33.3333% packet loss, time 2003ms
>>> rtt min/avg/max/mdev = 0.880/1.067/1.255/0.190 ms
>>>
>>> The MTU also gets cached for this route, as per Path MTU Discovery (RFC 1191):
>>>
>>> [root@...alhost skappen]# ip route get 10.0.3.17
>>> 10.0.3.17 via 192.168.0.11 dev enp0s3 src 192.168.0.4 uid 0
>>>       cache expires 540sec mtu 1500
>>>
>>> [root@...alhost skappen]# tracepath -n 10.0.3.17
>>>    1?: [LOCALHOST]                      pmtu 1500
>>>    1:  192.168.0.11                                          1.475ms
>>>    1:  192.168.0.11                                          0.995ms
>>>    2:  192.168.0.11                                          1.075ms !H
>>>        Resume: pmtu 1500
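>>>
>>> (If needed, the learned exception can be cleared with "ip route flush cache"
>>> to retest from a clean state.)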
>>>
>>> However, when the second nexthop is picked, PMTUD breaks. In this example I ping
>>> a second interface configured on the same destination
>>> from the same host, using the same routes and gateways. Based on ECMP's hashing
>>> algorithm, the host would pick the second nexthop (192.168.0.12):
>>>
>>> [root@...alhost skappen]# ping -s1700 10.0.3.18
>>> PING 10.0.3.18 (10.0.3.18) 1700(1728) bytes of data.
>>>   From 192.168.0.12 icmp_seq=1 Frag needed and DF set (mtu = 1500)
>>>   From 192.168.0.12 icmp_seq=2 Frag needed and DF set (mtu = 1500)
>>>   From 192.168.0.12 icmp_seq=3 Frag needed and DF set (mtu = 1500)
>>> ^C
>>> --- 10.0.3.18 ping statistics ---
>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2062ms
>>> [root@...alhost skappen]# ip route get 10.0.3.18
>>> 10.0.3.18 via 192.168.0.12 dev enp0s3 src 192.168.0.4 uid 0
>>>       cache
>>>
>>> [root@...alhost skappen]# tracepath -n 10.0.3.18
>>>    1?: [LOCALHOST]                      pmtu 9000
>>>    1:  192.168.0.12                                          3.147ms
>>>    1:  192.168.0.12                                          0.696ms
>>>    2:  192.168.0.12                                          0.648ms pmtu 1500
>>>    2:  192.168.0.12                                          0.761ms !H
>>>        Resume: pmtu 1500
>>>
>>> The ICMP "frag needed" message reaches the host, but in this case it is ignored.
>>> The MTU for this route does not get cached either.
>>>
>>>
>>> It looks like the MTU value learned via the second nexthop is not properly
>>> recorded for some reason.
>>>
>>>
>>> Test Case:
>>> Create 2 networks: Internal, External
>>> Create 4 virtual machines: Client, GW-1, GW-2, Destination
>>>
>>> Client
>>> configure 1 NIC to internal with MTU 9000
>>> configure a static route with ECMP via the internal addresses of GW-1 and GW-2
>>>
>>> GW-1, GW-2
>>> configure 2 NICs
>>> - to internal with MTU 9000
>>> - to external with MTU 1500
>>> - enable ip_forward
>>> - enable packet forwarding
>>>
>>> Target
>>> configure 1 NIC to external with MTU 1500
>>> configure multiple IP addresses (say IP1, IP2, IP3, IP4) on the same interface,
>>> so that ECMP's hashing algorithm would pick different nexthops
>>>
>>> Test
>>> ping from the client to the target with packets larger than 1500 bytes
>>> ping the other addresses of the target so that ECMP also picks the other nexthop
>>> (a rough sketch of the setup commands follows below)
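>>>
>>> A rough sketch of the setup commands (interface names are placeholders here,
>>> and the addresses just follow the ones used above):
>>>
>>> # Client: jumbo MTU on the internal NIC plus the ECMP route shown earlier
>>> ip link set dev enp0s3 mtu 9000
>>>
>>> # GW-1 / GW-2: internal NIC at 9000, external NIC at 1500, forwarding on
>>> ip link set dev <internal-if> mtu 9000
>>> ip link set dev <external-if> mtu 1500
>>> sysctl -w net.ipv4.ip_forward=1
>>>
>>> # Destination: external NIC at 1500 with several addresses on it
>>> ip link set dev <external-if> mtu 1500
>>> ip address add 10.0.3.17/24 dev <external-if>
>>> ip address add 10.0.3.18/24 dev <external-if>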
>>>
>>> Results observed:
>>> Through GW-1 PMTUD works: after the first "frag needed" message the MTU is
>>> lowered on the client side for this target. Through GW-2 PMTUD does not work:
>>> all responses to ping are ICMP "frag needed" errors, which are not obeyed by
>>> the kernel. In all failure cases the MTU is not cached in "ip route get".
>>>
>> Looks like I'm already in the context of PMTU code, and I'm also working on
>> implementing several new test cases for the pmtu.sh selftest, so I will take
>> care of this one too.
> 
> Thanks
> 
> There was a similar report from around a year ago that might give you
> more info:
> 
> https://lore.kernel.org/netdev/CANXY5y+iuzMg+4UdkPJW_Efun30KAPL1+h2S7HeSPp4zOrVC7g@mail.gmail.com/
> 

Thanks Ido, will definitely look at it!
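
As a starting point, the per-nexthop exception handling for IPv4 lives in
net/ipv4/route.c (the fnhe code), so that's where I plan to dig first. For the
selftest side, I'm thinking of a namespace-based reproduction roughly along
these lines (namespace/interface names and addresses below are only an
illustration, not the actual pmtu.sh code):

# client <-> {gw1,gw2} links at MTU 9000, {gw1,gw2} <-> server links at MTU 1500
for ns in client gw1 gw2 server; do ip netns add $ns; done
ip link add c_gw1 netns client type veth peer name gw1_c netns gw1
ip link add c_gw2 netns client type veth peer name gw2_c netns gw2
ip link add gw1_s netns gw1 type veth peer name s_gw1 netns server
ip link add gw2_s netns gw2 type veth peer name s_gw2 netns server
ip -n client link set c_gw1 mtu 9000
ip -n client link set c_gw2 mtu 9000
ip -n gw1 link set gw1_c mtu 9000
ip -n gw2 link set gw2_c mtu 9000
ip netns exec gw1 sysctl -qw net.ipv4.ip_forward=1
ip netns exec gw2 sysctl -qw net.ipv4.ip_forward=1
# ... assign addresses, bring links up, add return routes on gateways/server ...
# then on the client:
ip -n client route add 10.0.3.0/24 \
        nexthop via 192.168.1.1 dev c_gw1 weight 1 \
        nexthop via 192.168.2.1 dev c_gw2 weight 1
ip netns exec client ping -s 1700 -c 3 10.0.3.17
ip netns exec client ping -s 1700 -c 3 10.0.3.18
ip netns exec client ip route get 10.0.3.18   # expect a cached "mtu 1500" here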
