netdev - Re: Fw: [Bug 213729] New: PMTUD failure with ECMP.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <55939c8e-8149-1c12-6464-df60933194ac@gmail.com>
Date:   Wed, 14 Jul 2021 12:12:25 -0600
From:   David Ahern <dsahern@...il.com>
To:     Vadim Fedorenko <vfedorenko@...ek.ru>,
        Ido Schimmel <idosch@...sch.org>
Cc:     Stephen Hemminger <stephen@...workplumber.org>,
        netdev@...r.kernel.org
Subject: Re: Fw: [Bug 213729] New: PMTUD failure with ECMP.

On 7/14/21 11:51 AM, Vadim Fedorenko wrote:
> On 14.07.2021 17:30, Ido Schimmel wrote:
>> On Wed, Jul 14, 2021 at 05:11:45PM +0100, Vadim Fedorenko wrote:
>>> On 14.07.2021 16:13, Stephen Hemminger wrote:
>>>>
>>>>
>>>> Begin forwarded message:
>>>>
>>>> Date: Wed, 14 Jul 2021 13:43:51 +0000
>>>> From: bugzilla-daemon@...zilla.kernel.org
>>>> To: stephen@...workplumber.org
>>>> Subject: [Bug 213729] New: PMTUD failure with ECMP.
>>>>
>>>>
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=213729
>>>>
>>>>               Bug ID: 213729
>>>>              Summary: PMTUD failure with ECMP.
>>>>              Product: Networking
>>>>              Version: 2.5
>>>>       Kernel Version: 5.13.0-rc5
>>>>             Hardware: x86-64
>>>>                   OS: Linux
>>>>                 Tree: Mainline
>>>>               Status: NEW
>>>>             Severity: normal
>>>>             Priority: P1
>>>>            Component: IPV4
>>>>             Assignee: stephen@...workplumber.org
>>>>             Reporter: skappen@...sta.com
>>>>           Regression: No
>>>>
>>>> Created attachment 297849
>>>>     -->
>>>> https://bugzilla.kernel.org/attachment.cgi?id=297849&action=edit
>>>> Ecmp pmtud test setup
>>>>
>>>> PMTUD failure with ECMP.
>>>>
>>>> We have observed failures when PMTUD and ECMP work together.
>>>> Ping fails either through gateway1 or gateway2 when using MTU
>>>> greater than
>>>> 1500.
>>>> The Issue has been tested and reproduced on CentOS 8 and mainline
>>>> kernels.
>>>>
>>>>
>>>> Kernel versions:
>>>> [root@...alhost ~]# uname -a
>>>> Linux localhost.localdomain 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun
>>>> 1 16:14:33
>>>> UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>>>>
>>>> [root@...alhost skappen]# uname -a
>>>> Linux localhost.localdomain 5.13.0-rc5 #2 SMP Thu Jun 10 05:06:28
>>>> EDT 2021
>>>> x86_64 x86_64 x86_64 GNU/Linux
>>>>
>>>>
>>>> Static routes with ECMP are configured like this:
>>>>
>>>> [root@...alhost skappen]#ip route
>>>> default proto static
>>>>           nexthop via 192.168.0.11 dev enp0s3 weight 1
>>>>           nexthop via 192.168.0.12 dev enp0s3 weight 1
>>>> 192.168.0.0/24 dev enp0s3 proto kernel scope link src 192.168.0.4
>>>> metric 100
>>>>
>>>> So the host would pick the first or the second nexthop depending on
>>>> ECMP's
>>>> hashing algorithm.
>>>>
>>>> When pinging the destination with MTU greater than 1500 it works
>>>> through the
>>>> first gateway.
>>>>
>>>> [root@...alhost skappen]# ping -s1700 10.0.3.17
>>>> PING 10.0.3.17 (10.0.3.17) 1700(1728) bytes of data.
>>>>   From 192.168.0.11 icmp_seq=1 Frag needed and DF set (mtu = 1500)
>>>> 1708 bytes from 10.0.3.17: icmp_seq=2 ttl=63 time=0.880 ms
>>>> 1708 bytes from 10.0.3.17: icmp_seq=3 ttl=63 time=1.26 ms
>>>> ^C
>>>> --- 10.0.3.17 ping statistics ---
>>>> 3 packets transmitted, 2 received, +1 errors, 33.3333% packet loss,
>>>> time 2003ms
>>>> rtt min/avg/max/mdev = 0.880/1.067/1.255/0.190 ms
>>>>
>>>> The MTU also gets cached for this route as per rfc6754:
>>>>
>>>> [root@...alhost skappen]# ip route get 10.0.3.17
>>>> 10.0.3.17 via 192.168.0.11 dev enp0s3 src 192.168.0.4 uid 0
>>>>       cache expires 540sec mtu 1500
>>>>
>>>> [root@...alhost skappen]# tracepath -n 10.0.3.17
>>>>    1?: [LOCALHOST]                      pmtu 1500
>>>>    1:  192.168.0.11                                          1.475ms
>>>>    1:  192.168.0.11                                          0.995ms
>>>>    2:  192.168.0.11                                          1.075ms !H
>>>>        Resume: pmtu 1500
>>>>
>>>> However when the second nexthop is picked PMTUD breaks. In this
>>>> example I ping
>>>> a second interface configured on the same destination
>>>> from the same host, using the same routes and gateways. Based on
>>>> ECMP's hashing
>>>> algorithm this host would pick the second nexthop (.2):
>>>>
>>>> [root@...alhost skappen]# ping -s1700 10.0.3.18
>>>> PING 10.0.3.18 (10.0.3.18) 1700(1728) bytes of data.
>>>>   From 192.168.0.12 icmp_seq=1 Frag needed and DF set (mtu = 1500)
>>>>   From 192.168.0.12 icmp_seq=2 Frag needed and DF set (mtu = 1500)
>>>>   From 192.168.0.12 icmp_seq=3 Frag needed and DF set (mtu = 1500)
>>>> ^C
>>>> --- 10.0.3.18 ping statistics ---
>>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time
>>>> 2062ms
>>>> [root@...alhost skappen]# ip route get 10.0.3.18
>>>> 10.0.3.18 via 192.168.0.12 dev enp0s3 src 192.168.0.4 uid 0
>>>>       cache
>>>>
>>>> [root@...alhost skappen]# tracepath -n 10.0.3.18
>>>>    1?: [LOCALHOST]                      pmtu 9000
>>>>    1:  192.168.0.12                                          3.147ms
>>>>    1:  192.168.0.12                                          0.696ms
>>>>    2:  192.168.0.12                                          0.648ms
>>>> pmtu 1500
>>>>    2:  192.168.0.12                                          0.761ms !H
>>>>        Resume: pmtu 1500
>>>>
>>>> The ICMP frag needed reaches the host, but in this case it is ignored.
>>>> The MTU for this route does not get cached either.
>>>>
>>>>
>>>> It looks like mtu value from the next hop is not properly updated
>>>> for some
>>>> reason.
>>>>
>>>>
>>>> Test Case:
>>>> Create 2 networks: Internal, External
>>>> Create 4 virtual machines: Client, GW-1, GW-2, Destination
>>>>
>>>> Client
>>>> configure 1 NIC to internal with MTU 9000
>>>> configure static route with ECMP to GW-1 and GW-2 internal address
>>>>
>>>> GW-1, GW-2
>>>> configure 2 NICs
>>>> - to internal with MTU 9000
>>>> - to external MTU 1500
>>>> - enable ip_forward
>>>> - enable packet forward
>>>>
>>>> Target
>>>> configure 1 NIC to external MTU with 1500
>>>> configure multiple IP address(say IP1, IP2, IP3, IP4) on the same
>>>> interface, so
>>>> ECMP's hashing algorithm would pick different routes
>>>>
>>>> Test
>>>> ping from client to target with larger than 1500 bytes
>>>> ping the other addresses of the target so ECMP would use the other
>>>> route too
>>>>
>>>> Results observed:
>>>> Through GW-1 PMTUD works, after the first frag needed message the
>>>> MTU is
>>>> lowered on the client side for this target. Through the GW-2 PMTUD
>>>> does not,
>>>> all responses to ping are ICMP frag needed, which are not obeyed by
>>>> the kernel.
>>>> In all failure cases mtu is not cashed on "ip route get".
>>>>
>>> Looks like I'm in context of PMTU and also I'm working on
>>> implementing several
>>> new test cases for pmtu.sh test, so I will take care of this one too
>>
>> Thanks
>>
>> There was a similar report from around a year ago that might give you
>> more info:
>>
>> https://lore.kernel.org/netdev/CANXY5y+iuzMg+4UdkPJW_Efun30KAPL1+h2S7HeSPp4zOrVC7g@mail.gmail.com/
>>
>>
> 
> Thanks Ido, will definitely look at it!

I believe that one is fixed by 2fbc6e89b2f1403189e624cabaf73e189c5e50c6

The root cause of this problem is icmp's taking a path that the original
packet did not. i.e., the ICMP is received on device 1 and the exception
is created on that device but Rx chooses device 2 (a different leg in
the ECMP).