Message-ID: <9d636726-514b-417f-ab46-6f570a563eed@fastly.com>
Date: Fri, 9 Feb 2024 12:11:07 -0500
From: Suresh Bhogavilli <sbhogavilli@...tly.com>
To: "Nabil S. Alramli" <nalramli@...tly.com>, David Ahern
<dsahern@...nel.org>, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org, pabeni@...hat.com, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Cc: jdamato@...tly.com, srao@...tly.com, dev@...ramli.com
Subject: Re: [net] ipv4: Fix broken PMTUD when using L4 multipath hash
Hi David,

On 10/16/23 2:51 PM, Nabil S. Alramli wrote:
> On 10/13/2023 12:19 PM, David Ahern wrote:
>>> On a node with multiple network interfaces, if we enable layer 4 hash
>>> policy with net.ipv4.fib_multipath_hash_policy=1, path MTU discovery is
>>> broken and TCP connection does not make progress unless the incoming
>>> ICMP Fragmentation Needed (type 3, code 4) message is received on the
>>> egress interface of selected nexthop of the socket.
>> known problem.
>>
>>> This is because build_sk_flow_key() does not provide the sport and dport
>>> from the socket when calling flowi4_init_output(). This appears to be a
>>> copy/paste error of build_skb_flow_key() -> __build_flow_key() ->
>>> flowi4_init_output() call used for packet forwarding where an skb is
>>> present, is passed later to fib_multipath_hash() call, and can scrape
>>> out both sport and dport from the skb if L4 hash policy is in use.
>> are you sure?
>>
>> As I recall the problem is that the ICMP can be received on a different
>> path. When it is processed, the exception is added to the ingress device
>> of the ICMP and not the device the original packet egressed. I have
>> scripts that somewhat reliably reproduced the problem; I started working
>> on a fix and got distracted.
> With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when an
> ICMP packet too big (PTB) message is received on an interface different
> from the socket egress interface, we see a cache entry added to the
> ICMP ingress interface but with parameters matching the route entry
> rather than the MTU reported in the ICMP message.
>
> On the node below, ICMP PTB messages arrive on an interface named
> vlan100. With net.ipv4.fib_multipath_hash_policy=0 (layer 3 hashing),
> the path from this cache to 139.162.188.91 is via another interface
> named vlan200.
>
> When the ICMP PTB message arrives on vlan100, an exception entry does
> get added to vlan200 and the socket's cached MTU gets updated too. The
> TCP connection makes progress (not shown).
>
> sbhogavilli@...e20:~$ ip route sh cache 139.162.188.91 | head
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
>     cache expires 363sec mtu 905 advmss 1460
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
>     cache expires 363sec mtu 905 advmss 1460
>
> With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when TCP
> traffic egresses over vlan200 (with the ICMP PTB message still arriving
> on vlan100), the cache entry still shows an MTU of 1500 on the TCP egress
> interface, vlan200. No exception entry gets added to vlan100, as you noted:
>
> sbhogavilli@...e20:~$ ip route sh cache 139.162.188.91 | head
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
>     cache mtu 1500 advmss 1460
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
>     cache mtu 1500 advmss 1460
>
> In this case, the TCP connection does not make progress, ultimately
> timing out.
>
> If we retry TCP connections until one uses vlan100 to egress, then the
> exception entry does get added with an MTU matching the one reported in
> the ICMP PTB message:
>
> sbhogavilli@...e20:~$ ip route sh cache 139.162.188.91 | head
> 139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
>     cache expires 153sec mtu 905 advmss 1460
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
>     cache mtu 1500 advmss 1460
>
> In this case the TCP connection over vlan100 does make progress.
>
> With the proposed patch applied, an exception entry does get created on
> the socket egress interface even when that is different from the ICMP
> PTB ingress interface. Below is the output after different TCP
> connections have used the two interfaces this node has:
>
> sbhogavilli@...e20:~$ ip route sh cache 139.162.188.91 | head
> 139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
>     cache expires 565sec mtu 905 advmss 1460
> 139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
>     cache expires 562sec mtu 905 advmss 1460
>
> Thank you.
Does that answer your question? Do you need me to make any changes to
get your Reviewed-by?
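
In case it helps the review, here is a toy model (plain Python, not kernel code) of the failure described in the patch description quoted above: if the flow key used for the PMTU update carries no sport/dport, an L4 hash over it can select a different nexthop than the one the socket's traffic actually egressed on, so the exception lands on the wrong nexthop. Everything in it is an assumption made for illustration: the SHA-256 based hash merely stands in for the kernel's multipath hash, the helper names are hypothetical, and the local address 192.0.2.10 and port 443 are made up. Only the interface names and the destination address come from the output above.

```python
import hashlib

# Hypothetical two-nexthop route, named after the interfaces in the output above.
NEXTHOPS = ["vlan100", "vlan200"]


def select_nexthop(saddr, daddr, sport, dport, proto=6):
    """Toy stand-in for an L4 multipath hash: hash the 5-tuple, pick a nexthop."""
    key = f"{saddr}|{daddr}|{sport}|{dport}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return NEXTHOPS[int.from_bytes(digest[:4], "big") % len(NEXTHOPS)]


def pmtu_lookup(saddr, daddr, sport, dport, ports_in_key):
    """Nexthop that would receive the PMTU exception for this socket.

    ports_in_key=False models a flow key built without sport/dport
    (both treated as 0); ports_in_key=True models a key that carries them.
    """
    if ports_in_key:
        return select_nexthop(saddr, daddr, sport, dport)
    return select_nexthop(saddr, daddr, 0, 0)


def count_mismatches(ports_in_key, sports=range(40000, 40200)):
    """Count flows whose PMTU exception lands on a nexthop other than egress."""
    saddr, daddr, dport = "192.0.2.10", "139.162.188.91", 443  # saddr/dport made up
    mismatches = 0
    for sport in sports:
        egress = select_nexthop(saddr, daddr, sport, dport)
        if pmtu_lookup(saddr, daddr, sport, dport, ports_in_key) != egress:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("ports missing from key:", count_mismatches(False), "of 200 flows mismatch")
    print("ports present in key:  ", count_mismatches(True), "of 200 flows mismatch")
```

With the ports in the key, the lookup always agrees with the egress nexthop. Without them, every flow that hashed to the other nexthop is left without an exception, which would be consistent with the stalled connections over vlan200 in the layer 4 case above.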
Best regards,
Suresh