Message-ID: <4be64c29-f495-4fdb-a565-2540745d5412@fastly.com>
Date: Mon, 16 Oct 2023 14:51:21 -0400
From: "Nabil S. Alramli" <nalramli@...tly.com>
To: David Ahern <dsahern@...nel.org>, sbhogavilli@...tly.com,
davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
pabeni@...hat.com, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Cc: jdamato@...tly.com, srao@...tly.com, dev@...ramli.com
Subject: Re: [net] ipv4: Fix broken PMTUD when using L4 multipath hash
Hi David,
Thank you for your quick response.
On 10/13/2023 12:19 PM, David Ahern wrote:
> On 10/12/23 5:40 PM, Nabil S. Alramli wrote:
>> From: Suresh Bhogavilli <sbhogavilli@...tly.com>
>>
>> On a node with multiple network interfaces, if we enable layer 4 hash
>> policy with net.ipv4.fib_multipath_hash_policy=1, path MTU discovery is
>> broken and TCP connection does not make progress unless the incoming
>> ICMP Fragmentation Needed (type 3, code 4) message is received on the
>> egress interface of selected nexthop of the socket.
>
> known problem.
>
>>
>> This is because build_sk_flow_key() does not provide the sport and dport
>> from the socket when calling flowi4_init_output(). This appears to be a
>> copy/paste error of build_skb_flow_key() -> __build_flow_key() ->
>> flowi4_init_output() call used for packet forwarding where an skb is
>> present, is passed later to fib_multipath_hash() call, and can scrape
>> out both sport and dport from the skb if L4 hash policy is in use.
>
> are you sure?
>
> As I recall the problem is that the ICMP can be received on a different
> path. When it is processed, the exception is added to the ingress device
> of the ICMP and not the device the original packet egressed. I have
> scripts that somewhat reliably reproduced the problem; I started working
> on a fix and got distracted.

With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when an
ICMP Packet Too Big (PTB) message is received on an interface different
from the socket egress interface, we see a cache entry added on the
ICMP ingress interface, but with parameters matching the route entry
rather than the MTU reported in the ICMP message.

On the node below, ICMP PTB messages arrive on an interface named
vlan100. With net.ipv4.fib_multipath_hash_policy=0 (layer 3 hashing),
the path from this node to 139.162.188.91 is via another interface
named vlan200.

When the ICMP PTB message arrives on vlan100, an exception entry does
get added on vlan200 and the socket's cached MTU gets updated too. The
TCP connection makes progress (not shown).

sbhogavilli@...e20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
    cache expires 363sec mtu 905 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
    cache expires 363sec mtu 905 advmss 1460

With net.ipv4.fib_multipath_hash_policy=1 (layer 4 hashing), when TCP
traffic egresses over vlan200 (with ICMP PTB messages still arriving on
vlan100), the cache entry still shows an MTU of 1500 on the TCP egress
interface, vlan200. No exception entry gets added to vlan100, as you noted:

sbhogavilli@...e20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
    cache mtu 1500 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
    cache mtu 1500 advmss 1460

In this case, the TCP connection does not make progress, ultimately
timing out.
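
As a rough illustration of this failure (a toy model, not the kernel's
code): if the lookup done on behalf of the socket hashes without the
ports, it can select a different nexthop than the one the connection's
packets actually use. The hash function, the nexthop ordering, and the
source address 192.0.2.10 used in the test are assumptions made up for
this sketch; only the destination and the interface names come from
this node.

```python
import zlib

# Interface names from this node; the ordering is an assumption.
NEXTHOPS = ["vlan100", "vlan200"]


def pick_nexthop(saddr, daddr, sport=0, dport=0, proto=6):
    """Toy L4 multipath selection: hash the 5-tuple, index the nexthop list.

    crc32 is a stand-in for illustration only; it is NOT the hash the
    kernel uses.
    """
    key = f"{saddr}|{daddr}|{sport}|{dport}|{proto}".encode()
    return NEXTHOPS[zlib.crc32(key) % len(NEXTHOPS)]


def mismatched_ports(saddr, daddr, dport, sports):
    """Return the source ports whose egress nexthop differs from the
    nexthop chosen by a lookup that leaves sport and dport at zero."""
    portless = pick_nexthop(saddr, daddr)
    return [sp for sp in sports
            if pick_nexthop(saddr, daddr, sp, dport) != portless]
```

In this model, any connection whose source port lands in the returned
list would have its exception recorded against a nexthop it is not
using, which matches the stalled case shown above.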
If we retry TCP connections until one uses vlan100 to egress, then the
exception entry does get added with an MTU matching the one reported in
the ICMP PTB message:

sbhogavilli@...e20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
    cache expires 153sec mtu 905 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
    cache mtu 1500 advmss 1460

In this case the TCP connection over vlan100 does make progress.
With the proposed patch applied, an exception entry does get created on
the socket egress interface even when that is different from the ICMP
PTB ingress interface. Below is the output after different TCP
connections have used the two interfaces this node has:

sbhogavilli@...e20:~$ ip route sh cache 139.162.188.91 | head
139.162.188.91 encap mpls 240583 via 172.18.144.1 dev vlan100
    cache expires 565sec mtu 905 advmss 1460
139.162.188.91 encap mpls 152702 via 172.18.146.1 dev vlan200
    cache expires 562sec mtu 905 advmss 1460

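
For comparing outputs like the ones above across runs, a small helper
along these lines may be handy. It is only a sketch that assumes the
two-line layout shown in this message (a route line containing "dev",
followed by a "cache" line); the function names are hypothetical, and
the 1500 default is simply the unreduced MTU seen in the outputs above.

```python
def _value_after(tokens, key):
    """Return the token following `key`, or None if `key` is absent."""
    if key in tokens:
        i = tokens.index(key)
        if i + 1 < len(tokens):
            return tokens[i + 1]
    return None


def parse_route_cache(text):
    """Parse `ip route sh cache` output into a list of dicts.

    Each route line (one containing "dev") starts a new entry; a
    following "cache ..." line supplies that entry's MTU.
    """
    entries = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "cache":
            mtu = _value_after(tokens, "mtu")
            if entries and mtu is not None:
                entries[-1]["mtu"] = int(mtu)
        elif "dev" in tokens:
            entries.append({"dst": tokens[0],
                            "dev": _value_after(tokens, "dev"),
                            "mtu": None})
    return entries


def devs_with_reduced_mtu(entries, link_mtu=1500):
    """Return the set of devices whose cached MTU is below link_mtu."""
    return {e["dev"] for e in entries
            if e["mtu"] is not None and e["mtu"] < link_mtu}
```

Fed the output from the unpatched retry case, this reports only
vlan100; fed the output from the patched kernel, it reports both
vlan100 and vlan200.
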
Thank you.