netdev - Re: [PATCH net-next 1/3] udp_tunnel: allow to turn off path mtu discovery on encap sockets

Message-ID: <20200719234940.37adebe7@elisabeth>
Date:   Sun, 19 Jul 2020 23:49:40 +0200
From:   Stefano Brivio <sbrivio@...hat.com>
To:     David Ahern <dsahern@...il.com>
Cc:     Florian Westphal <fw@...len.de>, netdev@...r.kernel.org,
        aconole@...hat.com
Subject: Re: [PATCH net-next 1/3] udp_tunnel: allow to turn off path mtu
 discovery on encap sockets

On Sun, 19 Jul 2020 12:43:55 -0600
David Ahern <dsahern@...il.com> wrote:

> On 7/18/20 11:58 AM, Stefano Brivio wrote:
> > On Sat, 18 Jul 2020 11:02:46 -0600
> > David Ahern <dsahern@...il.com> wrote:
> >   
> >> On 7/18/20 12:56 AM, Stefano Brivio wrote:  
> >>> On Fri, 17 Jul 2020 09:04:51 -0600
> >>> David Ahern <dsahern@...il.com> wrote:
> >>>     
> >>>> On 7/17/20 6:27 AM, Stefano Brivio wrote:    
> >>>>>>      
> >>>>>>> Note that this doesn't work as it is because of a number of reasons
> >>>>>>> (skb doesn't have a dst, pkt_type is not PACKET_HOST), and perhaps we
> >>>>>>> shouldn't be using icmp_send(), but at a glance that looks simpler.        
> >>>>>>
> >>>>>> Yes, it also requires that the bridge has IP connectivity
> >>>>>> to reach the inner ip, which might not be the case.      
> >>>>>
> >>>>> If the VXLAN endpoint is a port of the bridge, that needs to be the
> >>>>> case, right? Otherwise the VXLAN endpoint can't be reached.
> >>>>>       
> >>>>>>> Another slight preference I have towards this idea is that the only
> >>>>>>> known way we can break PMTU discovery right now is by using a bridge,
> >>>>>>> so fixing the problem there looks more future-proof than addressing any
> >>>>>>> kind of tunnel with this problem. I think FoU and GUE would hit the
> >>>>>>> same problem, I don't know about IP tunnels, sticking that selftest
> >>>>>>> snippet to whatever other test in pmtu.sh should tell.        
> >>>>>>
> >>>>>> Every type of bridge port that needs to add additional header on egress
> >>>>>> has this problem in the bridge scenario once the peer of the IP tunnel
> >>>>>> signals a PMTU event.      
> >>>>>
> >>>>> Yes :(    
> >>>>
> >>>> The vxlan/tunnel device knows it is a bridge port, and it knows it is
> >>>> going to push a udp and ip{v6} header. So why not use that information
> >>>> in setting / updating the MTU? That's what I was getting at on Monday
> >>>> with my comment about lwtunnel_headroom equivalent.    
> >>>
> >>> If I understand correctly, you're proposing something similar to my
> >>> earlier draft from:
> >>>
> >>> 	<20200713003813.01f2d5d3@...sabeth>
> >>> 	https://lore.kernel.org/netdev/20200713003813.01f2d5d3@elisabeth/
> >>>
> >>> the problem with it is that it wouldn't help: the MTU is already set to
> >>> the right value for both port and bridge in the case Florian originally
> >>> reported.    
> >>
> >> I am definitely hand waving; I have not had time to create a setup
> >> showing the problem. Is there a reproducer using only namespaces?  
> > 
> > And I'm laser pointing: check the bottom of that email ;)
> >   
> 
> With this test case, the lookup fails:
> 
> [  144.689378] vxlan: vxlan_xmit_one: dev vxlan_a 10.0.1.1/57864 ->
> 10.0.0.0/4789 len 5010 gw 10.0.1.2
> [  144.692755] vxlan: skb_tunnel_check_pmtu: dst dev br0 skb dev vxlan_a
> skb len 5010 encap_mtu 4000 headroom 50
> [  144.697682] vxlan: skb_dst_update_pmtu_no_confirm: calling
> ip_rt_update_pmtu+0x0/0x160/ffffffff825ee850 for dev br0 mtu 3950
> [  144.703601] IPv4: __ip_rt_update_pmtu: dev br0 mtu 3950 old_mtu 5000
> 192.168.2.1 -> 192.168.2.2
> [  144.708177] IPv4: __ip_rt_update_pmtu: fib_lookup failed for
> 192.168.2.1 -> 192.168.2.2
> 
> Because the lookup fails, __ip_rt_update_pmtu skips creating the exception.
> 
> This hack gets the lookup to succeed:
> 
> fl4->flowi4_oif = dst->dev->ifindex;
> or
> fl4->flowi4_oif = 0;

Oh, I didn't consider that... route. :) Here comes an added twist, which
currently needs Florian's changes from:
	https://git.breakpoint.cc/cgit/fw/net-next.git/log/?h=udp_tun_pmtud_12

Test is as follows:

test_pmtu_ipv4_vxlan4_exception_bridge() {
	test_pmtu_ipvX_over_vxlanY_or_geneveY_exception vxlan  4 4

	# Third namespace, acting as another host attached to the bridge
	ip netns add ns-C

	ip -n ns-C link add veth_c_a type veth peer name veth_a_c
	ip -n ns-C link set veth_a_c netns ns-A

	ip -n ns-C addr add 192.168.2.100/24 dev veth_c_a

	ip -n ns-C link set dev veth_c_a mtu 5000
	ip -n ns-C link set veth_c_a up
	ip -n ns-A link set dev veth_a_c mtu 5000
	ip -n ns-A link set veth_a_c up

	# Bridge the new veth and the VXLAN endpoint in ns-A
	ip -n ns-A link add br0 type bridge
	ip -n ns-A link set br0 up
	ip -n ns-A link set dev br0 mtu 5000
	ip -n ns-A link set veth_a_c master br0
	ip -n ns-A link set vxlan_a master br0

	# Move the address from the VXLAN device to the bridge
	ip -n ns-A addr del 192.168.2.1/24 dev vxlan_a
	ip -n ns-A addr add 192.168.2.1/24 dev br0

	ip netns exec ns-C ping -c 1 -w 2 -M want -s 5000 192.168.2.2
}

I haven't checked the test itself recently; I'm just copying from some
local changes I was trying last week, so some commands might be wrong.

The idea is: what if we now have another host (here, it's ns-C) sending
traffic to that bridge? Then the exception on a local interface isn't
enough: we actually need to send Fragmentation Needed back to where the
packet came from, and the bridge won't do it for us (with routing, it
already works).
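
To tell the two cases apart in a test, one option is to look for the
PMTU value in the route exception on each side. A minimal sketch,
assuming the usual "cache expires ...sec mtu N" form of 'ip route get'
output; pmtu_from_route_get is a hypothetical helper name, not something
taken from pmtu.sh:

```shell
# Hypothetical helper: read `ip route get` output on stdin and print the
# value following the first "mtu" keyword, or nothing if there is none.
pmtu_from_route_get() {
	awk '{
		for (i = 1; i < NF; i++)
			if ($i == "mtu") { print $(i + 1); exit }
	}'
}
```

It could be used as 'ip -n ns-A route get 192.168.2.2 | pmtu_from_route_get'
for the local exception, and with -n ns-C to see whether the sender
ever learned a PMTU. Empty output means no exception with an MTU was
found.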

I haven't tried your hack, but I guess it would have the same problem.
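
For reference, the "mtu 3950" in the trace quoted above is just
encap_mtu minus headroom (4000 - 50). A rough sketch of that
arithmetic, assuming the 50 bytes are the usual VXLAN-over-IPv4 overhead
(outer IPv4, UDP, VXLAN and inner Ethernet headers); the function name
is made up for illustration:

```shell
# Hypothetical helper: inner MTU left after VXLAN-over-IPv4 encapsulation.
# Assumed breakdown of the 50-byte headroom seen in the trace:
# 20 (outer IPv4) + 8 (UDP) + 8 (VXLAN) + 14 (inner Ethernet).
vxlan4_inner_mtu() {
	encap_mtu=$1
	headroom=$((20 + 8 + 8 + 14))
	echo $((encap_mtu - headroom))
}

vxlan4_inner_mtu 4000
```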

-- 
Stefano