netdev - Re: [PATCH net-next 1/3] udp_tunnel: allow to turn off path mtu discovery on encap sockets

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200715135811.079ccdb7@elisabeth>
Date:   Wed, 15 Jul 2020 13:58:11 +0200
From:   Stefano Brivio <sbrivio@...hat.com>
To:     Aaron Conole <aconole@...hat.com>
Cc:     Numan Siddique <nusiddiq@...hat.com>,
        Florian Westphal <fw@...len.de>,
        netdev <netdev@...r.kernel.org>, David Ahern <dsahern@...il.com>
Subject: Re: [PATCH net-next 1/3] udp_tunnel: allow to turn off path mtu
 discovery on encap sockets

Hi Aaron,

On Tue, 14 Jul 2020 16:38:29 -0400
Aaron Conole <aconole@...hat.com> wrote:

> Numan Siddique <nusiddiq@...hat.com> writes:
> 
> > On Mon, Jul 13, 2020 at 3:34 PM Stefano Brivio <sbrivio@...hat.com> wrote:  
> >>
> >> On Mon, 13 Jul 2020 10:04:13 +0200
> >> Florian Westphal <fw@...len.de> wrote:
> >>  
> >> > Stefano Brivio <sbrivio@...hat.com> wrote:  
> >> > > Hi,
> >> > >
> >> > > On Sun, 12 Jul 2020 22:07:03 +0200
> >> > > Florian Westphal <fw@...len.de> wrote:
> >> > >  
> >> > > > vxlan and geneve take the to-be-transmitted skb, prepend the
> >> > > > encapsulation header and send the result.
> >> > > >
> >> > > > Neither vxlan nor geneve can do anything about a lowered path mtu
> >> > > > except notifying the peer/upper dst entry.  
> >> > >
> >> > > It could, and I think it should, update its MTU, though. I didn't
> >> > > include this in the original implementation of PMTU discovery for UDP
> >> > > tunnels as it worked just fine for locally generated and routed
> >> > > traffic, but here we go.  
> >> >
> >> > I don't think its a good idea to muck with network config in response
> >> > to untrusted entity.  
> >>
> >> I agree that this (changing MTU on VXLAN) looks like a further step,
> >> but the practical effect is zero: we can't send those packets already
> >> today.
> >>
> >> PMTU discovery has security impacts, and they are mentioned in the
> >> RFCs. Also here, we wouldn't increase the MTU as a result, and if the
> >> entity is considered untrusted, considerations from RFC 8201 and RFC
> >> 4890 cover that.
> >>
> >> In practice, we might have broken networks, but at a practical level, I
> >> guess it's enough to not make the situation any worse.
> >>  
> >> > > As PMTU discovery happens, we have a route exception on the lower
> >> > > layer for the given path, and we know that VXLAN will use that path,
> >> > > so we also know there's no point in having a higher MTU on the VXLAN
> >> > > device, it's really the maximum packet size we can use.  
> >> >
> >> > No, in the setup that prompted this series the route exception is wrong.
> >> > The current "fix" is a shell script that flushes the exception as soon
> >> > as its added to keep the tunnel working...  
> >>
> >> Oh, oops.
> >>
> >> Well, as I mentioned, if this is breaking setups and this series is the
> >> only way to fix things, I have nothing against it, we can still work on
> >> a more comprehensive solution (including the bridge) once we have it.
> >>  
> >> > > > Some setups, however, will use vxlan as a bridge port (or openvs vport).  
> >> > >
> >> > > And, on top of that, I think what we're missing on the bridge is to
> >> > > update the MTU when a port lowers its MTU. The MTU is changed only as
> >> > > interfaces are added, which feels like a bug. We could use the lower
> >> > > layer notifier to fix this.  
> >> >
> >> > I will defer to someone who knows bridges better but I think that
> >> > in bridge case we 100% depend on a human to set everything.  
> >>
> >> Not entirely, MTU is auto-adjusted when interfaces are added (unless
> >> the user set it explicitly), however:
> >>  
> >> > bridge might be forwarding frames of non-ip protocol and I worry that
> >> > this is a self-induced DoS when we start to alter configuration behind
> >> > sysadmins back.  
> >>
> >> ...yes, I agree that the matter with the bridge is different. And we
> >> don't know if that fixes anything else than the selftest I showed, so
> >> let's forget about the bridge for a moment.
> >>  
> >> > > I tried to represent the issue you're hitting with a new test case in
> >> > > the pmtu.sh selftest, also included in the diff. Would that work for
> >> > > Open vSwitch?  
> >> >
> >> > No idea, I don't understand how it can work at all, we can't 'chop
> >> > up'/mangle l2 frame in arbitrary fashion to somehow make them pass to
> >> > the output port.  We also can't influence MTU config of the links peer.  
> >>
> >> Sorry I didn't expand right away.
> >>
> >> In the test case I showed, it works because at that point sending
> >> packets to the bridge will result in an error, and the (local) sender
> >> fragments. Let's set this aside together with the bridge affair, though.
> >>
> >> Back to VXLAN and OVS: OVS implements a "check_pkt_len" action, cf.
> >> commit 4d5ec89fc8d1 ("net: openvswitch: Add a new action
> >> check_pkt_len"), that should be used when packets exceed link MTUs:
> >>
> >>   With the help of this action, OVN will check the packet length
> >>   and if it is greater than the MTU size, it will generate an
> >>   ICMP packet (type 3, code 4) and includes the next hop mtu in it
> >>   so that the sender can fragment the packets.
> >>
> >> and my understanding is that this can only work if we reflect the
> >> effective MTU on the device itself (including VXLAN).
> >>  
> >
> > check_pkt_len is OVS datapath action and the corresponding OVS action
> > is  check_pkt_larger.
> >
> > Logically It is expected to use this way in the OVS flows- >
> >     reg0[0] = check_pkt_larger(1500);
> >     if reg0[0[ == 1; then take some action.
> >
> > In the case of OVN, if the register reg0[0] bit is set, then we
> > generate ICMP error packet (type 3, code 4).
> >
> > I don't know the requirements or the issue this patch is trying to
> > address. But I think for OVS, there has to be
> > a controller (like OVN) which makes use of the check_pkt_larger action
> > and takes necessary action by adding
> > appropriate OF flows based on the result of check_pkt_larger.  
> 
> Hi Numan,
> 
> The issue is that a route exception might lower the MTU for the
> destination, and the controller would need to be made aware of that.
> 
> In that case, it could update any check_packet_len rules.  But it's not
> desirable for this type of rules to be explicitly required at all, imo.
> And for setups where user wants to use action=normal, or no openflow
> controller is used (a large number of OvS deployments), I think it will
> still be broken.

I agree that the controller shouldn't deal with this, because it's not
a configuration topic, unless the user wants to set particular MTUs --
in which case sure, it's configuration.

Open vSwitch is responsible for forwarding packets, why can't it
relay/use the required ICMP messages?

> I've cooked up a change to OvS to correspond to this series:
> 
> https://github.com/orgcandman/ovs/commit/0c063e4443dda1f62c9310bda7f54140b9dc9c31

From the commit message:

> Some network topologies use vxlan/geneve tunnels in a non-routed setup,
> and the presence of a route exception would interfere with packet
> egress, causing unresolvable network conditions due to MTU mismatches.

Florian explained clearly why PMTU discovery is currently not working
*with a Linux bridge* connected to some type of tunnels (I haven't checked
IP tunnels yet), and proposed a solution fixing that problem. However,
he also mentioned that solution doesn't fix the issue with Open vSwitch,
because it's discarding the (additional) ICMP messages implemented there.

I proposed another solution that also works, only for a Linux bridge
though.

Open vSwitch, as far as I understand, doesn't use Linux bridges. What
is the actual problem with Open vSwitch? How does the route exception
"interfere with packet egress"? Why do you have "MTU mismatches"? The
route exception itself is correct. Can you elaborate?

> Some operating systems allow the user to choose the path MTU discovery
> strategy, whether it should ignore dst cache exception information, etc.

Linux does too, you can set /proc/sys/net/ipv4/ip_no_pmtu_disc to 1 and
PMTU discovery is disabled, for IPv4. I don't think that breaking PMTU
discovery only on some selected interfaces, that could be routed
between each other, is a sane option. I also wouldn't recommend
disabling it (for the reasons explained in RFC 1191), especially in a
complex networking environment.

For IPv6, it's "just" strongly recommended by RFC 8200 to implement it.
However, if a node doesn't implement it, it must restrict itself to the
minimum link MTU, 1280 bytes. By allowing to disable PMTU discovery on
IPv6 nodes and not enforcing that limit we introduce a significant
breakage.

-- 
Stefano