Message-ID: <f7tk0pcmru0.fsf@dhcp-25.97.bos.redhat.com>
Date: Thu, 08 Apr 2021 16:41:59 -0400
From: Aaron Conole <aconole@...hat.com>
To: Joe Stringer <joe@...ium.io>
Cc: Networking <netdev@...r.kernel.org>, dev@...nvswitch.org,
Pravin B Shelar <pshelar@....org>,
Eelco Chaudron <echaudro@...hat.com>,
Dan Williams <dcbw@...hat.com>,
Michael Cambria <mcambria@...hat.com>,
Flavio Leitner <fbl@...hat.com>
Subject: Re: [PATCH] openvswitch: perform refragmentation for packets which pass through conntrack
Joe Stringer <joe@...ium.io> writes:
> Hey Aaron, long time no chat :)
Same :)
> On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole@...hat.com> wrote:
>>
>> When a user instructs a flow pipeline to perform connection tracking,
>> there is an implicit L3 operation that occurs - namely the IP fragments
>> are reassembled and then processed as a single unit. After this, new
>> fragments are generated and then transmitted, with the hint that they
>> should be fragmented along the max rx unit boundary. In general, this
>> behavior works well to forward packets along when the MTUs are congruent
>> across the datapath.
>>
>> However, if using a protocol such as UDP on a network with mismatching
>> MTUs, it is possible that the refragmentation will still produce an
>> invalid fragment, and that fragmented packet will not be delivered.
>> Such a case shouldn't happen because the user explicitly requested a
>> layer 3+4 function (conntrack), and that function generates new fragments,
>> so we should perform the needed actions in that case (namely, refragment
>> IPv4 along a correct boundary, or send a packet too big in the IPv6 case).
>>
>> Additionally, introduce a test suite for openvswitch with a test case
>> that ensures this MTU behavior, with the expectation that new tests are
>> added when needed.
>>
>> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
>> Signed-off-by: Aaron Conole <aconole@...hat.com>
>> ---
>> NOTE: checkpatch reports a whitespace error with the openvswitch.sh
>> script - this is due to using tab as the IFS value. This part
>> of the script was copied from
>> tools/testing/selftests/net/pmtu.sh so I think it should be
>> permissible.
>>
>> net/openvswitch/actions.c | 2 +-
>> tools/testing/selftests/net/.gitignore | 1 +
>> tools/testing/selftests/net/Makefile | 1 +
>> tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++
>> 4 files changed, 397 insertions(+), 1 deletion(-)
>> create mode 100755 tools/testing/selftests/net/openvswitch.sh
>>
>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> index 92a0b67b2728..d858ea580e43 100644
>> --- a/net/openvswitch/actions.c
>> +++ b/net/openvswitch/actions.c
>> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
>> if (likely(!mru ||
>> (skb->len <= mru + vport->dev->hard_header_len))) {
>> ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
>> - } else if (mru <= vport->dev->mtu) {
>> + } else if (mru) {
>> struct net *net = read_pnet(&dp->net);
>>
>> ovs_fragment(net, vport, skb, mru, key);
>
> I thought about this for a while. For a bit of context, my
> recollection is that in the initial design, there was an attempt to
> minimize the set of assumptions around L3 behaviour and despite
> performing this pseudo-L3 action of connection tracking, attempt a
> "bump-in-the-wire" approach where OVS is serving as an L2 switch and
> if you wanted L3 features, you need to build them on top or explicitly
> define that you're looking for L3 semantics. In this case, you're
> interpreting that the combination of the conntrack action + an output
> action implies that L3 routing is being performed. Hence, OVS should
> act like a router and either refragment or generate ICMP PTB in the
> case where MTU differs. According to the flow table, the rest of the
> routing functionality (MAC handling for instance) may or may not have
> been performed at this point, but we basically leave that up to the
> SDN controller to implement the right behaviour. In relation to this
> particular check, the idea was to retain the original geometry of the
> packet such that it's as though there were no functionality performed
> in the middle at all. OVS happened to do connection tracking (which
> implicitly involved queueing fragments), but if you treat it as an
> opaque box, you have ports connected and OVS is simply performing
> forwarding between the ports.
I've been going back and forth on this. On the one hand, Open vSwitch
is supposed to care only about 'just' the L2 forwarding, with some
additional processing to assist. Beyond that, it's up to an L3 layer to
provide additional support, and the idea is that the controller or
something else should be supplying this higher-level intelligence.
The issue I have is that we already do some of that high-level
intelligence here to support conntrack, and it's done in a way that is
a bit unintuitive. As an example, you write:
... the idea was to retain the original geometry of the packet such
that it's as though there were no functionality performed in the
middle at all
But the fragmentation engine isn't guaranteed to reassemble exactly the
same packets.
Consider the scenario where an upstream router has performed its own
man-in-the-middle fragmentation. The sequence of packets arriving after
that can look like:
IPID == 1, pkt 1 (1410 bytes), pkt 2 (64 bytes), pkt 3 (1000 bytes)
When reassembled by the frag engine, we will only use the MRU as the
guide, and that will spit out:
IPID == 1, pkt 1 (1410 bytes), pkt 2 (1044 bytes)
We will have reduced the number of packets moving through the network;
at that point we aren't acting as a bump in the wire, but as a real
entity.
I even tested this:
04:28:47 root@...p-25 /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP 172.31.110.2' test_mismatched_mtu_with_conntrack/c1-pkts.cap
16:25:43.481072 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360
16:25:43.525972 IP 172.31.110.2 > 172.31.110.1: ip-proto-17
16:25:43.567272 IP 172.31.110.2 > 172.31.110.1: ip-proto-17
04:28:54 root@...p-25 /home/aconole/git/linux/tools/testing/selftests/net# grep 'IP 172.31.110.2' test_mismatched_mtu_with_conntrack/s1-pkts.cap
16:25:43.567435 IP 172.31.110.2.52352 > 172.31.110.1.ddi-udp-1: UDP, bad length 1901 > 1360
16:25:43.567438 IP 172.31.110.2 > 172.31.110.1: ip-proto-17
Additionally, because this happens transparently for the flow rule user,
we need to run check_pkt_len() after every call to the conntrack action,
because there is no longer any way to know whether the packet came
in via a fragmented path. I guess we could do something with
ip.frag==first|later|no... selection rules to try to build a custom
table for handling fragments - but that seems like a workaround for
the conntrack functionality w.r.t. the fragmentation engine.
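Concretely, the kind of fragment-selection pipeline I have in mind would
look something like this (untested sketch; the bridge name and table
numbers are made up):

  # Fragments take a dedicated conntrack path; everything else skips it.
  ovs-ofctl add-flow br0 "table=0, ip, ip_frag=no, actions=resubmit(,10)"
  ovs-ofctl add-flow br0 "table=0, ip, ip_frag=first, actions=ct(table=5)"
  ovs-ofctl add-flow br0 "table=0, ip, ip_frag=later, actions=ct(table=5)"
  # table 5 would still need its own MTU-aware handling before output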
> One of the related implications is the contrast between what happens
> in this case if you have a conntrack action injected or not when
> outputting to another port. If you didn't put a connection tracking
> action into the flows when redirecting here, then there would be no
> defragmentation or refragmentation. In that case, OVS is just
> attempting to forward to another device and if the MTU check fails,
> then bad luck, packets will be dropped. Now, with the interpretation
> in this patch, it seems like we're trying to say that, well, actually,
> if the controller injects a connection tracking action, then the
> controller implicitly switches OVS into a sort of half-L3 mode for
> this particular flow. This makes the behaviour a bit inconsistent.
I agree, the behavior will be inconsistent w.r.t. L3. But it is
inconsistent right now, too, and at least with this change we will be
consistently inconsistent - the user requests ct(), with the L3
functions that it implies.
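To illustrate what I mean by that, even a minimal pipeline like the
following (untested sketch; the bridge name and port number are made up)
already gets the implicit defragment/refragment behavior, with no
explicit L3 handling visible anywhere in the flow table:

  # ct() reassembles any fragments on the way in; the later output
  # refragments along the recorded MRU.
  ovs-ofctl add-flow br0 "table=0, ip, actions=ct(table=1)"
  ovs-ofctl add-flow br0 "table=1, ip, actions=output:2"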
One other problem with leaving this to the controller is the way we need
to generate ICMP "fragmentation needed" packets in the v4 case. The spec
is quite clear for DF=1: drop and generate. With DF=0 it's less clear
(at least, when I re-checked RFC 791 I didn't see anything, but I might
have missed it). The controller will now receive all of this traffic, I
guess. It's okay for TCP flows that set DF=1, but for UDP (and maybe
other protocols) that isn't the case.
> Another thought that occurs here: say you have three interfaces
> attached to the switch, one with MTU 1500 and two with MTU 1450,
> and the OVS flows are configured to conntrack and clone the packets
> from the higher-MTU interface to the lower-MTU interfaces. If you
> receive larger IP fragments on the first interface and attempt to
> forward them on to the other interfaces, should all interfaces
> generate an ICMPv6 PTB?
I guess they would, for each destination. I don't know whether that's
desirable, but I can see how it would generate a lot of traffic. Then
again, why should it? Would conntrack have to determine that we have
two interfaces in the output: actions?
> That doesn't seem quite right, especially if one of those
> ports is used for mirroring the traffic for operational reasons while
> the other path is part of the actual routing path for the traffic.
I hadn't considered the mirroring case. I guess we would need some
specific metadata for it. I don't know that anyone is making a mirror
port this way, but if the bug report comes in I'll look at it ;)
> You'd end up with duplicate PTB messages for the same outbound
> request. If I read right, this would also not be able to be controlled
> by the OVS controller because when we call into ip6_fragment() and hit
> the MTU-handling path, it will automatically take over and generate
> the ICMP response out the source interface, without any reference to
> the OVS flow table. This seems like it's further breaking the model
> where instead of OVS being a purely programmable L2-like flow
> match+actions pipeline, now depending on the specific actions you
> inject (in particular combinations), you get some bits of the L3
> functionality. But for full L3 functionality, the controller still
> needs to handle the rest through the correct set of actions in the
> flow.
It's made more difficult because the ct() action itself does L3
processing (and I think I demonstrated that above).
> Looking at the tree, it seems like this problem can be solved in
> userspace without further kernel changes by using
> OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net:
> openvswitch: Add a new action check_pkt_len"). It even explicitly says
> "The main use case for adding this action is to solve the packet drops
> because of MTU mismatch in OVN virtual networking solution.". Have you
> tried using this approach?
We looked at it and discussed it a bit. I think the outcome boils down
to this: check_pkt_len needs to be used at every single place a ct()
call occurs, because ct() implies we have connections to monitor, and
that implies L3, so we need to do something (for example, push the
packet to a controller and handle it the way OVN would). This has
implications for older versions of OVS userspace (which don't support
the check_pkt_len action) and for non-OVN controllers (which might just
program a flow pipeline and expect it to work).
I'm not sure what the best approach is really.
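For reference, my understanding is that the flow-table side of that
(via the check_pkt_larger OpenFlow action, which the datapath
check_pkt_len action backs) would look roughly like the sketch below,
repeated for every ct()+output combination in the pipeline. Untested;
the threshold, register, and port numbers are made up:

  ovs-ofctl add-flow br0 "table=0, ip, actions=ct(table=1)"
  ovs-ofctl add-flow br0 "table=1, ip, actions=check_pkt_larger(1450)->reg0[0], resubmit(,2)"
  # oversized after conntrack: punt to the controller to handle
  ovs-ofctl add-flow br0 "table=2, ip, reg0=0x1/0x1, actions=controller"
  ovs-ofctl add-flow br0 "table=2, ip, reg0=0/0x1, actions=output:2"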
>> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
>> index 61ae899cfc17..d4d7487833be 100644
>> --- a/tools/testing/selftests/net/.gitignore
>> +++ b/tools/testing/selftests/net/.gitignore
>> @@ -30,3 +30,4 @@ hwtstamp_config
>> rxtimestamp
>> timestamping
>> txtimestamp
>> +test_mismatched_mtu_with_conntrack
>> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
>> index 25f198bec0b2..dc9b556f86fd 100644
>> --- a/tools/testing/selftests/net/Makefile
>> +++ b/tools/testing/selftests/net/Makefile
>
> Neat to see some bootstrapping of in-tree OVS testing. I'd probably
> put it in a separate commit but maybe that's just personal preference.
I figured I should add it here because it demonstrates the issue I'm
trying to solve. But I agree, it's arguably new functionality, so I'm
okay with submitting this part plus the test cases via net-next instead.
> I didn't look *too* closely at the tests but just one nit below:
>
>> + # test a udp connection
>> + info "send udp data"
>> + ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > ${ovs_dir}/server.pid'
>
> There are multiple netcat implementations with different arguments
> (BSD and nmap.org and maybe also Debian versions). Might be nice to
> point out which netcat you're relying on here or try to detect & fail
> out/skip on the wrong one. For reference, the equivalent OVS test code
> detection is here:
netcat's -l, -v, and -u options are universal (even the old 'hobbit'
1.10 netcat has them), so I don't think we need to do detection for
these options. If a future test needs something special (like
'send-only' for nmap-ncat), then it probably makes sense to hook
something up at that point.
> https://github.com/openvswitch/ovs/blob/80e74da4fd8bfdaba92105560ce144b4b2d00e36/tests/atlocal.in#L175
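That said, if a future test does grow such a dependency, I'd expect the
detection to be a small sketch along these lines (untested; it keys off
ncat's --send-only option, which traditional netcats don't have):

  # Detect an nmap-style ncat; anything else is treated as a
  # traditional netcat, for which -l/-v/-u behave the same way.
  if nc --version 2>&1 | grep -qi ncat; then
          nc_extra_args="--send-only"
  else
          nc_extra_args=""
  fi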