Message-ID: <8760cytnif.fsf@stressinduktion.org>
Date: Mon, 04 Sep 2017 18:52:08 +0200
From: Hannes Frederic Sowa <hannes@...essinduktion.org>
To: Tom Herbert <tom@...bertland.com>
Cc: Saeed Mahameed <saeedm@....mellanox.co.il>,
Saeed Mahameed <saeedm@...lanox.com>,
"David S. Miller" <davem@...emloft.net>,
Linux Netdev List <netdev@...r.kernel.org>
Subject: Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
Hello Tom,
Tom Herbert <tom@...bertland.com> writes:
> On Mon, Sep 4, 2017 at 6:50 AM, Hannes Frederic Sowa
> <hannes@...essinduktion.org> wrote:
>> Tom Herbert <tom@...bertland.com> writes:
>>
>>> An encapsulator sets the UDP source port to be the flow entropy of the
>>> packet being encapsulated. So when the packet traverses the network,
>>> devices can base their hash just on the canonical 5-tuple which is
>>> sufficient for ECMP and RSS. IPv6 flow label is even better since the
>>> middleboxes don't even need to look at the transport header, a packet
>>> is steered based on the 3-tuple of addresses and flow label. In the
>>> Linux stack, udp_flow_src_port is used by UDP encapsulations to set
>>> the source port. Flow label is similarly set by ip6_make_flowlabel.
>>> Both of these functions use the skb->hash which is computed by calling
>>> flow dissector at most once per packet (if the packet was received
>>> with an L4 HW hash or locally originated on a connection the hash does
>>> not need to be computed).
>>
>> This would require the MPLS stack copying the flowlabel of IPv6
>> connections between MPLS routers over their whole lifetime in the MPLS
>> network. The same would hold for MPLS encapsulated inside UDP, the
>> source port needs to be kept constant. This is very impractical. The
>> hash for the flow label can often not be recomputed by interim routers,
>> because they might lack the knowledge of the upper layer protocol.
>>
> Hannes,
>
> When the flow label is set the packet will traverse the network and be
> ECMP routed regardless of whether the payload is MPLS or anything
> else-- the important characteristic is that network devices don't need
> to know how to parse MPLS (or GRE, or IPIP, or L2TP, ESP, or ...) to
> provide good ECMP. At a source the flow label or UDP source port needs
> to be generated. That can be based on DPI, derived from the MPLS
> entropy label, use SPI in ESP, etc. I don't see anything special about
> MPLS in this regard.
The MPLS circuit is only end to end in terms of IP processing if MPLS is
used for multitenant separation.
Normally the IP connection is done between two label switch routers,
thus is not end to end. One LSR will decapsulate the packet and throw
the IP header away, do the label processing and will reencapsulate it
with the new next hop information. To keep the assigned entropy alive it
would have to save the UDP source port or flowlabel and patch the
outgoing IP header again. This is certainly possible; it just seems more
unnatural.
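
As a rough illustration of the point above, the following Python sketch
models an encapsulator that derives the outer UDP source port from the
inner flow, and an LSR hop that terminates the outer header and has to
carry that port over into the new one. Everything here is assumed for
illustration: the names (flow_entropy_port, Encap, lsr_forward), the
CRC32 stand-in for a flow hash, and the port range. It is not how any
particular stack implements this.

```python
import zlib
from dataclasses import dataclass

# Assumed port range for illustration; PORT_MAX is exclusive.
PORT_MIN, PORT_MAX = 49152, 65536


def flow_entropy_port(src, dst, proto, sport, dport):
    """Map an inner 5-tuple to an outer UDP source port.

    CRC32 is only a stand-in for a real flow hash.
    """
    key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
    h = zlib.crc32(key)  # 32-bit unsigned value
    # Scale the 32-bit hash into [PORT_MIN, PORT_MAX).
    return PORT_MIN + ((h * (PORT_MAX - PORT_MIN)) >> 32)


@dataclass(frozen=True)
class Encap:
    """Hypothetical outer header of an MPLS-in-UDP packet."""
    outer_src: str
    outer_dst: str
    udp_sport: int
    label: int


def lsr_forward(pkt, self_addr, next_hop, out_label, keep_entropy=True):
    """Drop the outer header, swap the label, build a new outer header.

    With keep_entropy=True the old UDP source port is saved and patched
    into the new header. Without it the hop has nothing to hash on if it
    cannot parse the payload, so a fixed port is used here.
    """
    sport = pkt.udp_sport if keep_entropy else PORT_MIN
    return Encap(self_addr, next_hop, sport, out_label)
```

The extra state (the saved source port) is exactly what the hop has to
carry from the discarded header to the new one.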
Normally every next hop does MPLS processing and thus the packet
traverses up the stack. Special purpose (entropy) MPLS labels allow the
stack to achieve RSS just based on the label stack and will be
end-to-end in an MPLS cloud.
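
A minimal sketch of hashing on the label stack alone, assuming the
entropy label scheme of RFC 6790 (an Entropy Label Indicator, label
value 7, followed by the entropy label). The helper names (lse,
parse_label_stack, ecmp_key, pick_path) and the CRC32 hash are
hypothetical and only meant to show that the key survives label swaps
of the transport label.

```python
import struct
import zlib

ELI = 7  # Entropy Label Indicator, reserved label value (RFC 6790)


def lse(label, bottom=False, ttl=64):
    """Build one 4-byte label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    return struct.pack("!I", (label << 12) | (int(bottom) << 8) | ttl)


def parse_label_stack(data):
    """Return label values from the top of the stack down to the S bit."""
    labels = []
    offset = 0
    while offset + 4 <= len(data):
        (entry,) = struct.unpack_from("!I", data, offset)
        labels.append(entry >> 12)  # top 20 bits hold the label value
        offset += 4
        if entry & 0x100:           # S bit set: bottom of stack
            break
    return labels


def ecmp_key(labels):
    """Prefer the entropy label if an ELI is present, else the whole stack."""
    if ELI in labels[:-1]:
        return (labels[labels.index(ELI) + 1],)
    return tuple(labels)


def pick_path(data, n_paths):
    """Choose an output path from the label stack only, no payload parsing."""
    key = ecmp_key(parse_label_stack(data))
    return zlib.crc32(repr(key).encode()) % n_paths
```

Swapping the top label from hop to hop leaves the entropy label, and
therefore the key, unchanged.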
>> UDP source port entropy still has the problem that we don't respect the
>> source port as RSS entropy by default in network cards, because of
>> possible fragmentation and thus possible reordering of packets. GRE does
>> not have this problem and is way easier to identify by hardware.
>>
>> Basically we need to tell network cards that they can use specific
>> destination UDP ports where we allow the source port to be used in RSS
>> hash calculation. I don't see how this is any easier than just using GRE
>> with a defined protocol field? I do like the combination of ipv6
>> flowlabel + GRE.
>>
> No, we don't want any more port specific configuration in NICs! The
> NIC should just fall back to a 3-tuple hash when it sees MF or offset set
> in IPv4 header. But even if it doesn't implement this, receiving OOO
> fragments is hardly the end of the world-- IP packets are always
> allowed to be received OOO. If something breaks because in-order
> delivery is assumed then that is the bug that needs to be fixed. So at
> best handling fragmentation in this manner is a proposed
> optimization whose benefits will pale compared to getting good ECMP and RSS
> when encapsulation is in use.
The problem is that you end up having two streams, one fragmented and
one non-fragmented, but actually they belong to the same stream. It is
known to break stuff, see:
<https://patchwork.ozlabs.org/patch/59235/>
I would agree with you, but we can't break existing setups,
unfortunately.
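
To make the two-streams concern concrete, here is a small Python sketch
of an RSS key selection that falls back to addresses and protocol when
MF or a fragment offset is set. The packet model (Ipv4Udp), the function
names, and the CRC32 hash are assumptions for illustration, not the
behaviour of any specific NIC.

```python
import zlib
from dataclasses import dataclass
from typing import Optional

UDP = 17  # IP protocol number for UDP


@dataclass(frozen=True)
class Ipv4Udp:
    """Hypothetical view of the header fields a NIC would look at."""
    src: str
    dst: str
    more_fragments: bool = False
    frag_offset: int = 0
    sport: Optional[int] = None  # ports exist only in the first fragment
    dport: Optional[int] = None


def rss_key(pkt):
    """Pick the hash input: no ports for fragments, ports otherwise."""
    if pkt.more_fragments or pkt.frag_offset:
        return (pkt.src, pkt.dst, UDP)
    return (pkt.src, pkt.dst, UDP, pkt.sport, pkt.dport)


def rss_queue(pkt, n_queues):
    """Map the key to a receive queue (CRC32 stands in for the NIC hash)."""
    return zlib.crc32(repr(rss_key(pkt)).encode()) % n_queues
```

All fragments of a datagram share one key, so they stay together, but
unfragmented packets of the same flow hash on a different key and may
land on a different queue. That is the split into two streams described
above.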
>> Btw. people are using the GRE Key as additional entropy without looking
>> into the GRE payload.
>>
> Sure some are, but the GRE key is not defined to be flow entropy, so
> it's not ubiquitous that it is used for that, that it gives sufficient
> entropy, or is even constant per flow. GRE/UDP (RFC8086) was primarily written to
> allow a more consistent method (as was RFC7510 for doing MPLS/UDP).
I agree with that, just wanted to mention it.
Bye,
Hannes