Message-ID: <CAEh+42hStEh=5z8KBic0hmEZNti4tjahySjjv7kU3YUes_R4Zg@mail.gmail.com>
Date:	Fri, 19 Feb 2016 16:08:04 -0800
From:	Jesse Gross <jesse@...nel.org>
To:	Alex Duyck <aduyck@...antis.com>
Cc:	Linux Kernel Network Developers <netdev@...r.kernel.org>,
	David Miller <davem@...emloft.net>,
	Alexander Duyck <alexander.duyck@...il.com>
Subject: Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <aduyck@...antis.com> wrote:
> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <jesse@...nel.org> wrote:
>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <aduyck@...antis.com> wrote:
>>> This patch series makes it so that we enable the outer Tx checksum for IPv4
>>> tunnels by default.  This makes the behavior consistent with how we were
>>> handling this for IPv6.  In addition I have updated the internal flags for
>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>>> match up well with the ZERO_CSUM6_TX flag which was already in use for
>>> IPv6.
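
A minimal sketch of the flag semantics described above (identifier
names follow this cover letter; the exact in-tree names may differ):
both address families now express "zero the checksum" rather than
"enable the checksum", so the outer checksum is on unless explicitly
disabled.

    /* Sketch only: flag names per the description above. */
    static bool tunnel_wants_tx_csum(u32 flags, bool ipv6)
    {
            if (ipv6)
                    return !(flags & ZERO_CSUM6_TX); /* existing IPv6 rule */
            return !(flags & ZERO_CSUM_TX);          /* new IPv4 rule */
    }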
>>>
>>> For most network devices this should be a net gain in terms of performance
>>> as having the outer header checksum present allows for devices to report
>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
>>> to determine if the inner header checksum is valid.
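
A condensed sketch of that conversion, close in spirit to the kernel's
skb_checksum_try_convert() helper (call sites and details simplified
here): a verified outer UDP checksum means the one's-complement sum of
the datagram equals the complement of its pseudo-header sum, so a
CHECKSUM_COMPLETE value can be derived without re-reading the payload.

    static void try_convert_to_complete(struct sk_buff *skb, __wsum pseudo)
    {
            if (skb->ip_summed != CHECKSUM_UNNECESSARY)
                    return;
            /* Valid checksum => sum(datagram) == ~sum(pseudo-header). */
            skb->csum = ~pseudo;
            skb->ip_summed = CHECKSUM_COMPLETE;
    }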
>>>
>>> Below is some data I collected with ixgbe on an X540 that demonstrates
>>> this.  I located two PFs connected back to back in two different
>>> namespaces and then set up a pair of tunnels on each, one with checksums
>>> enabled and one without.
>>>
>>> Recv   Send    Send                          Utilization
>>> Socket Socket  Message  Elapsed              Send
>>> Size   Size    Size     Time     Throughput  local
>>> bytes  bytes   bytes    secs.    10^6bits/s  % S
>>>
>>> noudpcsum:
>>>  87380  16384  16384    30.00      8898.67   12.80
>>> udpcsum:
>>>  87380  16384  16384    30.00      9088.47   5.69
>>>
>>> The one spot where this may cause a performance regression is if the
>>> environment contains devices that can parse the inner headers and a device
>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>>> the case of such a device we have to fall back to using GSO to segment the
>>> tunnel instead of TSO and as a result we may take a performance hit as seen
>>> below with i40e.
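
That fallback amounts to a feature check at transmit time; roughly (a
sketch of the idea, not the exact in-tree logic):

    /* If an outer checksum is wanted but the device can only segment
     * tunnels without one, strip the GSO feature bits so the stack
     * falls back to software GSO instead of hardware TSO.
     */
    static netdev_features_t tunnel_features_check(netdev_features_t features,
                                                   bool outer_csum)
    {
            if (outer_csum && !(features & NETIF_F_GSO_UDP_TUNNEL_CSUM))
                    features &= ~NETIF_F_GSO_MASK;
            return features;
    }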
>>
>> Do you have any numbers from 40G links? Obviously, at 10G the links
>> are basically saturated and while I can see a difference in the
>> utilization rate, I suspect that the change will be much more apparent
>> at higher speeds.
>
> Unfortunately I don't have any true 40G links to test with.  The
> closest I can get is to run PF to VF on an i40e.  Running that I have
> seen the numbers go from about 20Gb/s to 15Gb/s, with almost all of the
> difference related to the fact that we have to allocate/free more skbs
> and make more trips through the i40e_lan_xmit_frame function, resulting
> in more descriptors.

OK, I guess that is more or less in line with what I would expect off
the top of my head. There is a reasonably significant drop in the worst
case.

>> I'm concerned about the drop in performance for devices that currently
>> support offloads (almost none of which expose
>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
>> care most about tunnel performance are the ones that already have
>> these NICs and will be the most impacted by the drop.
>
> The problem is that being able to transmit fast is kind of pointless
> if the receiving end cannot handle it.  We hadn't gotten around to
> really getting the Rx checksum bits working until the 3.18 kernel,
> which I suspect not many people are running, so at this point messing
> with the TSO bits isn't really making much of a difference.  Then on top of
> that most devices have certain limitations on how many ports they can
> handle and such.  I know the i40e is supposed to support something
> like 10 port numbers, but the fm10k and ixgbe are limited to one port
> as I recall.  So this whole thing is already really brittle as it is.
> My goal with this change is to make the behavior more consistent
> across the board.

That's true to some degree but there are certainly plenty of cases
where TSO makes a difference - lower CPU usage, transmitting to
multiple receivers, people will upgrade their kernels, etc. It's
clearly good to make things more consistent but hopefully not by
reducing existing performance. :)

>> My hope is that we can continue to use TSO on devices that only
>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
>> length field may vary across segments. However, in practice it varies
>> only on the final segment, and only in cases where the total length
>> is not a multiple of the MSS. If we could detect cases where those
>> conditions are met, we could continue to use TSO with the UDP checksum
>> field pre-populated. A possible step even further would be to break
>> off the final segment into a separate packet to make things conform if
>> necessary. This would avoid a performance regression and I think make
>> this more palatable to a lot of people.
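
The condition described above is cheap to detect; a sketch (the helper
name and the payload computation are hypothetical):

    /* If the payload being segmented is an exact multiple of the MSS,
     * every segment -- including the last -- has the same outer UDP
     * length, so a pre-populated outer header is valid for all of them.
     */
    static bool outer_udp_len_constant(const struct sk_buff *skb,
                                       unsigned int payload_len)
    {
            return payload_len % skb_shinfo(skb)->gso_size == 0;
    }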
>
> I think Tom and I had discussed this possibility a bit at netconf.
> The GSO logic is something I planned on looking at over the next
> several weeks as I suspect there is probably room for improvement
> there.

That sounds great.

>>> I also haven't investigated the effect this will have on OVS.  However, I
>>> suspect the impact should be minimal, as the worst-case scenario is that
>>> Tx checksumming becomes enabled by default, which is consistent with the
>>> existing behavior for IPv6.
>>
>> I don't think that it should cause any problems.
>
> Good to hear.
>
> Do you know if OVS has some way to control the VXLAN configuration so
> that it could disable Tx checksums?  If so, that would probably be a
> good way to address the 40G issues, assuming someone is running an
> environment that has nothing but NICs that can support TSO and Rx
> checksums on inner headers.

Yes - OVS can control tx checksums on a per-endpoint basis (it can
actually enforce rx checksum-present requirements as well, though that
isn't exposed to the user at the moment). If you had the information
then you could optimize what to use in an environment of, say,
hypervisors and hardware switches.

However, it's certainly possible that you have a mixed set of NICs
such as an encap aware NIC on the transmit side and non-aware on the
receive side. In that case, both possible checksum settings penalize
somebody: off (lose GRO on receiver), on (lose TSO on sender assuming
no support for NETIF_F_GSO_UDP_TUNNEL_CSUM). That's why I think it's
important to be able to use encap TSO with a local checksum to avoid
these bad tradeoffs, not to mention that it's cleaner.
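
For reference, a sketch of the local checksum idea (this is roughly
what the kernel's lco_csum() helper implements), assuming the inner
transport checksum is set up for offload (CHECKSUM_PARTIAL) and that
skb_transport_header() points at the outer UDP header at this stage:

    /* Once the NIC fills the inner checksum, the bytes from csum_start
     * onward will sum to the complement of the pseudo-header value
     * currently stored in the inner checksum field, so the outer UDP
     * checksum only has to cover the bytes in between.
     */
    static __wsum outer_csum_via_lco(struct sk_buff *skb)
    {
            unsigned char *csum_start = skb_checksum_start(skb);
            unsigned char *outer_udp = skb_transport_header(skb);
            __wsum partial;

            partial = ~csum_unfold(*(__force __sum16 *)(csum_start +
                                                        skb->csum_offset));
            return csum_partial(outer_udp, csum_start - outer_udp, partial);
    }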
