Message-ID: <568FD689.3070300@solarflare.com>
Date: Fri, 8 Jan 2016 15:32:25 +0000
From: Edward Cree <ecree@...arflare.com>
To: Alexander Duyck <alexander.duyck@...il.com>
CC: David Miller <davem@...emloft.net>,
Netdev <netdev@...r.kernel.org>,
<linux-net-drivers@...arflare.com>,
Tom Herbert <tom@...bertland.com>
Subject: Re: [PATCH v2 net-next 1/5] net: local checksum offload for
encapsulation
On 07/01/16 22:53, Alexander Duyck wrote:
> On Thu, Jan 7, 2016 at 9:12 AM, Edward Cree <ecree@...arflare.com> wrote:
>> The arithmetic properties of the ones-complement checksum mean that a
>> correctly checksummed inner packet, including its checksum, has a ones
>> complement sum depending only on whatever value was used to initialise
>> the checksum field before checksumming (in the case of TCP and UDP,
>> this is the ones complement sum of the pseudo header, complemented).
>> Consequently, if we are going to offload the inner checksum with
>> CHECKSUM_PARTIAL, we can compute the outer checksum based only on the
>> packet data not covered by the inner checksum, and the initial value of
>> the inner checksum field.
>>
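To illustrate that property for anyone following along, here's a quick
userspace demo (made-up payload words; 'init' stands in for the folded
pseudo-header sum that gets put in the checksum field before offload):

        #include <stdint.h>
        #include <stdio.h>

        /* 16-bit ones-complement sum of n words, added onto sum */
        static uint16_t csum16(const uint16_t *p, int n, uint32_t sum)
        {
                while (n--)
                        sum += *p++;
                while (sum >> 16)       /* fold carries back in */
                        sum = (sum & 0xffff) + (sum >> 16);
                return sum;
        }

        int main(void)
        {
                uint16_t seg[4] = { 0x1234, 0xabcd, 0 /* check */, 0x0f0f };
                uint16_t init = 0xbeef; /* stand-in for folded pseudo-header sum */

                seg[2] = init;
                seg[2] = ~csum16(seg, 4, 0);    /* what the device writes */
                /* The finished segment, including its checksum field, sums
                 * to ~init, whatever the payload words were:
                 */
                printf("%04x == %04x\n", csum16(seg, 4, 0), (uint16_t)~init);
                return 0;
        }

which prints the same value twice (4110, with these words).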
>> Signed-off-by: Edward Cree <ecree@...arflare.com>
>> ---
>>  include/linux/skbuff.h    | 26 ++++++++++++++++++++++++++
>>  net/ipv4/ip_tunnel_core.c |  4 ++++
>>  net/ipv4/udp.c            | 29 ++++++++++-------------------
>>  net/ipv6/ip6_checksum.c   | 24 ++++++++----------------
>>  4 files changed, 48 insertions(+), 35 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 6b6bd42..6590d08 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -3665,5 +3665,31 @@ static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb)
>>  	return hdr_len + skb_gso_transport_seglen(skb);
>>  }
>>
>> +/* Local Checksum Offload.
>> + * Compute outer checksum based on the assumption that the
>> + * inner checksum will be offloaded later.
>> + * See Documentation/networking/tx-offloads.txt for
>> + * explanation of how this works.
>> + * Fill in outer checksum adjustment (e.g. with sum of outer
>> + * pseudo-header) before calling.
>> + * Also ensure that inner checksum is in linear data area.
>> + */
>> +static inline __wsum lco_csum(struct sk_buff *skb)
>> +{
>> +	char *inner_csum_field;
>> +	__wsum csum;
>> +
>> +	/* Start with complement of inner checksum adjustment */
>> +	inner_csum_field = skb->data + skb_checksum_start_offset(skb) +
>> +			   skb->csum_offset;
> You would probably benefit from caching off the result of
> skb_checksum_start_offset into a local variable so the compiler
> doesn't go through and recompute it when you call it again below.
It's a nearly-trivial inline function; won't the compiler be smart enough to
cache that result itself?
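For reference, the function in question is just this (include/linux/skbuff.h):

        static inline int skb_checksum_start_offset(const struct sk_buff *skb)
        {
                return skb->csum_start - skb_headroom(skb);
        }

I'd expect that to collapse into a common subexpression, but I'm happy to
use a local if you'd prefer.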
>> +	csum = ~csum_unfold(*(__force __sum16 *)inner_csum_field);
> This seems like a lot of work; couldn't you get away with just
> bit-flipping this and moving it into uh->check on the outer header?
It's not a lot of work: all this does is zero-extend to 32 bits and flip.
It looks like more, but most of it is just a cast; it's written in this way
to pacify sparse while using as little __force as possible.
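For reference, csum_unfold() is itself just the cast (include/net/checksum.h):

        static inline __wsum csum_unfold(__sum16 n)
        {
                return (__force __wsum)n;
        }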
lco_csum() can't move it into uh->check, because it doesn't have uh. In
fact, the skb might not even be UDP - this function is intended to be used
for GRE as well, which has an IP-style checksum, but in a different place.
(The next version of the patch series will implement that, btw.)
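Roughly like this - an untested sketch, names invented, assuming skb->data
points at the just-pushed GRE header when it's called. GRE's checksum field
is the first option word after the base header, and there's no pseudo-header,
so the outer adjustment is simply zero:

        static void gre_lco_csum(struct sk_buff *skb, struct gre_base_hdr *greh)
        {
                __sum16 *check = (__sum16 *)(greh + 1);

                /* Zero the outer adjustment (no pseudo-header for GRE),
                 * then fold the LCO sum into the checksum field.
                 */
                *check = 0;
                *check = csum_fold(lco_csum(skb));
        }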
>> +	/* Add in checksum of our headers (incl. outer checksum
>> +	 * adjustment filled in by caller)
>> +	 */
>> +	csum = skb_checksum(skb, 0, skb_checksum_start_offset(skb), csum);
>> +	/* The result is the outer checksum */
>> +	return csum;
>> +}
>> +
> The more I think about it, the less sure I am that there's much to be
> gained by having this as a separate function anyway, since I think you
> might be able to exploit things better with a few changes to the
> ordering of operations. See my notes below in the IPv4 section.
>
>> #endif /* __KERNEL__ */
>> #endif /* _LINUX_SKBUFF_H */
>> diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
>> index 1db8418..f39b064 100644
>> --- a/net/ipv4/ip_tunnel_core.c
>> +++ b/net/ipv4/ip_tunnel_core.c
>> @@ -146,6 +146,10 @@ struct metadata_dst *iptunnel_metadata_reply(struct metadata_dst *md,
>> }
>> EXPORT_SYMBOL_GPL(iptunnel_metadata_reply);
>>
>> +/* csum_help should only ever be true if the protocol doesn't support LCO.
>> + * If the tunnel uses udp_tunnel_xmit_skb(), then it gets LCO for free, and
>> + * should always set csum_help to false.
>> + */
>> struct sk_buff *iptunnel_handle_offloads(struct sk_buff *skb,
>> 					 bool csum_help,
>> 					 int gso_type_mask)
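i.e. a UDP-based tunnel driver would do something like this (illustrative
only):

        /* UDP tunnels get LCO via udp_tunnel_xmit_skb(), so never ask
         * iptunnel_handle_offloads() for csum_help.
         */
        skb = iptunnel_handle_offloads(skb, false, SKB_GSO_UDP_TUNNEL);
        if (IS_ERR(skb))
                return PTR_ERR(skb);

leaving the inner checksum offloaded as CHECKSUM_PARTIAL.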
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 8841e98..c1c73be 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -767,32 +767,23 @@ void udp_set_csum(bool nocheck, struct sk_buff *skb,
>>  {
>>  	struct udphdr *uh = udp_hdr(skb);
>>
>> -	if (nocheck)
>> +	if (nocheck) {
>>  		uh->check = 0;
>> -	else if (skb_is_gso(skb))
>> +	} else if (skb_is_gso(skb)) {
>>  		uh->check = ~udp_v4_check(len, saddr, daddr, 0);
>> -	else if (skb_dst(skb) && skb_dst(skb)->dev &&
>> -		 (skb_dst(skb)->dev->features &
>> -		  (NETIF_F_IP_CSUM | NETIF_F_HW_CSUM))) {
>> -
>> -		BUG_ON(skb->ip_summed == CHECKSUM_PARTIAL);
>> +	} else if (skb->ip_summed == CHECKSUM_PARTIAL) {
>> +		__wsum csum;
>>
> I wonder if this shouldn't be made a check that is in addition to the
> two options below, instead of completely replacing them. The question
> I would have is whether there are any cases where we need to follow
> the path that results in CHECKSUM_UNNECESSARY being set.
I don't think there can be such a case.
Either: we've already set up PARTIAL for an inner header, so we can
definitely do LCO.
Or: we haven't set up PARTIAL yet, so we can use that now. If the device
doesn't support it, it'll get fixed up later when we validate_xmit_skb().
So there's no way (AFAICT) that we'd ever not be able to use PARTIAL.
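The fixup I mean is this bit of validate_xmit_skb() in net/core/dev.c
(quoting roughly from memory):

        if (skb->ip_summed == CHECKSUM_PARTIAL) {
                if (skb->encapsulation)
                        skb_set_inner_transport_header(skb,
                                                       skb_checksum_start_offset(skb));
                else
                        skb_set_transport_header(skb,
                                                 skb_checksum_start_offset(skb));
                /* Device can't checksum this packet; do it in software */
                if (!(features & NETIF_F_CSUM_MASK) &&
                    skb_checksum_help(skb))
                        goto out_kfree_skb;
        }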
Unless - hmmm - what happens if we've set up PARTIAL for a CRC (e.g. SCTP)
rather than an IP checksum? But it looks to me as if the old code would
already have screwed up in that case: iptunnel_handle_offloads() would do
the inner csum in skb_checksum_help(), which computes an IP checksum. So
I'm guessing this can't happen - or it's already broken, and my patch won't
make it any worse ;)

However, the next version of the patch series will split this change out
from the rest of the patch, as Tom Herbert suggested.
>
>> +		uh->check = ~udp_v4_check(len, saddr, daddr, 0);
>> +		csum = lco_csum(skb);
>> +		uh->check = csum_fold(csum);
>> +		if (uh->check == 0)
>> +			uh->check = CSUM_MANGLED_0;
>
> You would probably benefit from reordering this to something like what
> we have in the last block below this one. The idea is then you only
> have to fold things once and can avoid some unnecessary duplication.
>
> So you could code it up with something like:
> __wsum csum;
> int start_offset;
>
> start_offset = skb_checksum_start_offset(skb);
> uh->check = ~(*(__sum16 *)(skb->data + start_offset + skb->csum_offset));
> csum = skb_checksum(skb, 0, start_offset, 0);
> uh->check = udp_v4_check(len, saddr, daddr, csum);
> if (uh->check == 0)
> uh->check = CSUM_MANGLED_0;
>
> Forgive the formatting, my email client mangles tabs badly. By using
> the pseudo header checksum from the inner header for the starting
> value and then computing the udp_v4_check for the outer header last
> you save yourself a few cycles since you only have to fold the
> checksum once instead of once for the pseudo-header and again for the
> final result.
Hmm, I think we can do this without losing the helper function (which will
be shared not just by UDPv4 and UDPv6 but also GRE).
Something like this:
        uh->check = 0;
        uh->check = udp_v4_check(len, saddr, daddr, lco_csum(skb));
        if (uh->check == 0)
                uh->check = CSUM_MANGLED_0;
Now the only fold is the one udp_v4_check() does.
Would that shave off enough cycles to satisfy?