netdev - RE: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming


Message-ID: <945f6cafc86b4f1bb18fa40e60d5c113@AcuMS.aculab.com>
Date:   Mon, 2 Mar 2020 15:19:29 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Yadu Kishore' <kyk.segfault@...il.com>,
        David Miller <davem@...emloft.net>
CC:     Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Network Development <netdev@...r.kernel.org>
Subject: RE: [PATCH v2] net: Make skb_segment not to compute checksum if
 network controller supports checksumming

From: Yadu Kishore
> Sent: 02 March 2020 08:43
> 
> On Mon, Mar 2, 2020 at 12:22 PM Yadu Kishore <kyk.segfault@...il.com> wrote:
> >
> > > > Can you contrast this against a run with your changes? The thought is
> > > > that the majority of this cost is due to the memory loads and stores, not
> > > > the arithmetic ops to compute the checksum. When enabling checksum
> > > > offload, the same stalls will occur, but will simply be attributed to
> > > > memcpy instead of to do_csum.
> >
> > > Agreed.
> >
> > Below is the data from perf with and without the patch for the same
> > TCP Tx iperf run: (network driver has NETIF_F_HW_CSUM enabled)
> >
> A small correction to the data I sent earlier:
> 
> Without patch :-
>  ============
> [Function = %cpu cycles]
> skb_mac_gso_segment = 0.05
> inet_gso_segment = 0.26
> tcp4_gso_segment = 0.05
> tcp_gso_segment = 0.17
> skb_segment = 0.55
> skb_copy_and_csum_bits = 0.60
> do_csum = 7.43
> memcpy = 3.81
> __alloc_skb = 0.93
> ==================
> SUM = 13.85
> 
> With patch :-
> ============
> [Function = %cpu cycles]
> skb_mac_gso_segment = 0.05
> inet_gso_segment = 0.34
> tcp4_gso_segment = 0.06
> tcp_gso_segment = 0.26
> skb_segment = 0.55
> ** skb_copy_bits = 0.62 **  <-- corrected
> do_csum = 0.04
> memcpy = 4.29
> __alloc_skb = 0.73
> ==================
> ** SUM = 6.94 ** <-- corrected

I was writing a reply about how 'horrid' the asm in csum_partial_copy_generic is.
(My guess is no better than 3 clocks for 16 bytes - and a 'rep movs' copy
could be a lot faster.)
However, the big difference in the data above is in the times for do_csum().

The fastest loop I can find for a non-copying checksum is:
+       /*
+        * Align the byte count to a multiple of 16 then
+        * add 64 bit words to alternating registers.
+        * Finally reduce to 64 bits.
+        */
+       asm(    "       bt    $4, %[len]\n"
+               "       jnc   10f\n"
+               "       add   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   16(%[len]), %[len]\n"
+               "10:    jecxz 20f\n"
+               "       adc   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   32(%[len]), %[len_tmp]\n"
+               "       adc   16(%[buff], %[len]), %[sum_0]\n"
+               "       adc   24(%[buff], %[len]), %[sum_1]\n"
+               "       mov   %[len_tmp], %[len]\n"
+               "       jmp   10b\n"
+               "20:    adc   %[sum_0], %[sum]\n"
+               "       adc   %[sum_1], %[sum]\n"
+               "       adc   $0, %[sum]\n"
+           : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+               [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+           : [buff] "r" (buff)
+           : "memory" );

(I need to resubmit that patch)
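For readers who don't want to trace the asm, here is a rough portable C sketch of the same idea: add 64-bit words alternately into two accumulators with end-around carry, then reduce. It is only an illustration, not the patch itself. The names add64_eac, csum64_two_acc and csum_fold64 are made up for this example, and it assumes the length is already a multiple of 16 (the asm handles the odd 16-byte block separately). A compiler will not necessarily turn this into the adc chain above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Ones' complement add: add b into *acc and feed any carry back in. */
static inline void add64_eac(uint64_t *acc, uint64_t b)
{
	*acc += b;
	if (*acc < b)		/* the add wrapped, so a carry was lost */
		(*acc)++;	/* cannot wrap again: the wrapped sum is <= 2^64 - 2 */
}

/*
 * Sum 'len' bytes (assumed to be a multiple of 16) as 64-bit words,
 * alternating between two accumulators so the two dependency chains
 * are independent, then reduce both into 'sum'.
 */
static uint64_t csum64_two_acc(const unsigned char *buff, size_t len,
			       uint64_t sum)
{
	uint64_t sum_0 = 0, sum_1 = 0, w0, w1;
	size_t i;

	for (i = 0; i + 16 <= len; i += 16) {
		memcpy(&w0, buff + i, 8);	/* unaligned-safe loads */
		memcpy(&w1, buff + i + 8, 8);
		add64_eac(&sum_0, w0);
		add64_eac(&sum_1, w1);
	}
	add64_eac(&sum, sum_0);
	add64_eac(&sum, sum_1);
	return sum;
}

/* Fold a 64-bit ones' complement sum down to 16 bits. */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

The folded result is the un-inverted 16-bit sum in native byte order; a caller would still have to complement it before putting it in a header.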

Which is twice as fast as the one in csum-partial_64.c for
most cpu types (especially pre-Haswell ones).

IIRC that does 7 bytes/clock on my Ivy Bridge and
8 bytes/clock on Haswell.
Nothing prior to Sandy Bridge can beat adding 32bit words
to a 64bit register!
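To show what adding 32bit words to a 64bit register looks like, here is a minimal C sketch. It is an illustration only, not code from any kernel file. The names csum32_into_64 and fold_to_16 are invented for the example, and it assumes the length is a multiple of 4. No carry handling is needed in the loop because a 64-bit accumulator cannot overflow until 2^32 such words (16GiB of data) have been added.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sum 'len' bytes (assumed to be a multiple of 4) by zero-extending
 * each 32-bit word and adding it to a 64-bit accumulator.
 * The carries simply pile up in the high 32 bits.
 */
static uint64_t csum32_into_64(const unsigned char *buff, size_t len)
{
	uint64_t sum = 0;
	uint32_t w;
	size_t i;

	for (i = 0; i + 4 <= len; i += 4) {
		memcpy(&w, buff + i, 4);	/* unaligned-safe load */
		sum += w;			/* plain add, no adc needed */
	}
	return sum;
}

/* Fold the 64-bit total down to a 16-bit ones' complement sum. */
static uint16_t fold_to_16(uint64_t sum)
{
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

This reads only 4 bytes per add, so it tops out at half the bandwidth of a 64-bit adc chain, but it has no dependency on the carry flag.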
The faffing with 'len_tmp' made a marginal difference.

You can't go faster than 1 read/clock without using ad[oc]x.
The loop structure above does work for that (both carry and overflow
are preserved) but it needs further unrolling (and more edge code)
and only runs on a few CPUs.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)