netdev - Re: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming

Message-ID: <CABGOaVQMq-AxwQOJ5DdDY6yLBOXqBg6G7qC_MdOYj_z4y-QQiw@mail.gmail.com>
Date:   Tue, 3 Mar 2020 14:45:12 +0530
From:   Yadu Kishore <kyk.segfault@...il.com>
To:     David Laight <David.Laight@...lab.com>
Cc:     David Miller <davem@...emloft.net>,
        Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH v2] net: Make skb_segment not to compute checksum if
 network controller supports checksumming

Hi David,

Thanks for the response.

> I was writing a reply about how 'horrid' the asm in csum_partial_copy_generic is.
> (My guess is no better than 3 clocks for 16 bytes - and a 'rep movs' copy
> could be a lot faster.)
> However, the big difference is in the times for do_csum().
>
> The fastest loop I can find for a non-copying checksum is:
> Which is twice as fast as the one in csum-partial_64.c for
> most cpu types (especially pre-Haswell ones).

The perf data I presented was collected on an arm64 platform (hikey960), where
the do_csum implementation that gets called is written in C (lib/checksum.c),
not in assembly.
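
For context, a plain C checksum loop of the kind meant here might look like the
sketch below. It is a simplified RFC 1071 style illustration, not the code in
lib/checksum.c; the name csum16_sketch is hypothetical, and the buffer is read
as big-endian 16-bit words so the result does not depend on the host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's do_csum()): ones' complement
 * sum of a buffer read as big-endian 16-bit words, one word per loop
 * iteration, with no hand-tuned assembly behind it.
 */
static uint16_t csum16_sketch(const unsigned char *buff, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		sum += ((uint32_t)buff[i] << 8) | buff[i + 1];
		/* Fold the carry back in right away so sum cannot overflow. */
		sum = (sum & 0xffff) + (sum >> 16);
	}
	if (i < len)
		sum += (uint32_t)buff[i] << 8;	/* odd trailing byte, zero padded */

	/* Two folds are enough to bring the end-around carry into 16 bits. */
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

With the sample bytes from RFC 1071 (00 01 f2 03 f4 f5 f6 f7) this returns
0xddf2. Every byte passes through a load and an add on the CPU, which is the
per-byte work that shows up under do_csum in the profile.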

Thanks,
Yadu Kishore

On Mon, Mar 2, 2020 at 8:49 PM David Laight <David.Laight@...lab.com> wrote:
>
> From: Yadu Kishore
> > Sent: 02 March 2020 08:43
> >
> > On Mon, Mar 2, 2020 at 12:22 PM Yadu Kishore <kyk.segfault@...il.com> wrote:
> > >
> > > > > Can you contrast this against a run with your changes? The thought is
> > > > > that the majority of this cost is due to the memory loads and stores, not
> > > > > the arithmetic ops to compute the checksum. When enabling checksum
> > > > > offload, the same stalls will occur, but will simply be attributed to
> > > > > memcpy instead of to do_csum.
> > >
> > > > Agreed.
> > >
> > > Below is the data from perf with and without the patch for the same
> > > TCP Tx iperf run: (network driver has NETIF_F_HW_CSUM enabled)
> > >
> > A small correction to the data I sent earlier:
> >
> > Without patch :-
> >  ============
> > [Function = %cpu cycles]
> > skb_mac_gso_segment = 0.05
> > inet_gso_segment = 0.26
> > tcp4_gso_segment = 0.05
> > tcp_gso_segment = 0.17
> > skb_segment = 0.55
> > skb_copy_and_csum_bits = 0.60
> > do_csum = 7.43
> > memcpy = 3.81
> > __alloc_skb = 0.93
> > ==================
> > SUM = 13.85
> >
> > With patch :-
> > ============
> > [Function = %cpu cycles]
> > skb_mac_gso_segment = 0.05
> > inet_gso_segment = 0.34
> > tcp4_gso_segment = 0.06
> > tcp_gso_segment = 0.26
> > skb_segment = 0.55
> > ** skb_copy_bits = 0.62 **  <-- corrected
> > do_csum = 0.04
> > memcpy = 4.29
> > __alloc_skb = 0.73
> > ==================
> > ** SUM = 6.94 ** <-- corrected
>
> I was writing a reply about how 'horrid' the asm in csum_partial_copy_generic is.
> (My guess is no better than 3 clocks for 16 bytes - and a 'rep movs' copy
> could be a lot faster.)
> However, the big difference is in the times for do_csum().
>
> The fastest loop I can find for a non-copying checksum is:
> +       /*
> +        * Align the byte count to a multiple of 16 then
> +        * add 64 bit words to alternating registers.
> +        * Finally reduce to 64 bits.
> +        */
> +       asm(    "       bt    $4, %[len]\n"
> +               "       jnc   10f\n"
> +               "       add   (%[buff], %[len]), %[sum_0]\n"
> +               "       adc   8(%[buff], %[len]), %[sum_1]\n"
> +               "       lea   16(%[len]), %[len]\n"
> +               "10:    jecxz 20f\n"
> +               "       adc   (%[buff], %[len]), %[sum_0]\n"
> +               "       adc   8(%[buff], %[len]), %[sum_1]\n"
> +               "       lea   32(%[len]), %[len_tmp]\n"
> +               "       adc   16(%[buff], %[len]), %[sum_0]\n"
> +               "       adc   24(%[buff], %[len]), %[sum_1]\n"
> +               "       mov   %[len_tmp], %[len]\n"
> +               "       jmp   10b\n"
> +               "20:    adc   %[sum_0], %[sum]\n"
> +               "       adc   %[sum_1], %[sum]\n"
> +               "       adc   $0, %[sum]\n"
> +           : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
> +               [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
> +           : [buff] "r" (buff)
> +           : "memory" );
>
> (I need to resubmit that patch)
>
> Which is twice as fast as the one in csum-partial_64.c for
> most cpu types (especially pre-Haswell ones).
>
> IIRC that does 7 bytes/clock on my Ivy Bridge and
> 8 bytes/clock on Haswell.
> Nothing prior to sandy bridge can beat adding 32bit words
> to a 64bit register!
> The faffing with 'len_tmp' made a marginal difference.
>
> You can't go faster than 1 read/clock without using ad[oc]x.
> The loop above does work (both carry and overflow are preserved)
> but it needs further unrolling (and more edge code) and only
> runs on a few CPUs.
>
>         David
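
The remark about adding 32-bit words to a 64-bit register can be sketched in
portable C as below. This is only an illustration under stated assumptions: the
names fold64 and csum32_sketch are hypothetical, the sum is returned in host
byte order, and nothing here is taken from csum-partial_64.c or from the asm
loop quoted above. Because each 32-bit word is added into a 64-bit accumulator,
carries collect in the upper half and no add-with-carry is needed inside the
loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 64-bit accumulator down to a 16-bit ones' complement sum. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* at most 33 bits */
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* at most 32 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* at most 17 bits */
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Hypothetical helper: add 32-bit words into a 64-bit accumulator.
 * The accumulator cannot overflow until 2^32 words have been added,
 * so the loop needs no carry handling at all.
 */
static uint16_t csum32_sketch(const void *buff, size_t len)
{
	const unsigned char *p = buff;
	uint64_t sum = 0;
	uint32_t w;

	while (len >= 4) {
		memcpy(&w, p, 4);	/* unaligned-safe 32-bit load */
		sum += w;
		p += 4;
		len -= 4;
	}
	if (len) {
		w = 0;			/* zero-pad the 1..3 trailing bytes */
		memcpy(&w, p, len);
		sum += w;
	}
	return fold64(sum);
}
```

The folding is deferred to the end, so the loop body is one load and one add
per four bytes.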