Date: Tue, 3 Mar 2020 14:45:12 +0530
From: Yadu Kishore <kyk.segfault@...il.com>
To: David Laight <David.Laight@...lab.com>
Cc: David Miller <davem@...emloft.net>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>,
Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH v2] net: Make skb_segment not to compute checksum if
network controller supports checksumming
Hi David,
Thanks for the response.
> I was writing a reply about how 'horrid' the asm in csum_partial_copy_generic is.
> (My guess is no better than 3 clocks for 16 bytes - and a 'rep movs' copy
> could be a lot faster.)
> However the big difference is in the times for do_csum().
>
> The fastest loop I can find for a non-copying checksum is:
> Which is twice as fast as the one in csum-partial_64.c for
> most cpu types (especially pre-Haswell ones).
The perf data I presented was collected on an arm64 platform (hikey960), where
the do_csum implementation that gets called is written in C (lib/checksum.c),
not in assembly.
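To illustrate what a portable C checksum of this kind looks like, here is a
minimal sketch. It is not the code in lib/checksum.c; the names sketch_csum
and fold64 are made up for this example. It adds 32-bit words into a 64-bit
accumulator and folds the carries back in at the end, so the arithmetic per
word is a single add and the memory loads are the bulk of the work.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit accumulator down to a 16-bit ones' complement sum. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* absorb the carry */
	sum = (sum & 0xffffu) + (sum >> 16);
	sum = (sum & 0xffffu) + (sum >> 16);		/* absorb the carry */
	return (uint16_t)sum;
}

/*
 * Hypothetical helper: ones' complement sum of buf[0..len), treating the
 * data as big-endian 16-bit words and zero-padding an odd tail.
 * Bytes are assembled explicitly, so the result does not depend on host
 * byte order or on the alignment of buf.
 */
uint16_t sketch_csum(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i = 0;

	/* Main loop: one 32-bit word per iteration, carries pile up in
	 * the upper half of the 64-bit accumulator. */
	for (; i + 4 <= len; i += 4)
		sum += ((uint32_t)buf[i] << 24) |
		       ((uint32_t)buf[i + 1] << 16) |
		       ((uint32_t)buf[i + 2] << 8) |
		       (uint32_t)buf[i + 3];

	/* Tail: 1 to 3 leftover bytes, padded with zeros on the right. */
	if (i < len) {
		uint32_t tail = 0;
		int shift = 24;

		for (; i < len; i++, shift -= 8)
			tail |= (uint32_t)buf[i] << shift;
		sum += tail;
	}

	return fold64(sum);
}
```

Calling sketch_csum() on the 8 bytes 00 01 f2 03 f4 f5 f6 f7 gives 0xddf2.
The 64-bit accumulator cannot overflow for any realistic packet length, which
is why no carry handling is needed inside the loop.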
Thanks,
Yadu Kishore
On Mon, Mar 2, 2020 at 8:49 PM David Laight <David.Laight@...lab.com> wrote:
>
> From: Yadu Kishore
> > Sent: 02 March 2020 08:43
> >
> > On Mon, Mar 2, 2020 at 12:22 PM Yadu Kishore <kyk.segfault@...il.com> wrote:
> > >
> > > > > Can you contrast this against a run with your changes? The thought is
> > > > > that the majority of this cost is due to the memory loads and stores, not
> > > > > the arithmetic ops to compute the checksum. When enabling checksum
> > > > > offload, the same stalls will occur, but will simply be attributed to
> > > > > memcpy instead of to do_csum.
> > >
> > > > Agreed.
> > >
> > > Below is the data from perf with and without the patch for the same
> > > TCP Tx iperf run: (network driver has NETIF_F_HW_CSUM enabled)
> > >
> > A small correction to the data I sent earlier:
> >
> > Without patch :-
> > ============
> > [Function = %cpu cycles]
> > skb_mac_gso_segment = 0.05
> > inet_gso_segment = 0.26
> > tcp4_gso_segment = 0.05
> > tcp_gso_segment = 0.17
> > skb_segment = 0.55
> > skb_copy_and_csum_bits = 0.60
> > do_csum = 7.43
> > memcpy = 3.81
> > __alloc_skb = 0.93
> > ==================
> > SUM = 13.85
> >
> > With patch :-
> > ============
> > [Function = %cpu cycles]
> > skb_mac_gso_segment = 0.05
> > inet_gso_segment = 0.34
> > tcp4_gso_segment = 0.06
> > tcp_gso_segment = 0.26
> > skb_segment = 0.55
> > ** skb_copy_bits = 0.62 ** <-- corrected
> > do_csum = 0.04
> > memcpy = 4.29
> > __alloc_skb = 0.73
> > ==================
> > ** SUM = 6.94 ** <-- corrected
>
> I was writing a reply about how 'horrid' the asm in csum_partial_copy_generic is.
> (My guess is no better than 3 clocks for 16 bytes - and a 'rep movs' copy
> could be a lot faster.)
> However the big difference is in the times for do_csum().
>
> The fastest loop I can find for a non-copying checksum is:
> + /*
> + * Align the byte count to a multiple of 16 then
> + * add 64 bit words to alternating registers.
> + * Finally reduce to 64 bits.
> + */
> + asm( " bt $4, %[len]\n"
> + " jnc 10f\n"
> + " add (%[buff], %[len]), %[sum_0]\n"
> + " adc 8(%[buff], %[len]), %[sum_1]\n"
> + " lea 16(%[len]), %[len]\n"
> + "10: jecxz 20f\n"
> + " adc (%[buff], %[len]), %[sum_0]\n"
> + " adc 8(%[buff], %[len]), %[sum_1]\n"
> + " lea 32(%[len]), %[len_tmp]\n"
> + " adc 16(%[buff], %[len]), %[sum_0]\n"
> + " adc 24(%[buff], %[len]), %[sum_1]\n"
> + " mov %[len_tmp], %[len]\n"
> + " jmp 10b\n"
> + "20: adc %[sum_0], %[sum]\n"
> + " adc %[sum_1], %[sum]\n"
> + " adc $0, %[sum]\n"
> + : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
> + [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
> + : [buff] "r" (buff)
> + : "memory" );
>
> (I need to resubmit that patch)
>
> Which is twice as fast as the one in csum-partial_64.c for
> most cpu types (especially pre-Haswell ones).
>
> IIRC that does 7 bytes/clock on my ivy bridge and
> 8 bytes/clock on Haswell.
> Nothing prior to sandy bridge can beat adding 32bit words
> to a 64bit register!
> The faffing with 'len_tmp' made a marginal difference.
>
> You can't go faster than 1 read/clock without using ad[oc]x.
> The loop above does work (both carry and overflow are preserved)
> but it needs further unrolling (and more edge code) and only
> runs on a few cpu.
>
> David