Message-ID: <945f6cafc86b4f1bb18fa40e60d5c113@AcuMS.aculab.com>
Date: Mon, 2 Mar 2020 15:19:29 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Yadu Kishore' <kyk.segfault@...il.com>,
David Miller <davem@...emloft.net>
CC: Willem de Bruijn <willemdebruijn.kernel@...il.com>,
Network Development <netdev@...r.kernel.org>
Subject: RE: [PATCH v2] net: Make skb_segment not to compute checksum if
network controller supports checksumming
From: Yadu Kishore
> Sent: 02 March 2020 08:43
>
> On Mon, Mar 2, 2020 at 12:22 PM Yadu Kishore <kyk.segfault@...il.com> wrote:
> >
> > > > Can you contrast this against a run with your changes? The thought is
> > > > that the majority of this cost is due to the memory loads and stores, not
> > > > the arithmetic ops to compute the checksum. When enabling checksum
> > > > offload, the same stalls will occur, but will simply be attributed to
> > > > memcpy instead of to do_csum.
> >
> > > Agreed.
> >
> > Below is the data from perf with and without the patch for the same
> > TCP Tx iperf run: (network driver has NETIF_F_HW_CSUM enabled)
> >
> A small correction to the data I sent earlier:
>
> Without patch :-
> ============
> [Function = %cpu cycles]
> skb_mac_gso_segment = 0.05
> inet_gso_segment = 0.26
> tcp4_gso_segment = 0.05
> tcp_gso_segment = 0.17
> skb_segment = 0.55
> skb_copy_and_csum_bits = 0.60
> do_csum = 7.43
> memcpy = 3.81
> __alloc_skb = 0.93
> ==================
> SUM = 13.85
>
> With patch :-
> ============
> [Function = %cpu cycles]
> skb_mac_gso_segment = 0.05
> inet_gso_segment = 0.34
> tcp4_gso_segment = 0.06
> tcp_gso_segment = 0.26
> skb_segment = 0.55
> ** skb_copy_bits = 0.62 ** <-- corrected
> do_csum = 0.04
> memcpy = 4.29
> __alloc_skb = 0.73
> ==================
> ** SUM = 6.94 ** <-- corrected
I was writing a reply about how 'horrid' the asm in csum_partial_copy_generic is.
(My guess is no better than 3 clocks for 16 bytes - and a 'rep movs' copy
could be a lot faster.)
However, the big difference is in the times for do_csum().
The fastest loop I can find for a non-copying checksum is:
+ /*
+ * Align the byte count to a multiple of 16 then
+ * add 64 bit words to alternating registers.
+ * Finally reduce to 64 bits.
+ */
+ asm( " bt $4, %[len]\n"
+ " jnc 10f\n"
+ " add (%[buff], %[len]), %[sum_0]\n"
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 16(%[len]), %[len]\n"
+ "10: jecxz 20f\n"
+ " adc (%[buff], %[len]), %[sum_0]\n"
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 32(%[len]), %[len_tmp]\n"
+ " adc 16(%[buff], %[len]), %[sum_0]\n"
+ " adc 24(%[buff], %[len]), %[sum_1]\n"
+ " mov %[len_tmp], %[len]\n"
+ " jmp 10b\n"
+ "20: adc %[sum_0], %[sum]\n"
+ " adc %[sum_1], %[sum]\n"
+ " adc $0, %[sum]\n"
+ : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+ [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+ : [buff] "r" (buff)
+ : "memory" );
(I need to resubmit that patch)
Which is twice as fast as the one in csum-partial_64.c for
most cpu types (especially pre-Haswell ones).
IIRC that does 7 bytes/clock on my Ivy Bridge and
8 bytes/clock on Haswell.
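For readers who do not read x86 asm, the idea of the loop can be sketched in portable C. This is only an illustration, not a translation of the patch: the helper names (add64_carry, csum_words64, csum_fold64) are made up here, the length is assumed to be a multiple of 16, and each accumulator wraps its own carry instead of chaining it through adc. A compiler is unlikely to turn it into the adc chain above.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: ones' complement add, carry wrapped back into bit 0. */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
	a += b;
	return a + (a < b);	/* (a < b) is 1 exactly when the add overflowed */
}

/*
 * Hypothetical helper: add 64 bit words to alternating accumulators,
 * then reduce to 64 bits. Assumes len is a multiple of 16.
 */
static uint64_t csum_words64(const unsigned char *buff, size_t len)
{
	uint64_t sum_0 = 0, sum_1 = 0, w0, w1;
	size_t i;

	for (i = 0; i + 16 <= len; i += 16) {
		memcpy(&w0, buff + i, 8);	/* unaligned-safe loads */
		memcpy(&w1, buff + i + 8, 8);
		sum_0 = add64_carry(sum_0, w0);
		sum_1 = add64_carry(sum_1, w1);
	}
	return add64_carry(sum_0, sum_1);
}

/* Hypothetical helper: fold a 64 bit ones' complement sum to 16 bits. */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

The two accumulators give the cpu two independent dependency chains, which is the point of alternating registers in the asm.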
Nothing prior to Sandy Bridge can beat adding 32-bit words
to a 64-bit register!
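A rough C sketch of that older approach, for comparison. The name csum_words32 is hypothetical and the length is assumed to be a multiple of 4. Because each word is zero-extended into a 64-bit accumulator, the carries pile up in the high half and no adc is needed inside the loop; they are folded back in once at the end.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: sum 32 bit words into a 64 bit register.
 * Assumes len is a multiple of 4. The accumulator cannot overflow
 * for fewer than 2^32 words, i.e. buffers under 16 GiB.
 */
static uint32_t csum_words32(const unsigned char *buff, size_t len)
{
	uint64_t sum = 0;
	uint32_t w;
	size_t i;

	for (i = 0; i + 4 <= len; i += 4) {
		memcpy(&w, buff + i, 4);	/* unaligned-safe load */
		sum += w;			/* plain add, no carry handling */
	}
	/* Fold the accumulated carries back into the low 32 bits. */
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	return (uint32_t)sum;
}
```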
The faffing with 'len_tmp' made a marginal difference.
You can't go faster than 1 read/clock without using ad[oc]x.
The loop structure above does work with those (lea and jecxz leave both
the carry and overflow flags intact), but it needs further unrolling
(and more edge code) and only runs on a few CPUs.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)