Message-ID: <88d43a6e2a7b497b9b1e448bf7ae068a@AcuMS.aculab.com>
Date: Mon, 23 Mar 2020 09:15:45 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Tom Herbert' <tom@...bertland.com>
CC: Eric Dumazet <eric.dumazet@...il.com>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>,
Yadu Kishore <kyk.segfault@...il.com>,
David Miller <davem@...emloft.net>,
Network Development <netdev@...r.kernel.org>
Subject: RE: [PATCH v2] net: Make skb_segment not to compute checksum if
network controller supports checksumming
From: Tom Herbert
> Sent: 22 March 2020 19:54
...
> A while back, I had proposed an optimized x86 checksum function using
> unrolled addq: https://lwn.net/Articles/679137/. Also, this tries to
> optimize for small checksums, like over a header when doing
> skb_postpull_rcsum.
See the patch I submitted a few weeks back (attached); the loop body is:
+ asm( " bt $4, %[len]\n"			/* len is a negative offset from the end: 16-byte chunk left over? */
+ " jnc 10f\n"
+ " add (%[buff], %[len]), %[sum_0]\n"		/* 'add' starts the carry chain */
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 16(%[len]), %[len]\n"			/* advance; lea leaves CF alone */
+ "10: jecxz 20f\n"				/* count reached 0: exit, flags untouched */
+ " adc (%[buff], %[len]), %[sum_0]\n"		/* 32 bytes/iteration into */
+ " adc 8(%[buff], %[len]), %[sum_1]\n"		/* alternating accumulators */
+ " lea 32(%[len]), %[len_tmp]\n"
+ " adc 16(%[buff], %[len]), %[sum_0]\n"
+ " adc 24(%[buff], %[len]), %[sum_1]\n"
+ " mov %[len_tmp], %[len]\n"			/* mov leaves CF alone too */
+ " jmp 10b\n"
+ "20: adc %[sum_0], %[sum]\n"			/* fold both accumulators and */
+ " adc %[sum_1], %[sum]\n"			/* the final carry into sum */
+ " adc $0, %[sum]\n"
+ : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+ [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+ : [buff] "r" (buff)
+ : "memory" );
You don't need a heavily unrolled loop - just one that doesn't
have dependencies on parts of the flags register
(in particular, don't require 'dec' to preserve carry;
mixing the two stalls older chips).
You also need to 'adc' into alternating destinations for some chips.
With the fast 'rep movs' on modern CPUs I suspect that trying
to do the copy at the same time as the checksum is a loss.
An 'adc' from memory is one instruction and 2 uops,
fairly easy to execute at 1/clock.
IIRC on Haswell you don't need to unroll the loop at all
(you can use 'dec' and 'adc').
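For illustration, a minimal sketch of that non-unrolled loop (the
function name and the "n >= 1 whole 8-byte words" assumption are mine,
not from the attached patch): 'lea' and 'dec' both leave the carry
flag alone, so on chips without the partial-flags stall the 'adc'
chain survives the loop control:

static unsigned long csum_adc_loop(const unsigned long *buff,
				   unsigned long n)	/* n >= 1 words */
{
	unsigned long sum = 0;

	asm(	"	clc\n"				/* start with CF clear */
		"10:	adc	(%[buff]), %[sum]\n"
		"	lea	8(%[buff]), %[buff]\n"	/* no flags written */
		"	dec	%[n]\n"			/* leaves CF alone */
		"	jnz	10b\n"
		"	adc	$0, %[sum]\n"		/* fold the last carry */
		: [sum] "+r" (sum), [buff] "+r" (buff), [n] "+r" (n)
		:
		: "memory", "cc");

	return sum;
}

The sum_0/sum_1 pair in the attached patch is the 'alternating
destinations' version of the same idea, for the chips that need it.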
A checksum+copy loop is load+adc+store, which is 3 instructions
and 5 uops - that starts hitting the limits on uops/clock
and retirement, so it won't run in 1 clock and the loop
overhead is always extra.
Whereas a 'rep movs' is likely to move 1 cache line/clock.
ISTR 'rep movsq' is faster than 8 bytes/clock on my Ivy Bridge.
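The split version is then trivial - a sketch, with my names and a
"len is a whole number of 8-byte words" shortcut: let memcpy take the
fast 'rep movs' path and checksum the freshly written, L1-hot
destination afterwards:

#include <string.h>

static unsigned long csum_and_copy_split(void *dst, const void *src,
					 unsigned long len)
{
	memcpy(dst, src, len);			/* 'rep movs' class copy */
	return csum_adc_loop(dst, len / 8);	/* sketch from above */
}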
On Haswell you can approach 16 bytes/clock using ad[oc]x.
But it is painful, and I'm not sure that all CPUs that support
those instructions actually execute them quickly.
If you can't use adc (e.g. in C, or on MIPS) there are two
obvious solutions (both sketched in C below):
1) Add 32-bit words to a 64-bit register.
On x86 this is as fast as adc on everything prior to Sandy Bridge!
2) Add 64-bit words to a 64-bit reg and (word >> 32) to a second.
Fixup for the missing carry (check that the high bit of the main
sum matches the relevant bit of the shifted sum).
This could be faster than the horrid arm64 asm code:
1 memory load, 1 add and 1 'shift + add' per 8 bytes.
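A rough C sketch of both (function names are mine; it assumes a whole
number of words, fewer than 2^32 of them, and glosses over byte-order
details - and the fixup recovers the lost carries by subtraction,
which is arithmetic equivalent to the high-bit compare described
above):

#include <stdint.h>
#include <stddef.h>

/* End-around-carry fold: 2^16 == 1 (mod 0xffff), so the high parts
 * of the accumulator can simply be added back in. */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffULL) + (sum >> 16);
	sum = (sum & 0xffffULL) + (sum >> 16);
	return (uint16_t)sum;
}

/* 1) 32-bit words into a 64-bit register: the carries pile up in
 *    the high half instead of being lost. */
static uint16_t csum_32in64(const uint32_t *buff, size_t nwords)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++)
		sum += buff[i];
	return csum_fold64(sum);
}

/* 2) 64-bit words into one register, (word >> 32) into a second:
 *    1 load, 1 add and 1 'shift + add' per 8 bytes. */
static uint16_t csum_64plus_shift(const uint64_t *buff, size_t nwords)
{
	uint64_t sum = 0, hi = 0, low;
	size_t i;

	for (i = 0; i < nwords; i++) {
		uint64_t w = buff[i];

		sum += w;	/* carries out of bit 63 are lost... */
		hi += w >> 32;	/* ...but recoverable from this sum */
	}
	/* Subtracting hi << 32 cancels the high halves (and, mod 2^64,
	 * the lost carries), leaving the exact sum of the low halves. */
	low = sum - (hi << 32);

	/* total = low + hi * 2^32, and 2^32 == 1 (mod 0xffff). */
	return csum_fold64((low & 0xffffffffULL) + (low >> 32) + hi);
}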
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
[Attachment: "0001-Optimise-x86-IP-checksum-code.patch", 7395 bytes]