Date: Thu, 3 Mar 2016 16:12:16 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Tom Herbert' <tom@...bertland.com>,
"davem@...emloft.net" <davem@...emloft.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
CC: "torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
"hpa@...or.com" <hpa@...or.com>, "x86@...nel.org" <x86@...nel.org>,
"kernel-team@...com" <kernel-team@...com>
Subject: RE: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64
From: Tom Herbert
> Sent: 02 March 2016 22:19
...
> + /* Main loop using 64byte blocks */
> + for (; len > 64; len -= 64, buff += 64) {
> + asm("addq 0*8(%[src]),%[res]\n\t"
> + "adcq 1*8(%[src]),%[res]\n\t"
> + "adcq 2*8(%[src]),%[res]\n\t"
> + "adcq 3*8(%[src]),%[res]\n\t"
> + "adcq 4*8(%[src]),%[res]\n\t"
> + "adcq 5*8(%[src]),%[res]\n\t"
> + "adcq 6*8(%[src]),%[res]\n\t"
> + "adcq 7*8(%[src]),%[res]\n\t"
> + "adcq $0,%[res]"
> + : [res] "=r" (result)
> + : [src] "r" (buff),
> + "[res]" (result));
Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
without any unrolling?
...
> + /* Sum over any remaining bytes (< 8 of them) */
> + if (len & 0x7) {
> + unsigned long val;
> + /*
> + * Since "len" is > 8 here we backtrack in the buffer to load
> + * the outstanding bytes into the low order bytes of a quad and
> + * then shift to extract the relevant bytes. By doing this we
> + * avoid additional calls to load_unaligned_zeropad.
That comment is wrong. Maybe:
* Read the last 8 bytes of the buffer then shift to extract
* the required bytes.
* This is safe because the original length was > 8, so the read
* cannot go beyond the end of the valid data.
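A minimal sketch of the tail read suggested above (hypothetical helper; little-endian x86_64 assumed, with memcpy standing in for the kernel's unaligned-load helpers): because the total length was > 8, the last 8 bytes of the buffer are all valid, and a shift discards the bytes the quad-word loop already summed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Extract the trailing (len & 7) bytes of the buffer without
 * reading past buff + len.  Requires len > 8 and (len & 7) != 0. */
static uint64_t tail_bytes(const unsigned char *buff, size_t len)
{
	size_t rem = len & 7;	/* outstanding bytes, 1..7 */
	uint64_t val;

	/* 8-byte load ending exactly at buff + len: overlaps bytes
	 * already summed, but never touches invalid memory. */
	memcpy(&val, buff + len - 8, 8);

	/* Shift out the low (8 - rem) bytes already covered by the
	 * quad-word loop (little-endian byte order assumed). */
	return val >> (8 * (8 - rem));
}
```

This avoids both a load_unaligned_zeropad-style fixup and any read beyond the end of the data.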
David