Message-ID: <063D6719AE5E284EB5DD2968C1650D6D1CCBE614@AcuExch.aculab.com>
Date: Wed, 6 Jan 2016 14:49:14 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Eric Dumazet' <eric.dumazet@...il.com>
CC: Tom Herbert <tom@...bertland.com>,
"davem@...emloft.net" <davem@...emloft.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"kernel-team@...com" <kernel-team@...com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
"hpa@...or.com" <hpa@...or.com>, "x86@...nel.org" <x86@...nel.org>
Subject: RE: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64
From: Eric Dumazet
> Sent: 06 January 2016 14:25
> On Wed, 2016-01-06 at 10:16 +0000, David Laight wrote:
> > From: Eric Dumazet
> > > Sent: 05 January 2016 22:19
> > > To: Tom Herbert
> > > You might add a comment telling the '4' comes from length of 'adcq
> > > 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate that
> > > 'adcq 0*8(%rdi),%rax' is using 3 bytes instead.
> > >
> > > We also could use .byte 0x48, 0x13, 0x47, 0x00 to force a 4 bytes
> > > instruction and remove the nop.
> > >
> > >
> > > + lea 20f(, %rcx, 4), %r11
> > > + clc
> > > + jmp *%r11
> > > +
> > > +.align 8
> > > + adcq 6*8(%rdi),%rax
> > > + adcq 5*8(%rdi),%rax
> > > + adcq 4*8(%rdi),%rax
> > > + adcq 3*8(%rdi),%rax
> > > + adcq 2*8(%rdi),%rax
> > > + adcq 1*8(%rdi),%rax
> > > + adcq 0*8(%rdi),%rax // could force a 4 byte instruction (.byte 0x48, 0x13, 0x47, 0x00)
> > > + nop
> > > +20: /* #quads % 8 jump table base */
> >
> > Or move label at the top (after the .align) and adjust the maths.
> > You could add a second label after the first adcq and use the
> > difference between them for the '4'.
>
> Not really.
>
> We could not use the trick if the length was 5.
>
> Only 1, 2, 4 and 8 are supported.
Indeed, and 'lea 20f(, %rcx, 5), %r11' will generate an error from the
assembler.
That seems like a good reason to derive the '4' from the labels: the
assembler then verifies the instruction length for you.
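
For reference, the limit comes from the instruction encoding: the scale
of an indexed address is stored in a two-bit field of the SIB byte, so
only four factors exist. A minimal C sketch of that mapping (the function
name is made up for illustration, it is not an assembler or kernel API):

```c
/*
 * Hypothetical helper: return the 2-bit SIB scale field (0..3) for a
 * scale factor, or -1 if the factor cannot be encoded at all.
 */
static int sib_scale_field(unsigned int scale)
{
	switch (scale) {
	case 1: return 0;
	case 2: return 1;
	case 4: return 2;
	case 8: return 3;
	default: return -1;	/* e.g. 5: no encoding exists */
	}
}
```

So a per-instruction length of 4 maps to field value 2, while a length
of 5 has nothing to map to, which is why the assembler has to reject it.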
Assuming this code block is skipped entirely for aligned lengths, the
nop isn't needed, provided the '20:' label is in the right place.
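
The placement argument can be checked with a little arithmetic. The
sketch below is a C model, not part of the patch: it uses the instruction
lengths quoted above (4 bytes for each adcq with a displacement, 3 bytes
for 'adcq 0*8(%rdi),%rax') and compares where each jump lands with where
the last n adcq instructions start. All function names are hypothetical,
and it assumes the index register effectively supplies minus n when the
label follows the block.

```c
#include <assert.h>

enum { N_ADCQ = 7 };

/* Encoded lengths in program order: adcq 6*8(%rdi) ... adcq 0*8(%rdi). */
static const int adcq_len[N_ADCQ] = { 4, 4, 4, 4, 4, 4, 3 };

/* Byte offset of instruction i from the first adcq (i == N_ADCQ: end). */
static int insn_offset(int i)
{
	int off = 0;

	for (int k = 0; k < i && k < N_ADCQ; k++)
		off += adcq_len[k];
	return off;
}

/* Where the last n adcq instructions begin, i.e. where we must land. */
static int wanted_offset(int n)
{
	return insn_offset(N_ADCQ - n);
}

/* Label after the block, 'pad' bytes of nop before it: label - 4*n. */
static int target_label_at_end(int n, int pad)
{
	return insn_offset(N_ADCQ) + pad - 4 * n;
}

/* Label at the first adcq, no padding: label + 4*(7 - n). */
static int target_label_at_top(int n)
{
	return 4 * (N_ADCQ - n);
}

static void check_layouts(void)
{
	for (int n = 1; n <= N_ADCQ; n++) {
		/* Label at the end only works with the 1-byte nop. */
		assert(target_label_at_end(n, 1) == wanted_offset(n));
		assert(target_label_at_end(n, 0) != wanted_offset(n));
		/* Label at the top works with no nop at all. */
		assert(target_label_at_top(n) == wanted_offset(n));
	}
	/* n == 0 is the one case the top label gets wrong (28 vs 27). */
	assert(target_label_at_top(0) != wanted_offset(0));
}
```

In this model the label at the top only misses for n == 0, which is the
case that has to be skipped before the jump for the nop to go away.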
Someone also pointed out that the code is memory limited (dual add
chains making no difference), so why is it unrolled at all?

OTOH I'm sure I remember something about false dependencies on the
x86 flags register because of instructions only changing some bits.
So it might be that you can't (or couldn't) get concurrency between
instructions that update different parts of the flags register.

David