netdev - RE: [PATCH v2 net-next] net: Implement fast csum_partial for x86


Message-ID: <063D6719AE5E284EB5DD2968C1650D6D1CCBE614@AcuExch.aculab.com>
Date:	Wed, 6 Jan 2016 14:49:14 +0000
From:	David Laight <David.Laight@...LAB.COM>
To:	'Eric Dumazet' <eric.dumazet@...il.com>
CC:	Tom Herbert <tom@...bertland.com>,
	"davem@...emloft.net" <davem@...emloft.net>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"kernel-team@...com" <kernel-team@...com>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"mingo@...hat.com" <mingo@...hat.com>,
	"hpa@...or.com" <hpa@...or.com>, "x86@...nel.org" <x86@...nel.org>
Subject: RE: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

From: Eric Dumazet
> Sent: 06 January 2016 14:25
> On Wed, 2016-01-06 at 10:16 +0000, David Laight wrote:
> > From: Eric Dumazet
> > > Sent: 05 January 2016 22:19
> > > To: Tom Herbert
> > > You might add a comment telling the '4' comes from length of 'adcq
> > > 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate that
> > > 'adcq    0*8(%rdi),%rax' is using 3 bytes instead.
> > >
> > > We also could use .byte 0x48, 0x13, 0x47, 0x00 to force a 4 bytes
> > > instruction and remove the nop.
> > >
> > >
> > > +       lea     20f(, %rcx, 4), %r11
> > > +       clc
> > > +       jmp     *%r11
> > > +
> > > +.align 8
> > > +       adcq    6*8(%rdi),%rax
> > > +       adcq    5*8(%rdi),%rax
> > > +       adcq    4*8(%rdi),%rax
> > > +       adcq    3*8(%rdi),%rax
> > > +       adcq    2*8(%rdi),%rax
> > > +       adcq    1*8(%rdi),%rax
> > > +       adcq    0*8(%rdi),%rax // could force a 4 byte instruction (.byte 0x48, 0x13, 0x47, 0x00)
> > > +       nop
> > > +20:    /* #quads % 8 jump table base */
> >
> > Or move label at the top (after the .align) and adjust the maths.
> > You could add a second label after the first adcq and use the
> > difference between them for the '4'.
> 
> Not really.
> 
> We could not use the trick if the length was 5.
> 
> Only 1, 2, 4 and 8 are supported.

Indeed, and 'lea  20f(, %rcx, 5), %r11' will generate an error from the
assembler.
Seems appropriate to get the assembler to verify this for you.
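As a rough illustration of the arithmetic being discussed (not code from the patch), the entry offset into the adcq chain can be sketched in C. The names ADCQ_LEN, CHAIN_LEN and chain_entry_offset are hypothetical, the label is assumed to sit at the top of the chain as suggested above, and the static assertion stands in for the check the assembler does on the scale factor:

```c
#include <stddef.h>

/* Assumed values, taken from the quoted patch: seven 'adcq' instructions,
 * each expected to occupy 4 bytes. Both names are illustrative only. */
#define ADCQ_LEN  4u
#define CHAIN_LEN 7u

/* x86 scaled-index addressing only accepts a scale of 1, 2, 4 or 8,
 * so the instruction length has to be one of those values. */
_Static_assert(ADCQ_LEN == 1 || ADCQ_LEN == 2 ||
               ADCQ_LEN == 4 || ADCQ_LEN == 8,
               "instruction length must be a valid SIB scale");

/* Byte offset from a label at the top of the chain at which to enter
 * so that exactly nquads (1..7) adcq instructions execute. */
static size_t chain_entry_offset(unsigned int nquads)
{
    return (size_t)(CHAIN_LEN - nquads) * ADCQ_LEN;
}
```

With the label at the top, only the instructions before each entry point need to be exactly 4 bytes long, so the shorter final instruction does not affect any of the entry offsets for 1 to 7 quads.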

Assuming this code block is completely skipped for aligned lengths
the nop isn't needed provided the '20:' label is at the right place.
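For readers less used to computed jumps, the same "enter the chain part-way" idea can be sketched in C with a fall-through switch. This is only an analogue under stated assumptions: add64_with_carry and csum_tail_quads are hypothetical names, and the carry is folded in after every add rather than carried through the flags, which gives a result that is equivalent modulo 2^64-1 but is not the instruction sequence in the patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Add with end-around carry: if the 64-bit add wraps, add the carry
 * back in. The second add cannot wrap again. */
static uint64_t add64_with_carry(uint64_t sum, uint64_t val)
{
    sum += val;
    return sum + (sum < val);
}

/* Sum the (nquads % 8) quads at p into sum by jumping into the
 * middle of an unrolled chain; each case falls through to the next. */
static uint64_t csum_tail_quads(const uint64_t *p, unsigned int nquads,
                                uint64_t sum)
{
    switch (nquads & 7) {
    case 7: sum = add64_with_carry(sum, p[6]); /* fall through */
    case 6: sum = add64_with_carry(sum, p[5]); /* fall through */
    case 5: sum = add64_with_carry(sum, p[4]); /* fall through */
    case 4: sum = add64_with_carry(sum, p[3]); /* fall through */
    case 3: sum = add64_with_carry(sum, p[2]); /* fall through */
    case 2: sum = add64_with_carry(sum, p[1]); /* fall through */
    case 1: sum = add64_with_carry(sum, p[0]); /* fall through */
    case 0: break; /* aligned length: nothing to add */
    }
    return sum;
}
```

The case 0 arm corresponds to the aligned-length situation mentioned above, where the whole block is skipped.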

Someone also pointed out that the code is memory limited (dual add
chains making no difference), so why is it unrolled at all?

OTOH I'm sure I remember something about false dependencies on the
x86 flags register because of instructions only changing some bits.
So it might be that you can't (or couldn't) get concurrency between
instructions that update different parts of the flags register.

	David