Date:	Wed, 6 Jan 2016 14:49:14 +0000
From:	David Laight <David.Laight@...LAB.COM>
To:	'Eric Dumazet' <eric.dumazet@...il.com>
CC:	Tom Herbert <tom@...bertland.com>,
	"davem@...emloft.net" <davem@...emloft.net>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"kernel-team@...com" <kernel-team@...com>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"mingo@...hat.com" <mingo@...hat.com>,
	"hpa@...or.com" <hpa@...or.com>, "x86@...nel.org" <x86@...nel.org>
Subject: RE: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

From: Eric Dumazet
> Sent: 06 January 2016 14:25
> On Wed, 2016-01-06 at 10:16 +0000, David Laight wrote:
> > From: Eric Dumazet
> > > Sent: 05 January 2016 22:19
> > > To: Tom Herbert
> > > You might add a comment telling the '4' comes from length of 'adcq
> > > 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate that
> > > 'adcq    0*8(%rdi),%rax' is using 3 bytes instead.
> > >
> > > We also could use .byte 0x48, 0x13, 0x47, 0x00 to force a 4 bytes
> > > instruction and remove the nop.
> > >
> > >
> > > +       lea     20f(, %rcx, 4), %r11
> > > +       clc
> > > +       jmp     *%r11
> > > +
> > > +.align 8
> > > +       adcq    6*8(%rdi),%rax
> > > +       adcq    5*8(%rdi),%rax
> > > +       adcq    4*8(%rdi),%rax
> > > +       adcq    3*8(%rdi),%rax
> > > +       adcq    2*8(%rdi),%rax
> > > +       adcq    1*8(%rdi),%rax
> > > +       adcq    0*8(%rdi),%rax // could force a 4 byte instruction (.byte 0x48, 0x13, 0x47, 0x00)
> > > +       nop
> > > +20:    /* #quads % 8 jump table base */
> >
> > Or move the label to the top (after the .align) and adjust the maths.
> > You could add a second label after the first adcq and use the
> > difference between them for the '4'.
> 
> Not really.
> 
> We could not use the trick if the length was 5.
> 
> Only 1, 2, 4 and 8 are supported.

Indeed, and 'lea  20f(, %rcx, 5), %r11' will generate an error from the
assembler.
Seems appropriate to get the assembler to verify this for you.

Assuming this code block is completely skipped for aligned lengths,
the nop isn't needed provided the '20:' label is in the right place.
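
To make the label arithmetic concrete, here is a rough C model of the layout
(not kernel code). It assumes the base label is moved to the top of the block
and the jump target is computed as top + stride * (7 - n) for n remaining
quads; the function names and the size table are made up for illustration,
and only the 4-byte and 3-byte instruction lengths come from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Encoded sizes of the seven adcq instructions in program order.
 * Six carry an 8-bit displacement (4 bytes); the last one,
 * 'adcq 0*8(%rdi),%rax', needs no displacement (3 bytes). */
static const unsigned adcq_size[7] = { 4, 4, 4, 4, 4, 4, 3 };

/* An x86 index register can only be scaled by 1, 2, 4 or 8. */
static int scale_is_encodable(unsigned scale)
{
    return scale == 1 || scale == 2 || scale == 4 || scale == 8;
}

/* Byte offset from the top of the block to instruction i (i == 7: block end). */
static unsigned insn_offset(size_t i)
{
    unsigned off = 0;
    assert(i <= 7);
    for (size_t k = 0; k < i; k++)
        off += adcq_size[k];
    return off;
}

/* Does 'off' fall on the start of an instruction or on the exact block end? */
static int is_boundary(unsigned off)
{
    for (size_t i = 0; i <= 7; i++)
        if (insn_offset(i) == off)
            return 1;
    return 0;
}

/* Hypothetical target with the base label at the top: skip (7 - n) slots. */
static unsigned target_from_top(unsigned n, unsigned stride)
{
    assert(n <= 7);
    return (7 - n) * stride;
}
```

With a stride of 4, every n from 1 to 7 lands on an instruction start. Only
n == 0 would land one byte past the 3-byte instruction, which is the case the
nop pads for and the case that goes away if the block is skipped for aligned
lengths.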

Someone also pointed out that the code is memory limited (dual add
chains making no difference), so why is it unrolled at all?
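
For comparison, a minimal C sketch of the two shapes being discussed: a plain
loop and a fall-through switch that mirrors jumping into the middle of the
adcq run. The helper names are invented here, and the carry is folded back
after every add rather than carried in the flags, so this only models the
arithmetic, not the performance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit add with end-around carry: a carry out of bit 63 is added back in. */
static uint64_t add_carry64(uint64_t sum, uint64_t v)
{
    sum += v;
    return sum + (sum < v);
}

/* Rolled version: one add per iteration, highest quad first. */
static uint64_t csum_quads_rolled(const uint64_t *p, size_t n, uint64_t sum)
{
    for (size_t i = n; i-- > 0; )
        sum = add_carry64(sum, p[i]);
    return sum;
}

/* Unrolled version: entering the switch at 'case n' plays the role of
 * the computed jump into the run of adcq instructions. Requires n < 8. */
static uint64_t csum_quads_switch(const uint64_t *p, size_t n, uint64_t sum)
{
    assert(n < 8);
    switch (n) {
    case 7: sum = add_carry64(sum, p[6]); /* fall through */
    case 6: sum = add_carry64(sum, p[5]); /* fall through */
    case 5: sum = add_carry64(sum, p[4]); /* fall through */
    case 4: sum = add_carry64(sum, p[3]); /* fall through */
    case 3: sum = add_carry64(sum, p[2]); /* fall through */
    case 2: sum = add_carry64(sum, p[1]); /* fall through */
    case 1: sum = add_carry64(sum, p[0]); /* fall through */
    case 0: break;
    }
    return sum;
}
```

Both produce the same sum for any n below 8, so the only thing the unrolling
can buy is fewer loop-control instructions per add.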

OTOH I'm sure I remember something about false dependencies on the
x86 flags register because of instructions only changing some bits.
So it might be that you can't (or couldn't) get concurrency between
instructions that update different parts of the flags register.

	David

