Date:	Thu, 4 Feb 2016 08:58:27 -0800
From:	Tom Herbert <tom@...bertland.com>
To:	Alexander Duyck <alexander.duyck@...il.com>
Cc:	David Laight <David.Laight@...lab.com>,
	"davem@...emloft.net" <davem@...emloft.net>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"mingo@...hat.com" <mingo@...hat.com>,
	"hpa@...or.com" <hpa@...or.com>, "x86@...nel.org" <x86@...nel.org>,
	"kernel-team@...com" <kernel-team@...com>
Subject: Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

On Thu, Feb 4, 2016 at 8:51 AM, Alexander Duyck
<alexander.duyck@...il.com> wrote:
> On Thu, Feb 4, 2016 at 3:08 AM, David Laight <David.Laight@...lab.com> wrote:
>> From: Tom Herbert
>>> Sent: 03 February 2016 19:19
>> ...
>>> +     /* Main loop */
>>> +50:  adcq    0*8(%rdi),%rax
>>> +     adcq    1*8(%rdi),%rax
>>> +     adcq    2*8(%rdi),%rax
>>> +     adcq    3*8(%rdi),%rax
>>> +     adcq    4*8(%rdi),%rax
>>> +     adcq    5*8(%rdi),%rax
>>> +     adcq    6*8(%rdi),%rax
>>> +     adcq    7*8(%rdi),%rax
>>> +     adcq    8*8(%rdi),%rax
>>> +     adcq    9*8(%rdi),%rax
>>> +     adcq    10*8(%rdi),%rax
>>> +     adcq    11*8(%rdi),%rax
>>> +     adcq    12*8(%rdi),%rax
>>> +     adcq    13*8(%rdi),%rax
>>> +     adcq    14*8(%rdi),%rax
>>> +     adcq    15*8(%rdi),%rax
>>> +     lea     128(%rdi), %rdi
>>> +     loop    50b
>>
>> I'd need convincing that unrolling the loop like that gives any significant gain.
>> You have a dependency chain on the carry flag so have delays between the 'adcq'
>> instructions (these may be more significant than the memory reads from L1 cache).
>>
>> I also don't remember (might be wrong) the 'loop' instruction being executed quickly.
>> If 'loop' is fast then you will probably find that:
>>
>> 10:     adcq 0(%rdi),%rax
>>         lea  8(%rdi),%rdi
>>         loop 10b
>>
>> is just as fast since the three instructions could all be executed in parallel.
>> But I suspect that 'dec %cx; jnz 10b' is actually better (and might execute as
>> a single micro-op).
>> IIRC 'adc' and 'dec' will both have dependencies on the flags register
>> so cannot execute together (which is a shame here).
>>
>> It is also possible that breaking the carry-chain dependency by doing 32bit
>> adds (possibly after 64bit reads) can be made to be faster.
>
> If nothing else reducing the size of this main loop may be desirable.
> I know the newer x86 is supposed to have a loop buffer so that it can
> basically loop on already decoded instructions.  Normally it is only
> something like 64 or 128 bytes in size though.  You might find that
> reducing this loop to that smaller size may improve the performance
> for larger payloads.
>
In my testing, a 128-byte loop body performed better. For large packets
this loop does all the work, and performance depends on the amount of
loop overhead: we got it down to two non-adcq instructions per
iteration, but that overhead is still noticeable. The 128-byte unroll
also helps a lot for sizes up to 128 bytes, since we only need a single
call into the jump table and no trip through the loop.
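
For readers following the thread, here is a rough C sketch of the two
accumulation strategies being compared: a 64-bit add with end-around
carry (the C analogue of the adcq chain) and David's suggestion of
32-bit adds after 64-bit reads, which uses two independent accumulators
and so avoids the carry-flag dependency. This is not the patch's code.
The names fold64, add64_carry, csum_adc and csum_split32 are
hypothetical, and the sketch assumes len is a multiple of 8 and small
enough that the 64-bit accumulators in csum_split32 cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 64-bit ones' complement accumulator down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* at most 33 bits */
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* at most 32 bits */
    sum = (sum & 0xffffu) + (sum >> 16);      /* at most 17 bits */
    sum = (sum & 0xffffu) + (sum >> 16);      /* at most 16 bits */
    return (uint16_t)sum;
}

/* 64-bit add with end-around carry, like one step of an adcq chain. */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
    a += b;
    return a + (a < b);  /* a < b means the addition wrapped */
}

/* Strategy 1: one accumulator, every add depends on the previous carry. */
static uint16_t csum_adc(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t sum = 0, w;

    for (size_t i = 0; i + 8 <= len; i += 8) {
        memcpy(&w, p + i, 8);  /* unaligned-safe 64-bit load */
        sum = add64_carry(sum, w);
    }
    return fold64(sum);
}

/* Strategy 2: 64-bit reads, 32-bit halves added into two accumulators.
 * No carry handling is needed inside the loop (assumes len < 32 GiB). */
static uint16_t csum_split32(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t lo = 0, hi = 0, w;

    for (size_t i = 0; i + 8 <= len; i += 8) {
        memcpy(&w, p + i, 8);
        lo += w & 0xffffffffu;
        hi += w >> 32;
    }
    /* 2^32 is congruent to 1 modulo 0xffff, so the halves simply add. */
    return fold64(add64_carry(lo, hi));
}
```

Both functions should return the same folded value for the same
buffer. Which one is faster on a given CPU is exactly the open question
in this thread and would need to be measured.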

Tom

> - Alex
