Message-ID: <CALx6S37fGc0-JBVCYT0zh1PMGWkkw--YaqRyUU8tFNMLEn7xFQ@mail.gmail.com>
Date:	Thu, 4 Feb 2016 12:59:30 -0800
From:	Tom Herbert <tom@...bertland.com>
To:	David Laight <David.Laight@...lab.com>
Cc:	Alexander Duyck <alexander.duyck@...il.com>,
	"davem@...emloft.net" <davem@...emloft.net>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"mingo@...hat.com" <mingo@...hat.com>,
	"hpa@...or.com" <hpa@...or.com>, "x86@...nel.org" <x86@...nel.org>,
	"kernel-team@...com" <kernel-team@...com>
Subject: Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

On Thu, Feb 4, 2016 at 9:09 AM, David Laight <David.Laight@...lab.com> wrote:
> From: Tom Herbert
> ...
>> > If nothing else, reducing the size of this main loop may be desirable.
>> > I know newer x86 CPUs are supposed to have a loop buffer so that they
>> > can basically loop on already-decoded instructions.  Normally it is
>> > only something like 64 or 128 bytes in size, though.  You might find
>> > that reducing this loop to that smaller size improves the performance
>> > for larger payloads.
>>
>> I saw 128 to be better in my testing. For large packets this loop does
>> all the work. Performance depends on the amount of loop overhead: we
>> got it down to two non-adcq instructions, but it is still noticeable.
>> Also, this helps a lot on sizes up to 128 bytes, since we only need to
>> do a single call in the jump table and no trip through the loop.
>
> But one of your 'loop overhead' instructions is 'loop'.
> Per http://www.agner.org/optimize/instruction_tables.pdf
> you don't want to be using 'loop' on Intel CPUs.
>
I'm not following. We can replace loop with decl %ecx and jg, but why
is that better?

Tom

> You might get some benefit from pipelining the loop (so you do
> a read to a register in one iteration and a register-register adc
> in the next).
>
>         David
>
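
For reference, the two loop shapes discussed above can be sketched in
portable C. This is only an illustration under stated assumptions, not the
code from the patch: the function names (add_carry64, csum_words_simple,
csum_words_pipelined, csum_fold64) are hypothetical, the buffer is assumed
to be a whole number of 64-bit words, and the compiler is free to schedule
the loads differently from what the source suggests. The first loop uses an
explicit countdown and branch, which is roughly the C analogue of replacing
'loop' with a decrement plus conditional jump; as far as I understand the
instruction tables cited above, 'loop' decodes into several micro-ops on
Intel CPUs, whereas a decrement and conditional jump is typically cheaper.
The second loop is the pipelined form David describes: the load for the
next word is issued in one iteration and the add-with-carry happens in the
next.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add with end-around carry: the C analogue of adcq followed by adcq $0. */
static uint64_t add_carry64(uint64_t sum, uint64_t v)
{
    sum += v;
    return sum + (sum < v);   /* fold the carry-out back into the sum */
}

/* Plain loop: explicit countdown and conditional branch per word. */
static uint64_t csum_words_simple(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;

    assert(p != NULL || n == 0);
    for (; n > 0; n--)
        sum = add_carry64(sum, *p++);
    return sum;
}

/* Software-pipelined loop: the load of the next word is issued before
 * the add of the current one, so load and add overlap across iterations. */
static uint64_t csum_words_pipelined(const uint64_t *p, size_t n)
{
    uint64_t sum = 0, cur;

    assert(p != NULL || n == 0);
    if (n == 0)
        return 0;

    cur = *p++;                        /* prologue: first load */
    while (--n > 0) {
        uint64_t next = *p++;          /* load for the next iteration */
        sum = add_carry64(sum, cur);   /* register-register add */
        cur = next;
    }
    return add_carry64(sum, cur);      /* epilogue: last add */
}

/* Fold the 64-bit ones'-complement sum down to 16 bits. */
static uint16_t csum_fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

Both loops return the same 64-bit ones'-complement sum; whether the
pipelined form is actually faster would have to be measured on the target
CPU, since out-of-order execution may already hide the load latency.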
