Message-ID: <4eb6bf799d5848e6829a89bae96c359e@AcuMS.aculab.com>
Date:   Wed, 4 Dec 2019 10:06:42 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Peter Zijlstra' <peterz@...radead.org>
CC:     linux-kernel <linux-kernel@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: RE: [PATCH] x86: Optimise x86 IP checksum code

From: Peter Zijlstra
> Sent: 04 December 2019 09:15
> On Tue, Dec 03, 2019 at 11:52:09AM +0000, David Laight wrote:
> 
> > I did get about 12 bytes/clock using adox/adcx but that would need run-time
> > patching, and some AMD CPUs that support the instructions run them very slowly.
> 
> Isn't that what we have alternative_call() for?

You'd still need a run-time check of how fast the instructions run, even
when they are supported.

Getting the ad[oc]x loop to work is a lot of effort for little gain.
I only tested the loop, not the alignment code, which is tricky since
the loop needs significant unrolling (on Intel CPUs adc and jmp need ports
0 or 5, so you can only do two per clock).
It might be worth doing on AMD Ryzen, where you can use the 'loop'
instruction, but then you'd need to set up multiple base registers and
would be processing memory backwards (which loses the prefetches).

Quite likely you'd need a reasonably long buffer (a few kB at least)
to get any benefit.

In any case, even in 2004 (the last time this code was changed in git)
it was pointed out that performance isn't that critical.
Interestingly, in 2004 only AMD CPUs were likely to run the adc chain
at 1 instruction/clock; all the Intel ones took 2 clocks per adc.
4 bytes/clock can be trivially achieved in C by adding 32-bit words
to a 64-bit register.
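
As a rough illustration of that last point (a sketch only, not the kernel
code; the names csum_words32() and fold64() are made up for this example),
something along these lines adds 32-bit words into a 64-bit accumulator
and folds the result at the end. It assumes the buffer is well under 16GB
so the accumulator cannot overflow, and it returns the un-inverted 16-bit
ones' complement sum in native byte order:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold a 64-bit accumulator to a 16-bit
 * ones' complement sum using end-around carries. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* at most 33 bits */
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* fits in 32 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* at most 17 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* fits in 16 bits */
	return (uint16_t)sum;
}

/* Hypothetical example: sum a buffer as native-endian 32-bit words.
 * Carries out of bit 31 simply collect in the high half of 'sum',
 * so no adc chain is needed inside the loop. */
uint16_t csum_words32(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t sum = 0;
	uint32_t w;

	while (len >= 4) {
		memcpy(&w, p, 4);	/* alignment-safe load */
		sum += w;
		p += 4;
		len -= 4;
	}
	if (len) {
		w = 0;			/* zero-pad the 1 to 3 trailing bytes */
		memcpy(&w, p, len);
		sum += w;
	}
	return fold64(sum);
}
```

Storing the returned value to memory gives the sum in network byte order
on either endianness, since the ones' complement sum is byte-order
independent. A real checksum would still need to invert the result and
deal with any initial sum passed in by the caller.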

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
