Message-ID: <4eb6bf799d5848e6829a89bae96c359e@AcuMS.aculab.com>
Date: Wed, 4 Dec 2019 10:06:42 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Peter Zijlstra' <peterz@...radead.org>
CC: linux-kernel <linux-kernel@...r.kernel.org>,
"x86@...nel.org" <x86@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: RE: [PATCH] x86: Optimise x86 IP checksum code
From: Peter Zijlstra
> Sent: 04 December 2019 09:15
> On Tue, Dec 03, 2019 at 11:52:09AM +0000, David Laight wrote:
>
> > I did get about 12 bytes/clock using adox/adcx, but that would need run-time
> > patching, and some AMD CPUs that support the instructions run them very slowly.
>
> Isn't that what we have alternative_call() for?
You'd need to do a run-time check even if the instructions are supported.
Getting the ad[oc]x loop to work is a lot of effort for little gain.
I only tested the loop, not the alignment code, which is tricky since
the loop needs significant unrolling (on Intel CPUs adc and jmp both need
port 0 or 5, so you can only do two per clock).
It might be worth doing it on AMD Ryzen, where you can use the 'loop'
instruction, but then you'd need to set up multiple base registers and
would be processing memory backwards (which loses the prefetches).
Quite likely you'd need a reasonably long buffer (a few kB at least)
to get any benefit.
In any case, even in 2004 (the last time this code was changed in git)
it was pointed out that performance isn't that critical.
Interestingly, in 2004 only AMD CPUs were likely to run the adc chain
at 1 instruction/clock; all the Intel ones took 2.
4 bytes/clock can be trivially achieved in C by adding 32-bit words
to a 64-bit register.
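For illustration, a minimal sketch of that C approach might look like the
following. The function names csum_words32() and csum_fold64() are
hypothetical, not anything in the kernel tree, and the sketch assumes the
buffer is short enough (well under 16GB) that the 64-bit accumulator cannot
overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold a 64-bit accumulator down to 16 bits,
 * adding the carries back in (ones' complement addition). */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* 64 -> 33 bits */
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* 33 -> 32 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* 32 -> 17 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* 17 -> 16 bits */
	return (uint16_t)sum;
}

/* Hypothetical sketch: add 32-bit words into a 64-bit register.
 * The carries simply pile up in the high half, so no adc chain is
 * needed. The result is the folded sum in memory byte order; it is
 * not inverted. */
static uint16_t csum_words32(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t sum = 0;
	uint32_t w;

	while (len >= 4) {
		memcpy(&w, p, 4);	/* safe for unaligned buffers */
		sum += w;
		p += 4;
		len -= 4;
	}
	if (len) {
		w = 0;			/* zero-pad the 1..3 trailing bytes */
		memcpy(&w, p, len);
		sum += w;
	}
	return csum_fold64(sum);
}
```

The memcpy() calls are only there to keep the sketch well-defined for
unaligned buffers; a compiler will normally turn each one into a single
32-bit load. An IP checksum would be the complement of the value returned.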
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)