Message-ID: <20170720152749.k7al6xsvckczolzi@hirez.programming.kicks-ass.net>
Date: Thu, 20 Jul 2017 17:27:49 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Anshul Garg <aksgarg1989@...il.com>,
Davidlohr Bueso <dave@...olabs.net>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
anshul.g@...sung.com, Thomas Gleixner <tglx@...utronix.de>,
joe@...ches.com
Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function
On Thu, Jul 20, 2017 at 01:24:49PM +0200, Peter Zijlstra wrote:
> ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt
>
> Performance counter stats for './sqrt' (10 runs):
>
> 328,415,775 cycles:u ( +- 0.15% )
> 1,138,579,704 instructions:u # 3.47 insn per cycle ( +- 0.00% )
>
> 0.088703205 seconds time elapsed
>
> static __always_inline unsigned long fls(unsigned long word)
> {
> 	asm("rep; bsr %1,%0"
> 	    : "=r" (word)
> 	    : "rm" (word));
> 	return BITS_PER_LONG - 1 - word;
> }

That is actually "lzcnt" (a "rep"-prefixed "bsr" is the lzcnt encoding). If I use the regular __fls() implementation instead:

static __always_inline unsigned long __fls(unsigned long word)
{
	asm("bsr %1,%0"
	    : "=r" (word)
	    : "rm" (word));
	return word;
}

It ends up slightly more expensive:
~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

 Performance counter stats for './sqrt' (10 runs):

       384,842,215      cycles:u                                  ( +-  0.08% )
     1,118,579,712      instructions:u   #  2.91  insn per cycle  ( +-  0.00% )

       0.103018001 seconds time elapsed

Still loads cheaper than pretty much any other combination.
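
For reference, a minimal portable sketch of how the two variants above relate: for a non-zero word, the bit index that "bsr" returns equals BITS_PER_LONG - 1 minus the leading-zero count that "lzcnt" returns. The sketch assumes GCC or Clang (it uses __builtin_clzl() rather than inline asm), and the names fls_via_clz(), fls_via_loop() and int_sqrt_sketch() are made up for illustration. The square root loop is one plausible way to use the result to pick the starting bit; it is not the sqrt.c benchmarked above, which is not shown in this mail.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))

/*
 * Index of the most significant set bit, i.e. what "bsr" returns.
 * Derived from the leading-zero count, i.e. what "lzcnt" returns.
 * Like the asm versions, the result is undefined for word == 0.
 */
static inline unsigned long fls_via_clz(unsigned long word)
{
	return BITS_PER_LONG - 1 - (unsigned long)__builtin_clzl(word);
}

/* Slow reference version, only for cross-checking the one above. */
static inline unsigned long fls_via_loop(unsigned long word)
{
	unsigned long idx = 0;

	while (word >>= 1)
		idx++;
	return idx;
}

/*
 * Hypothetical bit-by-bit integer square root that starts at the
 * highest even bit position not above the top set bit of x, instead
 * of always starting at the top of the word.
 */
static unsigned long int_sqrt_sketch(unsigned long x)
{
	unsigned long b, m, y = 0;

	if (x <= 1)
		return x;

	m = 1UL << (fls_via_clz(x) & ~1UL);
	while (m != 0) {
		b = y + m;
		y >>= 1;

		if (x >= b) {
			x -= b;
			y += m;
		}
		m >>= 2;
	}

	return y;
}
```

Starting from the top set bit means the loop runs for roughly half as many iterations as x has significant bits, which is why the cost of the fls step itself shows up in the cycle counts.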