lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170721114039.dqip5wj2tha42mol@hirez.programming.kicks-ass.net>
Date:   Fri, 21 Jul 2017 13:40:39 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Anshul Garg <aksgarg1989@...il.com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        "anshul.g@...sung.com" <anshul.g@...sung.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Joe Perches <joe@...ches.com>, Ingo Molnar <mingo@...nel.org>,
        Will Deacon <will.deacon@....com>
Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Thu, Jul 20, 2017 at 04:24:32PM -0700, Linus Torvalds wrote:

> So mayube something like the attached?
> 
> NOTE NOTE NOTE! This may be pure garbage. It's untested. But the code
> generation looks ok even if gcc is being way too clever and turns that
> first loop into a counted loop instead. Whatever, maybe it's the right
> thing to do.

It passes a -DVALIDATE=1 run (up to a billion or so), so it seems to work.


I've made the test thing a bit more complicated in order to address your
concerns for primed branch predictors (and once the branch predictors
get wiped, the predictable input values don't matter anymore either I
think, although I could easily LFSR generate them as well of course).

Find it attached in a tarball.


The results, from running ./test.sh (left SKL, right SNB):

(values <100k)

Cycles:


EVENT=0 -DLINUS=1
event: 29.533080 +- 0.046261			36.800100 +- 0.046067
EVENT=0 -DLINUS=1 -DWIPE_BTB=1
event: 143.513440 +- 0.152651                   153.054640 +- 0.099403
EVENT=0 -DNEW=1 -DANSHUL=1
event: 41.457300 +- 0.048409                    55.046100 +- 0.044968
EVENT=0 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 128.351400 +- 0.366873                   161.183690 +- 0.132301
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 21.433170 +- 0.043859                    27.672700 +- 0.063982
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 79.850210 +- 0.133646                    112.422470 +- 0.101465
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 31.771680 +- 0.037655                    37.515160 +- 0.080305
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 129.673440 +- 0.430308                   125.514390 +- 0.117590


Branches:


EVENT=4 -DLINUS=1
event: 31.507740 +- 0.010470                    31.507710 +- 0.010470
EVENT=4 -DLINUS=1 -DWIPE_BTB=1
event: 31.507720 +- 0.010470                    31.507760 +- 0.010470
EVENT=4 -DNEW=1 -DANSHUL=1
event: 45.125510 +- 0.002696                    45.125510 +- 0.002696
EVENT=4 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 45.125490 +- 0.002696                    45.125540 +- 0.002697
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 21.252320 +- 0.005266                    21.252330 +- 0.005266
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 21.252320 +- 0.005266                    21.252340 +- 0.005266
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 25.727940 +- 0.005975                    25.727930 +- 0.005975
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 25.727940 +- 0.005975                    25.727930 +- 0.005975


Branch-Misses:


EVENT=5 -DLINUS=1
event: 0.022670 +- 0.000789                     0.020920 +- 0.000812
EVENT=5 -DLINUS=1 -DWIPE_BTB=1
event: 5.481750 +- 0.006930                     5.955130 +- 0.005228
EVENT=5 -DNEW=1 -DANSHUL=1
event: 0.025990 +- 0.000798                     0.017170 +- 0.000690
EVENT=5 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 3.343600 +- 0.004249                     5.723570 +- 0.004876
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 0.019040 +- 0.000731                     0.022570 +- 0.000829
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 2.156350 +- 0.004643                     4.297650 +- 0.004910
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 0.019800 +- 0.000747                     0.016780 +- 0.000688
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 4.481360 +- 0.005505                     4.518300 +- 0.004877




Which still doesn't explain your hatred of fls(), as even with cold
branches and a software fls, its faster than your latest proposal (as
can be explained by the lower total number of branches for the software
fls variant compared to your proposal).


And when we start to look at larger values, the fls() one pulls even
further ahead.

So I'll stick with my proposal... I can write up a proper Changelog and
maybe add my test to tools/testing/sqrt/ if you think its worth it.


---
 lib/int_sqrt.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc344977..611933760225 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -7,12 +7,13 @@
 
 #include <linux/kernel.h>
 #include <linux/export.h>
+#include <linux/bitops.h>
 
 /**
- * int_sqrt - rough approximation to sqrt
+ * int_sqrt - Computes the integer square root
  * @x: integer of which to calculate the sqrt
  *
- * A very rough approximation to the sqrt() function.
+ * Computes: floor(sqrt(@x))
  */
 unsigned long int_sqrt(unsigned long x)
 {
@@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x)
 	if (x <= 1)
 		return x;
 
-	m = 1UL << (BITS_PER_LONG - 2);
+	m = 1UL << (__fls(x) & ~1U);
+
+	while (m > x)
+		m >>= 2;
+
 	while (m != 0) {
 		b = y + m;
 		y >>= 1;

Download attachment "sqrt.tar" of type "application/x-tar" (20480 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ