Message-ID: <7r88r006-o704-7q1q-21o2-6n62o864oo79@syhkavp.arg>
Date: Tue, 17 Jun 2025 21:33:23 -0400 (EDT)
From: Nicolas Pitre <nico@...xnic.net>
To: David Laight <david.laight.linux@...il.com>
cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
u.kleine-koenig@...libre.com, Oleg Nesterov <oleg@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Biju Das <biju.das.jz@...renesas.com>
Subject: Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the
divide code
On Tue, 17 Jun 2025, Nicolas Pitre wrote:
> On Sat, 14 Jun 2025, David Laight wrote:
>
> > Replace the bit by bit algorithm with one that generates 16 bits
> > per iteration on 32bit architectures and 32 bits on 64bit ones.
> >
> > On my zen 5 this reduces the time for the tests (using the generic
> > code) from ~3350ns to ~1000ns.
> >
> > Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> > It'll be slightly slower on a real 32bit system, mostly due
> > to register pressure.
> >
> > The savings for 32bit x86 are much higher (tested in userspace).
> > The worst case (lots of bits in the quotient) drops from ~900 clocks
> > to ~130 (pretty much independent of the arguments).
> > Other 32bit architectures may see better savings.
> >
> > It is possible to optimise for divisors that span less than
> > __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> > and it doesn't remove any slow cpu divide instructions which dominate
> > the result.
> >
> > Signed-off-by: David Laight <david.laight.linux@...il.com>
>
> Nice work. I had to be fully awake to review this one.
> Some suggestions below.
Here's a patch with my suggestions applied, to make them easier to figure
out. The added "inline" is necessary to fix compilation on ARM32.
The "likely()" makes for better assembly, and that branch is taken in the
vast majority of cases anyway. I've explained the rest previously,
although this is a better implementation.
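To illustrate the d_msig change with a quick userspace snippet (illustration
only, assuming the 64-bit case where BITS_PER_ITER is 32): rounding the
divisor's top digit up keeps q_digit from ever being too high, and rounding
up only when low bits are actually set means the estimate is exact whenever
the divisor has nothing below its top digit, so fewer fix-up additions are
needed.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t d = 0x8000000000000000ULL;	/* divisor, already left aligned */

	/* before: always round the top 32-bit digit up */
	uint32_t msig_old = (uint32_t)(d >> 32) + 1;
	/* after: round up only if any lower bits are set */
	uint32_t msig_new = (uint32_t)(d >> 32) + !!(d << 32);

	printf("old %#x new %#x\n", msig_old, msig_new);
	/*
	 * old: 0x80000001 -> q_digit is systematically low, extra fix-up adds
	 * new: 0x80000000 -> exact here, the fix-up loop is rarely entered
	 */
	return 0;
}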
commit 99ea338401f03efe5dbebe57e62bd7c588409c5c
Author: Nicolas Pitre <nico@...xnic.net>
Date: Tue Jun 17 14:42:34 2025 -0400
fixup! lib: mul_u64_u64_div_u64() Optimise the divide code
diff --git a/lib/math/div64.c b/lib/math/div64.c
index 3c9fe878ce68..740e59a58530 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -188,7 +188,7 @@ EXPORT_SYMBOL(iter_div_u64_rem);
#if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
-static u64 mul_add(u32 a, u32 b, u32 c)
+static inline u64 mul_add(u32 a, u32 b, u32 c)
{
return add_u64_u32(mul_u32_u32(a, b), c);
}
@@ -246,7 +246,7 @@ static inline u32 mul_u64_long_add_u64(u64 *p_lo, u64 a, u32 b, u64 c)
u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
{
- unsigned long d_msig, q_digit;
+ unsigned long n_long, d_msig, q_digit;
unsigned int reps, d_z_hi;
u64 quotient, n_lo, n_hi;
u32 overflow;
@@ -271,36 +271,21 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
/* Left align the divisor, shifting the dividend to match */
d_z_hi = __builtin_clzll(d);
- if (d_z_hi) {
+ if (likely(d_z_hi)) {
d <<= d_z_hi;
n_hi = n_hi << d_z_hi | n_lo >> (64 - d_z_hi);
n_lo <<= d_z_hi;
}
- reps = 64 / BITS_PER_ITER;
- /* Optimise loop count for small dividends */
- if (!(u32)(n_hi >> 32)) {
- reps -= 32 / BITS_PER_ITER;
- n_hi = n_hi << 32 | n_lo >> 32;
- n_lo <<= 32;
- }
-#if BITS_PER_ITER == 16
- if (!(u32)(n_hi >> 48)) {
- reps--;
- n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
- n_lo <<= 16;
- }
-#endif
-
/* Invert the dividend so we can use add instead of subtract. */
n_lo = ~n_lo;
n_hi = ~n_hi;
/*
- * Get the most significant BITS_PER_ITER bits of the divisor.
+ * Get the rounded-up most significant BITS_PER_ITER bits of the divisor.
* This is used to get a low 'guestimate' of the quotient digit.
*/
- d_msig = (d >> (64 - BITS_PER_ITER)) + 1;
+ d_msig = (d >> (64 - BITS_PER_ITER)) + !!(d << BITS_PER_ITER);
/*
* Now do a 'long division' with BITS_PER_ITER bit 'digits'.
@@ -308,12 +293,17 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
* The worst case is dividing ~0 by 0x8000 which requires two subtracts.
*/
quotient = 0;
- while (reps--) {
- q_digit = (unsigned long)(~n_hi >> (64 - 2 * BITS_PER_ITER)) / d_msig;
+ for (reps = 64 / BITS_PER_ITER; reps; reps--) {
+ quotient <<= BITS_PER_ITER;
+ n_long = ~n_hi >> (64 - 2 * BITS_PER_ITER);
/* Shift 'n' left to align with the product q_digit * d */
overflow = n_hi >> (64 - BITS_PER_ITER);
n_hi = add_u64_u32(n_hi << BITS_PER_ITER, n_lo >> (64 - BITS_PER_ITER));
n_lo <<= BITS_PER_ITER;
+ /* cut it short if q_digit would be 0 */
+ if (n_long < d_msig)
+ continue;
+ q_digit = n_long / d_msig;
/* Add product to negated divisor */
overflow += mul_u64_long_add_u64(&n_hi, d, q_digit, n_hi);
/* Adjust for the q_digit 'guestimate' being low */
@@ -322,7 +312,7 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
n_hi += d;
overflow += n_hi < d;
}
- quotient = add_u64_long(quotient << BITS_PER_ITER, q_digit);
+ quotient = add_u64_long(quotient, q_digit);
}
/*
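As an aside, for anyone wanting to reproduce the userspace testing mentioned
in the commit message: a minimal reference model built on the compiler's
128-bit type is enough to compare the generic code against. This is only a
sketch; the function name below is made up, it assumes unsigned __int128 is
available, and it assumes the true result fits in 64 bits.

#include <assert.h>
#include <stdint.h>

static uint64_t ref_mul_u64_add_u64_div_u64(uint64_t a, uint64_t b,
					    uint64_t c, uint64_t d)
{
	unsigned __int128 n = (unsigned __int128)a * b + c;	/* exact a*b + c */
	return (uint64_t)(n / d);				/* truncating divide */
}

int main(void)
{
	/* spot check: (2^40 * 10^6 + 5) / 3 */
	assert(ref_mul_u64_add_u64_div_u64(1ULL << 40, 1000000, 5, 3) ==
	       366503875925333335ULL);
	return 0;
}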