Message-ID: <os452n92-2p25-2q4r-453r-0ps224s90r98@syhkavp.arg>
Date: Tue, 17 Jun 2025 00:16:34 -0400 (EDT)
From: Nicolas Pitre <nico@...xnic.net>
To: David Laight <david.laight.linux@...il.com>
cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org, 
    u.kleine-koenig@...libre.com, Oleg Nesterov <oleg@...hat.com>, 
    Peter Zijlstra <peterz@...radead.org>, 
    Biju Das <biju.das.jz@...renesas.com>
Subject: Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the
 divide code

On Sat, 14 Jun 2025, David Laight wrote:

> Replace the bit by bit algorithm with one that generates 16 bits
> per iteration on 32bit architectures and 32 bits on 64bit ones.
> 
> On my zen 5 this reduces the time for the tests (using the generic
> code) from ~3350ns to ~1000ns.
> 
> Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> It'll be slightly slower on a real 32bit system, mostly due
> to register pressure.
> 
> The savings for 32bit x86 are much higher (tested in userspace).
> The worst case (lots of bits in the quotient) drops from ~900 clocks
> to ~130 (pretty much independent of the arguments).
> Other 32bit architectures may see better savings.
> 
> It is possible to optimise for divisors that span less than
> __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> and it doesn't remove any slow cpu divide instructions, which dominate
> the result.
> 
> Signed-off-by: David Laight <david.laight.linux@...il.com>

Nice work. I had to be fully awake to review this one.
Some suggestions below.

> +	reps = 64 / BITS_PER_ITER;
> +	/* Optimise loop count for small dividends */
> +	if (!(u32)(n_hi >> 32)) {
> +		reps -= 32 / BITS_PER_ITER;
> +		n_hi = n_hi << 32 | n_lo >> 32;
> +		n_lo <<= 32;
> +	}
> +#if BITS_PER_ITER == 16
> +	if (!(u32)(n_hi >> 48)) {
> +		reps--;
> +		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
> +		n_lo <<= 16;
> +	}
> +#endif

I think it would be more beneficial to integrate this within the loop 
itself instead. It is often the case that the dividend is initially big 
and then becomes zero, or too small relative to the divisor, partway 
through the loop.

Something like:

	unsigned long n;
	...

	reps = 64 / BITS_PER_ITER;
	while (reps--) {
		quotient <<= BITS_PER_ITER;
		n = n_hi >> (64 - 2 * BITS_PER_ITER);
		if (n < d_msig) {
			/* quotient digit is 0: just shift the dividend along */
			n_hi = n_hi << BITS_PER_ITER | n_lo >> (64 - BITS_PER_ITER);
			n_lo <<= BITS_PER_ITER;
			continue;
		}
		q_digit = n / d_msig;
		...

This way small dividends are optimized, as well as dividends with holes 
in them. And the number of loop iterations becomes constant, which 
opens up some unrolling possibilities.
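
Purely as an illustration (untested, not part of the patch), with a 
constant trip count the unrolling could even be requested explicitly:

	/*
	 * Illustration only: 2 iterations for BITS_PER_ITER == 32,
	 * 4 for BITS_PER_ITER == 16, so the compiler can duplicate
	 * the body and drop the loop overhead entirely.
	 */
	#pragma GCC unroll 4
	for (reps = 0; reps < 64 / BITS_PER_ITER; reps++) {
		/* ... loop body as above ... */
	}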

> +	/*
> +	 * Get the most significant BITS_PER_ITER bits of the divisor.
> +	 * This is used to get a low 'guestimate' of the quotient digit.
> +	 */
> +	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;

Here the unconditional +1 is pessimizing d_msig too much. You should do 
this instead:

	d_msig = (d >> (64 - BITS_PER_ITER)) + !!(d << BITS_PER_ITER);

In other words, you want to round the value up only when the truncated 
bits are non-zero, rather than apply an unconditional +1 that causes 
several unnecessary overflow fixup loops.
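
A worked example (my numbers, assuming BITS_PER_ITER == 32): for 
d = 0x8000000000000000 every bit below the top digit is zero, so no 
rounding up is needed and the digit guess stays exact:

	d_msig = (d >> 32) + !!(d << 32); /* 0x80000000: guess exact */
	d_msig = (d >> 32) + 1;           /* 0x80000001: guesses run low
					     and need the fixup loop */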


Nicolas
