Message-ID: <os452n92-2p25-2q4r-453r-0ps224s90r98@syhkavp.arg>
Date: Tue, 17 Jun 2025 00:16:34 -0400 (EDT)
From: Nicolas Pitre <nico@...xnic.net>
To: David Laight <david.laight.linux@...il.com>
cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org, 
    u.kleine-koenig@...libre.com, Oleg Nesterov <oleg@...hat.com>, 
    Peter Zijlstra <peterz@...radead.org>, 
    Biju Das <biju.das.jz@...renesas.com>
Subject: Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the
 divide code

On Sat, 14 Jun 2025, David Laight wrote:

> Replace the bit by bit algorithm with one that generates 16 bits
> per iteration on 32bit architectures and 32 bits on 64bit ones.
> 
> On my zen 5 this reduces the time for the tests (using the generic
> code) from ~3350ns to ~1000ns.
> 
> Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> It'll be slightly slower on a real 32bit system, mostly due
> to register pressure.
> 
> The savings for 32bit x86 are much higher (tested in userspace).
> The worst case (lots of bits in the quotient) drops from ~900 clocks
> to ~130 (pretty much independent of the arguments).
> Other 32bit architectures may see better savings.
> 
> It is possible to optimise for divisors that span less than
> __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> and it doesn't remove any slow cpu divide instructions, which dominate
> the result.
> 
> Signed-off-by: David Laight <david.laight.linux@...il.com>

Nice work. I had to be fully awake to review this one.
Some suggestions below.

> +	reps = 64 / BITS_PER_ITER;
> +	/* Optimise loop count for small dividends */
> +	if (!(u32)(n_hi >> 32)) {
> +		reps -= 32 / BITS_PER_ITER;
> +		n_hi = n_hi << 32 | n_lo >> 32;
> +		n_lo <<= 32;
> +	}
> +#if BITS_PER_ITER == 16
> +	if (!(u32)(n_hi >> 48)) {
> +		reps--;
> +		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
> +		n_lo <<= 16;
> +	}
> +#endif

I think it would be more beneficial to integrate this within the loop 
itself instead. It is often the case that the dividend is initially big 
and then becomes zero, or too small relative to the divisor, partway 
through the loop.

Something like:

	unsigned long n;
	...

	reps = 64 / BITS_PER_ITER;
	while (reps--) {
		quotient <<= BITS_PER_ITER;
		n = n_hi >> (64 - 2 * BITS_PER_ITER);
		if (n < d_msig) {
			/* quotient digit is 0: just shift the dividend along */
			n_hi = n_hi << BITS_PER_ITER | n_lo >> (64 - BITS_PER_ITER);
			n_lo <<= BITS_PER_ITER;
			continue;
		}
		q_digit = n / d_msig;
		...

This way small dividends are optimized, as well as dividends with holes 
in them. And the number of loop iterations becomes constant, which 
opens up some unrolling possibilities.
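
Purely as an illustration (untested, not part of the patch), with a 
constant trip count the unrolling could even be requested explicitly:

	/*
	 * Illustration only: 2 iterations for BITS_PER_ITER == 32,
	 * 4 for BITS_PER_ITER == 16, so the compiler can duplicate
	 * the body and drop the loop overhead entirely.
	 */
	#pragma GCC unroll 4
	for (reps = 0; reps < 64 / BITS_PER_ITER; reps++) {
		/* ... loop body as above ... */
	}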

> +	/*
> +	 * Get the most significant BITS_PER_ITER bits of the divisor.
> +	 * This is used to get a low 'guestimate' of the quotient digit.
> +	 */
> +	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;

Here the unconditional +1 is pessimizing d_msig too much. You should do 
this instead:

	d_msig = (d >> (64 - BITS_PER_ITER)) + !!(d << BITS_PER_ITER);

In other words, you want to round the value up only when the truncated 
bits are non-zero, rather than apply an unconditional +1 that causes 
several unnecessary overflow fixup loops.
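
A worked example (my numbers, assuming BITS_PER_ITER == 32): for 
d = 0x8000000000000000 every bit below the top digit is zero, so no 
rounding up is needed and the digit guess stays exact:

	d_msig = (d >> 32) + !!(d << 32); /* 0x80000000: guess exact */
	d_msig = (d >> 32) + 1;           /* 0x80000001: guesses run low
					     and need the fixup loop */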


Nicolas
