[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <os452n92-2p25-2q4r-453r-0ps224s90r98@syhkavp.arg>
Date: Tue, 17 Jun 2025 00:16:34 -0400 (EDT)
From: Nicolas Pitre <nico@...xnic.net>
To: David Laight <david.laight.linux@...il.com>
cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
u.kleine-koenig@...libre.com, Oleg Nesterov <oleg@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Biju Das <biju.das.jz@...renesas.com>
Subject: Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the
divide code
On Sat, 14 Jun 2025, David Laight wrote:
> Replace the bit by bit algorithm with one that generates 16 bits
> per iteration on 32bit architectures and 32 bits on 64bit ones.
>
> On my zen 5 this reduces the time for the tests (using the generic
> code) from ~3350ns to ~1000ns.
>
> Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> It'll be slightly slower on a real 32bit system, mostly due
> to register pressure.
>
> The savings for 32bit x86 are much higher (tested in userspace).
> The worst case (lots of bits in the quotient) drops from ~900 clocks
> to ~130 (pretty much independant of the arguments).
> Other 32bit architectures may see better savings.
>
> It is possibly to optimise for divisors that span less than
> __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> and it doesn't remove any slow cpu divide instructions which dominate
> the result.
>
> Signed-off-by: David Laight <david.laight.linux@...il.com>
Nice work. I had to be fully awake to review this one.
Some suggestions below.
> + reps = 64 / BITS_PER_ITER;
> + /* Optimise loop count for small dividends */
> + if (!(u32)(n_hi >> 32)) {
> + reps -= 32 / BITS_PER_ITER;
> + n_hi = n_hi << 32 | n_lo >> 32;
> + n_lo <<= 32;
> + }
> +#if BITS_PER_ITER == 16
> + if (!(u32)(n_hi >> 48)) {
> + reps--;
> + n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
> + n_lo <<= 16;
> + }
> +#endif
I think it would be more beneficial to integrate this within the loop
itself instead. It is often the case that the dividend is initially big,
and becomes zero or too small for the division in the middle of the
loop.
Something like:
unsigned long n;
...
reps = 64 / BITS_PER_ITER;
while (reps--) {
quotient <<= BITS_PER_ITER;
n = ~n_hi >> (64 - 2 * BITS_PER_ITER);
if (n < d_msig) {
n_hi = (n_hi << BITS_PER_ITER | (n_lo >> (64 - BITS_PER_ITER)));
n_lo <<= BITS_PER_ITER;
continue;
}
q_digit = n / d_msig;
...
This way small dividends are optimized as well as dividends with holes
in them. And this allows for the number of loops to become constant
which opens some unrolling optimization possibilities.
> + /*
> + * Get the most significant BITS_PER_ITER bits of the divisor.
> + * This is used to get a low 'guestimate' of the quotient digit.
> + */
> + d_msig = (d >> (64 - BITS_PER_ITER)) + 1;
Here the unconditional +1 is pessimizing d_msig too much. You should do
this instead:
d_msig = (d >> (64 - BITS_PER_ITER)) + !!(d << BITS_PER_ITER);
In other words, you want to round up the value, not an unconditional +1
causing several unnecessary overflow fixup loops.
Nicolas
Powered by blists - more mailing lists