[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <gqqxuoz5jfrlsmrxdhwevfo7kflxjqhbkfy2ksnsdcadbk52hd@yaitrauy52xg>
Date: Wed, 2 Apr 2025 17:01:49 +0200
From: Uwe Kleine-König <u.kleine-koenig@...libre.com>
To: David Laight <david.laight.linux@...il.com>
Cc: Nicolas Pitre <nico@...xnic.net>,
Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] math64: Provide an uprounding variant of
mul_u64_u64_div_u64()
Hello David,
On Wed, Apr 02, 2025 at 01:52:19PM +0100, David Laight wrote:
> On Wed, 2 Apr 2025 10:16:19 +0200
> Uwe Kleine-König <u.kleine-koenig@...libre.com> wrote:
>
> > On Tue, Apr 01, 2025 at 10:37:31PM +0100, David Laight wrote:
> > > On Tue, 1 Apr 2025 16:30:34 -0400 (EDT)
> > > Nicolas Pitre <nico@...xnic.net> wrote:
> > >
> > > > On Tue, 1 Apr 2025, Nicolas Pitre wrote:
> > > >
> > > > > On Tue, 1 Apr 2025, David Laight wrote:
> > > > >
> > > > > > Looking at the C version, I wonder if the two ilog2() calls are needed.
> > > > > > They may not be cheap, and are the same as checking 'n_hi == 0'.
> > > > >
> > > > > Which two calls? I see only one.
> > > >
> > > > Hmmm, sorry. If by ilog2() you mean the clz's then those are cheap. Most
> > > > CPUs have a dedicated instruction for that. The ctz, though, is more
> > > > expensive (unless it is ARMv7 and above with an RBIT instruction).
> > >
> > > The code (6.14-rc6) starts:
> > >
> > > u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> > > {
> > > if (ilog2(a) + ilog2(b) <= 62)
> > > return div64_u64(a * b, c);
> > >
> > > Now ilog2() is (probably) much the same as clz - but not all cpu have it.
> > > Indeed clz(a) + clz(b) >= 64 would be a more obvious test.
> >
> > Ack, the more intuitive check might be
> >
> > if (fls64(a) + fls64(b) <= 64)
> > return div64_u64(a * b, c);
> >
> > then, but maybe "intuitive" is subjective here?
>
> Depends on whether you grok 'clz' or 'fls' :-)
And it also depends on the availability of the respective functions.
There is a fls64 function provided by include/asm-generic/bitops/fls64.h
and in several arch-specific arch/*/include/asm/bitops.h, but I didn't
find a clz function apart from one for arch=arc.
> > > On 32bit a check for a >> 32 | b >> 32 before the multiply might be sane.
> >
> > Not 100% sure I'm following. `a >> 32 | b >> 32` is just an indicator
> > that a * b fits into an u64 and in that case the above check is the
> > better one as this also catches e.g. a = (1ULL << 32) and b = 42.
>
> That is to optimise the multiple as well as the divide.
> It is also a very cheap test.
OK, so we have:
`a >> 32 | b >> 32` | `fls64(a) + fls64(b) <= 64`
cheap | ✓ | ✓
allows to skip multiplication | ✓ | ✓
allows to skip 128bit division | ✓ | ✓
catches all skip possibilities | x | ✓
> > > > > And please explain how it can be the same as checking 'n_hi == 0'.
> > > >
> > > > This question still stands.
> > >
> > > 'ni_hi == 0' is exactly the condition under which you can do a 64 bit divide.
> > > Since it is when 'a * b' fits in 64 bits.
> > >
> > > If you assume that most calls need the 128bit product (or why use the function)
> > > then there is little point adding code to optimise for the unusual case.
> >
> > n_hi won't be zero when the branch
> >
> > if (ilog2(a) + ilog2(b) <= 62)
> > return div64_u64(a * b, c);
> >
> > wasn't taken?
>
> Right, but you can remove that test and check n_hi instead.
> The multiplies are cheap compared to the divide loop.
> (Especially when uint128 exists.)
But you can check n_hi only after you completed the multiplication, so
checking `ilog2(a) + ilog2(b) <= 62` (or `fls64(a) + fls64(b) <= 64` or
`clz(a) + clz(b) >= 64`) sounds like a good idea to me.
> I also think you can do a much better divide loop (for many cpu).
> Shift 'c' so that clz(c) is 32.
> Shift 'n_hi/n_lo' so that clz(n_hi) is 1.
> Do a 64 by 32 divide and remainder (probably about 32 or 64 clocks).
> If bits of 'c' were discarded multiple and subtract (with overflow check).
> 'Rinse and repeat' with the remainder.
> Build the quotient out of all the partial values.
I let this for Nicolas to reply. I guess he is much more into these
details than me.
Best regards
Uwe
Download attachment "signature.asc" of type "application/pgp-signature" (489 bytes)
Powered by blists - more mailing lists