[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <iz5kp75bwhesaw7kxga7rq7m4oahl3wasnfrdrtxukpivhvy5g@3scc5272aybh>
Date: Thu, 3 Apr 2025 08:08:36 +0200
From: Uwe Kleine-König <u.kleine-koenig@...libre.com>
To: David Laight <david.laight.linux@...il.com>
Cc: Nicolas Pitre <nico@...xnic.net>,
Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] math64: Provide an uprounding variant of
mul_u64_u64_div_u64()
On Wed, Apr 02, 2025 at 09:59:19PM +0100, David Laight wrote:
> On Wed, 2 Apr 2025 17:01:49 +0200
> Uwe Kleine-König <u.kleine-koenig@...libre.com> wrote:
>
> > Hello David,
> >
> > On Wed, Apr 02, 2025 at 01:52:19PM +0100, David Laight wrote:
> > > On Wed, 2 Apr 2025 10:16:19 +0200
> > > Uwe Kleine-König <u.kleine-koenig@...libre.com> wrote:
> > >
> > > > On Tue, Apr 01, 2025 at 10:37:31PM +0100, David Laight wrote:
> > > > > On Tue, 1 Apr 2025 16:30:34 -0400 (EDT)
> > > > > Nicolas Pitre <nico@...xnic.net> wrote:
> > > > >
> > > > > > On Tue, 1 Apr 2025, Nicolas Pitre wrote:
> > > > > >
> > > > > > > On Tue, 1 Apr 2025, David Laight wrote:
> > > > > > >
> > > > > > > > Looking at the C version, I wonder if the two ilog2() calls are needed.
> > > > > > > > They may not be cheap, and are the same as checking 'n_hi == 0'.
> > > > > > >
> > > > > > > Which two calls? I see only one.
> > > > > >
> > > > > > Hmmm, sorry. If by ilog2() you mean the clz's then those are cheap. Most
> > > > > > CPUs have a dedicated instruction for that. The ctz, though, is more
> > > > > > expensive (unless it is ARMv7 and above with an RBIT instruction).
> > > > >
> > > > > The code (6.14-rc6) starts:
> > > > >
> > > > > u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> > > > > {
> > > > > if (ilog2(a) + ilog2(b) <= 62)
> > > > > return div64_u64(a * b, c);
> > > > >
> > > > > Now ilog2() is (probably) much the same as clz - but not all cpu have it.
> > > > > Indeed clz(a) + clz(b) >= 64 would be a more obvious test.
> > > >
> > > > Ack, the more intuitive check might be
> > > >
> > > > if (fls64(a) + fls64(b) <= 64)
> > > > return div64_u64(a * b, c);
> > > >
> > > > then, but maybe "intuitive" is subjective here?
> > >
> > > Depends on whether you grok 'clz' or 'fls' :-)
> >
> > And it also depends on the availability of the respective functions.
> > There is a fls64 function provided by include/asm-generic/bitops/fls64.h
> > and in several arch-specific arch/*/include/asm/bitops.h, but I didn't
> > find a clz function apart from one for arch=arc.
> >
> > > > > On 32bit a check for a >> 32 | b >> 32 before the multiply might be sane.
> > > >
> > > > Not 100% sure I'm following. `a >> 32 | b >> 32` is just an indicator
> > > > that a * b fits into an u64 and in that case the above check is the
> > > > better one as this also catches e.g. a = (1ULL << 32) and b = 42.
> > >
> > > That is to optimise the multiple as well as the divide.
> > > It is also a very cheap test.
> >
> > OK, so we have:
> >
> > `a >> 32 | b >> 32` | `fls64(a) + fls64(b) <= 64`
> > cheap | ✓ | ✓
>
> fls() isn't always cheap.
> x86 has had bsr since the 386, but I don't remember seeing it in arm32 or ppc.
I don't know about ppc, but arm32 has a clz instruction. With the
definition of fls64 for BITS_PER_LONG == 32 in
include/asm-generic/bitops/fls64.h:
static __always_inline int fls64(__u64 x)
{
__u32 h = x >> 32;
if (h)
return fls(h) + 32;
return fls(x);
}
and fls from include/asm-generic/bitops/builtin-fls.h:
static __always_inline int fls(unsigned int x)
{
return x ? sizeof(x) * 8 - __builtin_clz(x) : 0;
}
I'd claim this qualifies for "cheap". Agreed, it's still more expensive
than a >> 32 | b >> 32, but its result is also better. I guess we cannot
judge without a deeper analysis and profiling and a guess about likely
values of a and b.
[Side note: gcc also has a __builtin_clzll which might be a good fit to
implement fls64.]
> The 'a >> 32 | b >> 32' is very cheap on 32bit.
>
> > allows to skip multiplication | ✓ | ✓
> > allows to skip 128bit division | ✓ | ✓
> > catches all skip possibilities | x | ✓
> >
> > > > > > > And please explain how it can be the same as checking 'n_hi == 0'.
> > > > > >
> > > > > > This question still stands.
> > > > >
> > > > > 'ni_hi == 0' is exactly the condition under which you can do a 64 bit divide.
> > > > > Since it is when 'a * b' fits in 64 bits.
> > > > >
> > > > > If you assume that most calls need the 128bit product (or why use the function)
> > > > > then there is little point adding code to optimise for the unusual case.
> > > >
> > > > n_hi won't be zero when the branch
> > > >
> > > > if (ilog2(a) + ilog2(b) <= 62)
> > > > return div64_u64(a * b, c);
> > > >
> > > > wasn't taken?
> > >
> > > Right, but you can remove that test and check n_hi instead.
> > > The multiplies are cheap compared to the divide loop.
> > > (Especially when uint128 exists.)
> >
> > But you can check n_hi only after you completed the multiplication, so
> > checking `ilog2(a) + ilog2(b) <= 62` (or `fls64(a) + fls64(b) <= 64` or
> > `clz(a) + clz(b) >= 64`) sounds like a good idea to me.
>
> Depends on how much of the time you think the multiply is needed.
> If it is needed most of the time you want to do it first.
> If it is hardly ever needed then the initial check is likely to be a gain.
It also depends on the costs of the two checks. If they are similar,
doing the early check is better.
Anyhow, this discussion is quite a bit away from what I want to achieve.
My objective is to drop the local implementatino of
mul_u64_u64_div_u64_roundup() from the pwm-stm32 driver. So I suggest we
either add my suggested mul_u64_u64_div_u64_roundup() now or you care
about adding an optimized variant of it (in a timely manner is
welcomed).
Best regards
Uwe
Download attachment "signature.asc" of type "application/pgp-signature" (489 bytes)
Powered by blists - more mailing lists