linux-kernel - Re: [PATCH] math64: Provide an uprounding variant of mul_u64_u64_div

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <gqqxuoz5jfrlsmrxdhwevfo7kflxjqhbkfy2ksnsdcadbk52hd@yaitrauy52xg>
Date: Wed, 2 Apr 2025 17:01:49 +0200
From: Uwe Kleine-König <u.kleine-koenig@...libre.com>
To: David Laight <david.laight.linux@...il.com>
Cc: Nicolas Pitre <nico@...xnic.net>, 
	Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] math64: Provide an uprounding variant of
 mul_u64_u64_div_u64()

Hello David,

On Wed, Apr 02, 2025 at 01:52:19PM +0100, David Laight wrote:
> On Wed, 2 Apr 2025 10:16:19 +0200
> Uwe Kleine-König <u.kleine-koenig@...libre.com> wrote:
> 
> > On Tue, Apr 01, 2025 at 10:37:31PM +0100, David Laight wrote:
> > > On Tue, 1 Apr 2025 16:30:34 -0400 (EDT)
> > > Nicolas Pitre <nico@...xnic.net> wrote:
> > >   
> > > > On Tue, 1 Apr 2025, Nicolas Pitre wrote:
> > > >   
> > > > > On Tue, 1 Apr 2025, David Laight wrote:
> > > > >     
> > > > > > Looking at the C version, I wonder if the two ilog2() calls are needed.
> > > > > > They may not be cheap, and are the same as checking 'n_hi == 0'.    
> > > > > 
> > > > > Which two calls? I see only one.    
> > > > 
> > > > Hmmm, sorry. If by ilog2() you mean the clz's then those are cheap. Most 
> > > > CPUs have a dedicated instruction for that. The ctz, though, is more 
> > > > expensive (unless it is ARMv7 and above with an RBIT instruction).  
> > > 
> > > The code (6.14-rc6) starts:
> > > 
> > > u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> > > {
> > > 	if (ilog2(a) + ilog2(b) <= 62)
> > > 		return div64_u64(a * b, c);
> > > 
> > > Now ilog2() is (probably) much the same as clz - but not all cpu have it.
> > > Indeed clz(a) + clz(b) >= 64 would be a more obvious test.  
> > 
> > Ack, the more intuitive check might be
> > 
> > 	if (fls64(a) + fls64(b) <= 64)
> >  		return div64_u64(a * b, c);
> > 
> > then, but maybe "intuitive" is subjective here?
> 
> Depends on whether you grok 'clz' or 'fls' :-)

And it also depends on the availability of the respective functions.
There is a fls64 function provided by include/asm-generic/bitops/fls64.h
and in several arch-specific arch/*/include/asm/bitops.h, but I didn't
find a clz function apart from one for arch=arc.

> > > On 32bit a check for a >> 32 | b >> 32 before the multiply might be sane.  
> > 
> > Not 100% sure I'm following. `a >> 32 | b >> 32` is just an indicator
> > that a * b fits into an u64 and in that case the above check is the
> > better one as this also catches e.g. a = (1ULL << 32) and b = 42.
> 
> That is to optimise the multiple as well as the divide.
> It is also a very cheap test.

OK, so we have:

	                                 `a >> 32 | b >> 32` | `fls64(a) + fls64(b) <= 64`
	cheap                          |          ✓          |             ✓
	allows to skip multiplication  |          ✓          |             ✓
	allows to skip 128bit division |          ✓          |             ✓
	catches all skip possibilities |          x          |             ✓

> > > > > And please explain how it can be the same as checking 'n_hi == 0'.    
> > > > 
> > > > This question still stands.  
> > > 
> > > 'ni_hi == 0' is exactly the condition under which you can do a 64 bit divide.
> > > Since it is when 'a * b' fits in 64 bits.
> > > 
> > > If you assume that most calls need the 128bit product (or why use the function)
> > > then there is little point adding code to optimise for the unusual case.  
> > 
> > n_hi won't be zero when the branch
> > 
> > 	if (ilog2(a) + ilog2(b) <= 62)
> > 		return div64_u64(a * b, c);
> > 
> > wasn't taken?
> 
> Right, but you can remove that test and check n_hi instead.
> The multiplies are cheap compared to the divide loop.
> (Especially when uint128 exists.)

But you can check n_hi only after you completed the multiplication, so
checking `ilog2(a) + ilog2(b) <= 62` (or `fls64(a) + fls64(b) <= 64` or
`clz(a) + clz(b) >= 64`) sounds like a good idea to me.

> I also think you can do a much better divide loop (for many cpu).
> Shift 'c' so that clz(c) is 32.
> Shift 'n_hi/n_lo' so that clz(n_hi) is 1.
> Do a 64 by 32 divide and remainder (probably about 32 or 64 clocks).
> If bits of 'c' were discarded multiple and subtract (with overflow check).
> 'Rinse and repeat' with the remainder.
> Build the quotient out of all the partial values.

I let this for Nicolas to reply. I guess he is much more into these
details than me.

Best regards
Uwe

Download attachment "signature.asc" of type "application/pgp-signature" (489 bytes)