Message-ID: <3yh26lm4kiig6oyy7st3q2wge2idbzblues5q6jf7xld2mlgba@jsfny35ucs5k>
Date: Fri, 5 Jul 2024 12:19:01 +0200
From: Uwe Kleine-König <u.kleine-koenig@...libre.com>
To: Nicolas Pitre <nico@...xnic.net>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/2] mul_u64_u64_div_u64: make it precise always
On Thu, Jul 04, 2024 at 05:16:37PM -0400, Nicolas Pitre wrote:
> On Thu, 4 Jul 2024, Uwe Kleine-König wrote:
>
> > Hello Nico,
> >
> > On Tue, Jul 02, 2024 at 11:34:08PM -0400, Nicolas Pitre wrote:
> > > + A.v = a;
> > > + B.v = b;
> > > +
> > > + X.v = (u64)A.l * B.l;
> > > + Y.v = (u64)A.l * B.h + X.h;
> > > + Z.v = (u64)A.h * B.h + Y.h;
> > > + Y.v = (u64)A.h * B.l + Y.l;
> > > + X.h = Y.l;
> > > + Z.v += Y.h;
> > > +
> > > + u64 n_lo = X.v, n_hi = Z.v;
> >
> > I tried to understand your patch. This part could really benefit from
> > some comments. With pen and paper I worked out your idea:
> >
> > The goal is:
> >
> > A * B == Z << 64 + X
> >
> > With A = A.h << 32 + A.l and similar identities for B, we have:
> >
> > A * B = (A.h << 32 + A.l) * (B.h << 32 + B.l)
> > = (A.h * B.h << 64 + (A.h * B.l + A.l * B.h) << 32 + A.l * B.l
> ^ missing )
Ack, alternatively the opening ( can be dropped.
> > The operations done here are only 32 bit multiplications and
> > additions, and with U32_MAX = 0xffffffff we have:
> > U32_MAX * U32_MAX + U32_MAX = (U32_MAX + 1) * U32_MAX =
> > 0xffffffff00000000 which fits into an u64. Even when adding
> > another U32_MAX (which happens with Z.v += Y.h) it still fits
> > into u64, and so the operations won't overflow.
>
> Exact, that's the idea.
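
Just to write my understanding down: without the union representation the
same multiplication could look roughly like this (completely untested,
the helper name is mine, u32/u64 being the usual kernel types):

        static void mul_u64_u64_to_128(u64 a, u64 b, u64 *hi, u64 *lo)
        {
                u32 a_l = a, a_h = a >> 32;
                u32 b_l = b, b_h = b >> 32;
                u64 x, y, z;

                /*
                 * Each partial product is at most
                 * 0xffffffff * 0xffffffff = 0xfffffffe00000001, so even
                 * after adding one or two 32-bit carries everything
                 * still fits into a u64.
                 */
                x = (u64)a_l * b_l;                  /* X.v */
                y = (u64)a_l * b_h + (x >> 32);      /* Y.v */
                z = (u64)a_h * b_h + (y >> 32);      /* Z.v */

                y = (u64)a_h * b_l + (u32)y;         /* second Y.v */
                z += y >> 32;                        /* Z.v += Y.h */

                *lo = (y << 32) | (u32)x;            /* X.h = Y.l */
                *hi = z;
        }
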
>
> I was about to reproduce the code I wrote for a similar purpose about 18
> years ago that currently lives in __arch_xprod_64()
> (include/asm-generic/div64.h), when I realized that, with some
> reordering, all the overflow handling could be avoided entirely.
>
> So I'm about to submit some nice simplification for my old optimized
> __div64_const32() based on this realisation.
>
> > > + /* Do the full 128 by 64 bits division */
> >
> > Here is the code location where I stop understanding your code :-)
>
> Here's how it goes:
>
> To do a binary division, you have to align the numbers, find how many
> times the divisor fits and subtract, just like we learned in primary
> school.
> Except that we have binary numbers instead of base 10 numbers, making
> the "how many times" either 0 or 1.
>
> To align numbers, let's simply move set bits to the top left bit.
>
> Let's suppose a divisor c of 0x00ffffff00000000
>
> shift = __builtin_clzll(c);
> c <<= shift;
>
> so shift = 8 and c becomes 0xffffff0000000000
>
> We also need to track the actual divisor power. Given that we start
> working on n_hi not n_lo, the power is initially 64. But we just
> shifted c, which increased its power:
>
> p = 64 + shift;
>
> Then, remember that n_hi < original c. That's ensured by the overflow
> test earlier. So shifting n_hi leftwards will require a greater shift
> than the one we applied to c, meaning that p will become 63 or less
> during the first loop.
>
> Let's suppose n_hi = 0x000ffffffff00000 and n_lo = 0
>
> Then enter the loop:
>
> carry = n_hi >> 63;
>
> Top bit of n_hi is unset so no carry.
>
> shift = carry ? 1 : __builtin_clzll(n_hi);
>
> If n_hi's top bit were set we'd have a shift of 1 with a carry. But here
> there is no carry and aligning n_hi to the top left bit requires a shift
> of 12.
>
> n_hi <<= shift;
> n_hi |= n_lo >> (64 - shift);
> n_lo <<= shift;
>
> So n_hi is now 0xffffffff00000000
>
> p -= shift;
>
> Shifting the dividend left reduces the divisor's power.
> So p is now 64 + 8 - 12 = 60
>
> Then, the crux of the operation:
>
> if (carry || (n_hi >= c)) {
> n_hi -= c;
> res |= 1ULL << p;
> }
>
> So... if the divisor fits then we add a 1 to the result and subtract it.
> n_hi = 0xffffffff00000000 - 0xffffff0000000000 = 0x000000ff00000000
> res |= 1 << 60
>
> Let's loop again:
>
> carry = n_hi >> 63;
> shift = carry ? 1 : __builtin_clzll(n_hi);
> ...
>
> No carry, shift becomes 24, p becomes 60 - 24 = 36 and
> n_hi becomes 0xff00000000000000.
>
> if (carry || (n_hi >= c)) { ...
>
> No carry and n_hi is smaller than c so loop again.
>
> carry = n_hi >> 63;
> shift = carry ? 1 : __builtin_clzll(n_hi);
>
> This time we have a carry as the top bit of n_hi is set and we're about
> to shift it by 1. p becomes 35 and n_hi becomes 0xfe00000000000000. In
> reality it is like having 0x1fe00000000000000 (a 65-bit value), which is
> obviously bigger than 0xffffff0000000000. So we can augment the result
> and subtract. Thanks to two's complement, we have:
>
> n_hi = 0xfe00000000000000 - 0xffffff0000000000 = 0xfe00010000000000
>
> and
>
> res = 1 << 60 | 1 << 35
Oh wow, that part is clever. Before your mail I wondered for a while why
the right thing happens if carry=1 but n_hi < c.
> And so on until either n_hi becomes 0 or p would go negative, which
> might happen quite quickly in some cases.
OK, so the loop invariant at the end of each iteration is:
final result = res + (n_hi#n_lo << p) / c
(with n_hi#n_lo = n_hi << 64 | n_lo), right?
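
Spelled out in C, my reading of the division loop is roughly the
following (untested and certainly not identical to your patch, just to
check my understanding; it relies on n_hi < c on entry, which the
earlier overflow test guarantees):

        static u64 div_u128_by_u64(u64 n_hi, u64 n_lo, u64 c)
        {
                u64 res = 0;
                int shift, p;

                /* Normalize the divisor so that its top bit is set */
                shift = __builtin_clzll(c);
                c <<= shift;
                /* We start consuming bits of n_hi, hence 64 + shift */
                p = 64 + shift;

                while (n_hi) {
                        int carry = n_hi >> 63;

                        shift = carry ? 1 : __builtin_clzll(n_hi);
                        if (shift > p)
                                break;  /* p would go negative */
                        p -= shift;
                        n_hi <<= shift;
                        n_hi |= n_lo >> (64 - shift);
                        n_lo <<= shift;
                        if (carry || n_hi >= c) {
                                n_hi -= c;
                                res |= 1ULL << p;
                        }
                }
                return res;
        }
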
> > > + /* The remainder value if needed would be n_hi << p */
> >
> > I indeed need a variant of this function that rounds up. So maybe
> > creating a function
> >
> > u64 mul_u64_u64_div_u64_rem(u64 a, u64 b, u64 c, u64 *rem)
> >
> > with the sophistication of your mul_u64_u64_div_u64 and then:
> >
> > u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> > {
> >         u64 rem, ret;
> >
> >         ret = mul_u64_u64_div_u64_rem(a, b, c, &rem);
> >         return ret;
> > }
> >
> > (In the hope that the compiler optimizes out the calculation for the
> > remainder)
>
> It probably won't unless the core function is a static inline.
>
> It might be more efficient to do this:
>
> u64 mul_u64_u64_div_u64_rem(u64 a, u64 b, u64 c, u64 *rem)
> {
>         u64 res = mul_u64_u64_div_u64(a, b, c);
>
>         /*
>          * Those multiplications will overflow but it doesn't matter:
>          * the true remainder is smaller than c, so evaluating it
>          * modulo 2^64 still gives the exact value.
>          */
>         *rem = a * b - c * res;
>
>         return res;
> }
>
> This way the core code doesn't get duplicated.
Good idea. I'll check that after the discussion about this patch.
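
If that works out, the round-up variant I'm after would then just be
(hypothetical, not part of this series):

        u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 c)
        {
                u64 rem, res = mul_u64_u64_div_u64_rem(a, b, c, &rem);

                /* round up (ignoring a possible res == U64_MAX wrap) */
                return res + (rem != 0);
        }
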
Best regards
Uwe