linux-kernel - Re: [PATCH v2 next 3/4] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64


Message-ID: <20250521135246.7dab6bda@pumpkin>
Date: Wed, 21 May 2025 13:52:46 +0100
From: David Laight <david.laight.linux@...il.com>
To: Nicolas Pitre <npitre@...libre.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
 u.kleine-koenig@...libre.com, Oleg Nesterov <oleg@...hat.com>, Peter
 Zijlstra <peterz@...radead.org>, Biju Das <biju.das.jz@...renesas.com>
Subject: Re: [PATCH v2 next 3/4] lib: Add mul_u64_add_u64_div_u64() and
 mul_u64_u64_div_u64_roundup()

On Tue, 20 May 2025 18:24:58 -0400 (EDT)
Nicolas Pitre <npitre@...libre.com> wrote:

> On Tue, 20 May 2025, David Laight wrote:
> 
> > On Mon, 19 May 2025 23:03:21 -0400 (EDT)
> > Nicolas Pitre <npitre@...libre.com> wrote:
> >   
...
> > > Here you should do:
> > > 
> > > 	if (ilog2(a) + ilog2(b) <= 62) {
> > > 		u64 ab = a * b;
> > > 		u64 abc = ab + c;
> > > 		if (ab <= abc)
> > > 			return div64_u64(abc, d);
> > > 	}
> > > 
> > > This is cheap and won't unconditionally discard the faster path when c != 0;  
> > 
> > That isn't really cheap.
> > ilog2() is likely to be a similar cost to a multiply
> > (my brain remembers them both as 'latency 3' on x86).  
> 
> I'm not discussing the ilog2() usage though. I'm just against limiting 
> the test to !c. My suggestion is about supporting all values of c.
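
For reference, a userspace sketch of the test quoted above, covering all
values of c. This is only an illustration under assumptions: the names
ilog2_u64() and try_fast_path() are made up, __builtin_clzll() is a
gcc/clang builtin standing in for the kernel's ilog2(), and a zero a or b
is simply left to the slow path.

```c
#include <stdint.h>

/* Stand-in for the kernel's ilog2(); only valid for v != 0. */
static int ilog2_u64(uint64_t v)
{
	return 63 - __builtin_clzll(v);
}

/*
 * Hypothetical helper: returns 1 and stores (a * b + c) / d in *res when
 * the whole computation fits in 64 bits, 0 when the caller has to fall
 * back to the full 128-bit path.
 */
static int try_fast_path(uint64_t a, uint64_t b, uint64_t c, uint64_t d,
			 uint64_t *res)
{
	/* ilog2(a) + ilog2(b) <= 62 guarantees a * b < 2^64 */
	if (a && b && ilog2_u64(a) + ilog2_u64(b) <= 62) {
		uint64_t ab = a * b;
		uint64_t abc = ab + c;

		/* ab <= abc means the addition of c did not wrap */
		if (ab <= abc) {
			*res = abc / d;
			return 1;
		}
	}
	return 0;
}
```

The ilog2 sum is conservative: some products that do fit in 64 bits are
still sent to the slow path.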

I've had further thoughts on that test.
Most (but not all - and I've forgotten which) 64bit cpus have a 64x64=>128
multiply and support u128.
So the 'multiply in parts' code is mostly for 32bit.
That means that the 'a * b' for the call to div64_u64() has to be three
32x32=>64 multiplies with all the extra 'add' and 'adc $0' to generate
a correct 64bit result.
This is (well should be) much the same as the multiply coded in the
function - except it is generated by the compiler itself.
The only parts it can ignore are those that set 'z' and 'y_hi'.
If my clock sequence (in the other email) is correct it saves 3 of 10
clocks - so a test to avoid the multiply has to be better than that.
That probably means the only worthwhile check is for a and b being 32bit
so a single multiply can be used.
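
A rough sketch of what that means on 32bit, written out in portable C.
The function names are made up for illustration and this is not the code
from the patch: the first function is roughly what a 64bit 'a * b' has to
turn into on a 32bit target, the other two show the cheap check and the
single multiply it allows.

```c
#include <stdint.h>

/* Truncated 64bit product built from 32bit halves: three multiplies. */
static uint64_t mul64_lo_in_parts(uint64_t a, uint64_t b)
{
	uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
	uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

	/* low halves: full 32x32=>64 product */
	uint64_t lo = (uint64_t)a_lo * b_lo;
	/* cross terms: the shift below keeps only their low 32 bits */
	uint64_t cross = (uint64_t)a_lo * b_hi + (uint64_t)a_hi * b_lo;

	/* a_hi * b_hi only affects bits 64 and up, so it is not needed */
	return lo + (cross << 32);
}

/* The cheap check: do both operands fit in 32 bits? */
static int both_fit_u32(uint64_t a, uint64_t b)
{
	return ((a | b) >> 32) == 0;
}

/* With both operands 32bit, one 32x32=>64 multiply gives the exact product. */
static uint64_t mul64_single(uint64_t a, uint64_t b)
{
	return (uint64_t)(uint32_t)a * (uint32_t)b;
}
```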

The generated code for 32bit x86 isn't as good as one might hope,
partially due to only having 7 (6 if %bp is used as the frame pointer) registers.
clang makes a reasonable job of it, gcc doesn't.
There is already a mul_u32_u32() wrapper in arch/x86/include/asm/div64.h.
There needs to be a similar add_u64_u32() (contains add %s,%d_lo, adc $0,%d_hi).
Without them gcc spills a lot of values to stack - including constant zeros.
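
In portable C the two helpers would behave like the sketch below. This is
only to show the intended semantics: the '_sketch' names are made up, and
the x86 versions would use inline asm (mul, and add followed by adc $0)
rather than plain C.

```c
#include <stdint.h>

/* Same result as mul_u32_u32(): a single 32x32=>64 multiply. */
static inline uint64_t mul_u32_u32_sketch(uint32_t a, uint32_t b)
{
	return (uint64_t)a * b;
}

/*
 * Assumed semantics for an add_u64_u32(): add a 32bit value to a 64bit
 * one, written as the low-half add plus carry into the high half.
 */
static inline uint64_t add_u64_u32_sketch(uint64_t a, uint32_t b)
{
	uint32_t lo = (uint32_t)a + b;			/* add %s,%d_lo */
	uint32_t hi = (uint32_t)(a >> 32) + (lo < b);	/* adc $0,%d_hi */

	return ((uint64_t)hi << 32) | lo;
}
```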

I might add those and use them in v3 (which I need to send anyway).
They'll match what my 'pending' faster code does.

	David