linux-kernel - Re: [PATCH v2 next 3/4] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64


Message-ID: <403s8q39-33sp-pp3s-95o8-14s190or25o5@onlyvoer.pbz>
Date: Wed, 21 May 2025 09:50:28 -0400 (EDT)
From: Nicolas Pitre <npitre@...libre.com>
To: David Laight <david.laight.linux@...il.com>
cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org, 
    u.kleine-koenig@...libre.com, Oleg Nesterov <oleg@...hat.com>, 
    Peter Zijlstra <peterz@...radead.org>, 
    Biju Das <biju.das.jz@...renesas.com>
Subject: Re: [PATCH v2 next 3/4] lib: Add mul_u64_add_u64_div_u64() and
 mul_u64_u64_div_u64_roundup()

On Wed, 21 May 2025, David Laight wrote:

> On Tue, 20 May 2025 18:24:58 -0400 (EDT)
> Nicolas Pitre <npitre@...libre.com> wrote:
> 
> > On Tue, 20 May 2025, David Laight wrote:
> > 
> > > On Mon, 19 May 2025 23:03:21 -0400 (EDT)
> > > Nicolas Pitre <npitre@...libre.com> wrote:
> > >   
> ...
> > > > Here you should do:
> > > > 
> > > > 	if (ilog2(a) + ilog2(b) <= 62) {
> > > > 		u64 ab = a * b;
> > > > 		u64 abc = ab + c;
> > > > 		if (ab <= abc)
> > > > 			return div64_u64(abc, d);
> > > > 	}
> > > > 
> > > > This is cheap and won't unconditionally discard the faster path when c != 0;  
> > > 
> > > That isn't really cheap.
> > > ilog2() is likely to be a similar cost to a multiply
> > > (my brain remembers them both as 'latency 3' on x86).  
> > 
> > I'm not discussing the ilog2() usage though. I'm just against limiting 
> > the test to !c. My suggestion is about supporting all values of c.
> 
> I've had further thoughts on that test.
> Most (but not all - and I've forgotten which) 64bit cpus have a 64x64=>128
> multiply and support u128.

Looks like X86-64, ARM64 and RV64 have it. That's probably 99% of the market.
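
For reference, a minimal userspace sketch of what the 64x64=>128 path
allows. It assumes a compiler with the unsigned __int128 extension (gcc
and clang on 64bit targets). The name mul_add_div_u128() is made up for
illustration and this is not the kernel implementation; it also assumes
d != 0 and that the quotient fits in 64 bits.

```c
#include <stdint.h>

/* Hypothetical helper: (a * b + c) / d through a 128-bit intermediate. */
static uint64_t mul_add_div_u128(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
	/* a * b + c is always below 2^128, so this sum cannot wrap. */
	unsigned __int128 n = (unsigned __int128)a * b + c;

	/* Assumption: caller guarantees d != 0 and a quotient below 2^64. */
	return (uint64_t)(n / d);
}
```

On targets without a native 128-bit multiply the compiler has to expand
the same expression into several narrower multiplies, which is where the
32bit cost discussion below comes from.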

> So the 'multiply in parts' code is mostly for 32bit.

Exactly.

> That means that the 'a * b' for the call to div64_u64() has to be three
> 32x32=>64 multiplies with all the extra 'add' and 'adc $0' to generate
> a correct 64bit result.

4 multiplies to be precise.
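
As an illustration of the count, here is a sketch of a full 64x64=>128
product built from 32x32=>64 multiplies in portable C. The helper name
mul_64x64_128() is hypothetical and the layout is only one possible way
to write it, not the code from the patch.

```c
#include <stdint.h>

/* Hypothetical helper: full 128-bit product from four 32x32=>64 multiplies. */
static void mul_64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
	uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
	uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

	uint64_t ll = a_lo * b_lo;	/* contributes to bits   0..63  */
	uint64_t lh = a_lo * b_hi;	/* contributes to bits  32..95  */
	uint64_t hl = a_hi * b_lo;	/* contributes to bits  32..95  */
	uint64_t hh = a_hi * b_hi;	/* contributes to bits  64..127 */

	/* Middle column: three 32-bit values, so the sum fits in 64 bits. */
	uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

	*lo = (mid << 32) | (uint32_t)ll;
	*hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```

Note that hh only feeds the upper 64 bits. If just the low 64 bits of
a * b are wanted, the ll, lh and hl products are enough.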

> This is (well should be) much the same as the multiply coded in the
> function - except it is generated by the compiler itself.

I don't follow you here. What is the same as what?

> The only parts it can ignore are those that set 'z' and 'y_hi'.
> If my clock sequence (in the other email) is correct it saves 3 of 10
> clocks - so a test to avoid the multiply has to be better than that.
> That probably means the only worthwhile check is for a and b being 32bit
> so a single multiply can be used.

Depends how costly the ilog2 is. On ARM the clz instruction is about 1 
cycle. If you need to figure out the MSB manually then it might be best 
to skip those ilog2's.
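
To make the suggested test concrete, here is a userspace sketch of the
fast path. ilog2_u64() is a stand-in for the kernel's ilog2() built on
the gcc/clang builtin __builtin_clzll(), and mul_add_div_fast() is a
hypothetical name. The explicit zero check and the d != 0 requirement
are assumptions added for the sketch; they are not part of the snippet
quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for ilog2(): index of the highest set bit. Undefined for v == 0. */
static inline int ilog2_u64(uint64_t v)
{
	return 63 - __builtin_clzll(v);
}

/*
 * Hypothetical fast path: returns true and stores (a * b + c) / d in *res
 * when the whole computation fits in 64 bits. Assumes d != 0.
 */
static bool mul_add_div_fast(uint64_t a, uint64_t b, uint64_t c, uint64_t d,
			     uint64_t *res)
{
	/* Added for the sketch: clz of zero is undefined, so handle it first. */
	if (!a || !b) {
		*res = c / d;
		return true;
	}

	if (ilog2_u64(a) + ilog2_u64(b) <= 62) {
		uint64_t ab = a * b;	/* below 2^64, cannot overflow */
		uint64_t abc = ab + c;

		if (ab <= abc) {	/* the addition did not wrap */
			*res = abc / d;
			return true;
		}
	}

	return false;	/* caller falls back to the wide multiply and divide */
}
```

The test is conservative: some products that do fit in 64 bits (for
example 3 * (1 << 62) does not, but 1 * (1 << 63) does) are still sent
to the slow path because only the bit positions are inspected.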

> The generated code for 32bit x86 isn't as good as one might hope,
> partially due to only having 7 (6 if %bp is a stack frame) registers.
> clang makes a reasonable job of it, gcc doesn't.
> There is already a mul_u32_u32() wrapper in arch/x86/include/asm/div64.h.
> There needs to be a similar add_u64_u32() (contains add %s,%d_lo, adc $0,%d_hi).
> Without them gcc spills a lot of values to stack - including constant zeros.

I mainly looked at ARM32 and both gcc and clang do a good job here. ARM 
registers are plentiful of course.

> I might add those and use them in v3 (which I need to send anyway).
> They'll match what my 'pending' faster code does.
> 
> 	David
> 
>