linux-kernel - Re: [PATCH v5 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()


Message-ID: <20251106095214.2b9c9b8c@pumpkin>
Date: Thu, 6 Nov 2025 09:52:14 +0000
From: David Laight <david.laight.linux@...il.com>
To: "H. Peter Anvin" <hpa@...or.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
 u.kleine-koenig@...libre.com, Nicolas Pitre <npitre@...libre.com>, Oleg
 Nesterov <oleg@...hat.com>, Peter Zijlstra <peterz@...radead.org>, Biju Das
 <biju.das.jz@...renesas.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
 <dave.hansen@...ux.intel.com>, Ingo Molnar <mingo@...hat.com>, Thomas
 Gleixner <tglx@...utronix.de>, Li RongQing <lirongqing@...du.com>,
 Khazhismel Kumykov <khazhy@...omium.org>, Jens Axboe <axboe@...nel.dk>,
 x86@...nel.org
Subject: Re: [PATCH v5 next 4/9] lib: Add mul_u64_add_u64_div_u64() and
 mul_u64_u64_div_u64_roundup()

On Wed, 05 Nov 2025 16:26:05 -0800
"H. Peter Anvin" <hpa@...or.com> wrote:

> On November 5, 2025 12:10:30 PM PST, David Laight <david.laight.linux@...il.com> wrote:
> >The existing mul_u64_u64_div_u64() rounds down, a 'rounding up'
> >variant needs 'divisor - 1' adding in between the multiply and
> >divide so cannot easily be done by a caller.
> >
> >Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
> >and implement the 'round down' and 'round up' using it.
> >
> >Update the x86-64 asm to optimise for 'c' being a constant zero.
> >
> >Add kerndoc definitions for all three functions.
> >
> >Signed-off-by: David Laight <david.laight.linux@...il.com>
> >Reviewed-by: Nicolas Pitre <npitre@...libre.com>
> >---
> >
> >Changes for v2 (formerly patch 1/3):
> >- Reinstate the early call to div64_u64() on 32bit when 'c' is zero.
> >  Although I'm not convinced the path is common enough to be worth
> >  the two ilog2() calls.
> > 
> >Changes for v3 (formerly patch 3/4):
> >- The early call to div64_u64() has been removed by patch 3.
> >  Pretty much guaranteed to be a pessimisation.
> >
> >Changes for v4: 
> >- For x86-64 split the multiply, add and divide into three asm blocks.
> >  (gcc makes a pigs breakfast of (u128)a * b + c)
> >- Change the kerndoc since divide by zero will (probably) fault.
> >
> >Changes for v5:
> >- Fix test that excludes the add/adc asm block for constant zero 'add'.
> >
> > arch/x86/include/asm/div64.h | 20 +++++++++------
> > include/linux/math64.h       | 48 +++++++++++++++++++++++++++++++++++-
> > lib/math/div64.c             | 14 ++++++-----
> > 3 files changed, 67 insertions(+), 15 deletions(-)
> >
> >diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
> >index 9931e4c7d73f..6d8a3de3f43a 100644
> >--- a/arch/x86/include/asm/div64.h
> >+++ b/arch/x86/include/asm/div64.h
> >@@ -84,21 +84,25 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
> >  * Will generate an #DE when the result doesn't fit u64, could fix with an
> >  * __ex_table[] entry when it becomes an issue.
> >  */
> >-static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
> >+static inline u64 mul_u64_add_u64_div_u64(u64 rax, u64 mul, u64 add, u64 div)
> > {
> >-	u64 q;
> >+	u64 rdx;
> > 
> >-	asm ("mulq %2; divq %3" : "=a" (q)
> >-				: "a" (a), "rm" (mul), "rm" (div)
> >-				: "rdx");
> >+	asm ("mulq %[mul]" : "+a" (rax), "=d" (rdx) : [mul] "rm" (mul));
> > 
> >-	return q;
> >+	if (!statically_true(!add))
> >+		asm ("addq %[add], %[lo]; adcq $0, %[hi]" :
> >+			[lo] "+r" (rax), [hi] "+r" (rdx) : [add] "irm" (add));
> >+
> >+	asm ("divq %[div]" : "+a" (rax), "+d" (rdx) : [div] "rm" (div));
> >+
> >+	return rax;
> > }
> >-#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
> >+#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64
...
> 
> For the roundup case, I'm somewhat curious how this compares with doing:

I guess you are referring to the x86-64 asm version (left above).

>    cmp $1, %rdx
>    sbb $-1, %rax
> 
> ... after the division. At least it means not needing to compute d - 1,
> saving an instruction as well as a register.

> Unfortunately using an lea instruction to compute %rax (which otherwise
>  would incorporate both) isn't possible since it doesn't set the flags.
> 
> The cmp; sbb sequence should be no slower than add;
> adc – I'm saying "no slower" because %rdx is never written to,
> so I think this is provably a better sequence; whether or not it is
> measurable is another thing (but if we are tweaking this stuff...)

I wanted the same function as the non-x86-64 version, and 'multiply and add'
possibly has other uses.
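
For readers following along, the relationship between the three functions
can be sketched in portable userspace C. This is only an illustration, not
the kernel code: the sketch_* names are hypothetical, unsigned __int128 is a
GCC/Clang extension, and the sketch assumes a non-zero divisor and a
quotient that fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace illustration only; unsigned __int128 is a GCC/Clang extension. */
typedef unsigned __int128 u128;

/* (a * b + c) / d, assuming d != 0 and that the quotient fits in 64 bits. */
static uint64_t sketch_mul_add_div(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
	/* (2^64 - 1)^2 + (2^64 - 1) < 2^128, so this sum cannot overflow. */
	u128 n = (u128)a * b + c;

	assert(d != 0);
	return (uint64_t)(n / d);
}

/* Round down: add nothing before the divide. */
static uint64_t sketch_mul_div(uint64_t a, uint64_t b, uint64_t d)
{
	return sketch_mul_add_div(a, b, 0, d);
}

/* Round up: add 'divisor - 1' between the multiply and the divide. */
static uint64_t sketch_mul_div_roundup(uint64_t a, uint64_t b, uint64_t d)
{
	return sketch_mul_add_div(a, b, d - 1, d);
}
```

For example, with a = 10, b = 10, d = 7 the round-down wrapper gives 14 and
the round-up wrapper gives 15.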

The instruction to calculate 'd - 1' (if not a constant) will usually
execute in parallel with an earlier instruction (e.g. the multiply)
so will be pretty much 'zero cost'.
The add/adc pair are in the 'register dependency chain', so each adds a clock.
The same is true for your cmp/sbb pair.
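
The two round-up strategies being compared can be written side by side in
userspace C. Again this is a hedged sketch rather than the patch's code: the
function names are hypothetical, unsigned __int128 is a GCC/Clang extension,
and both versions assume a non-zero divisor and a rounded result that fits
in 64 bits. The second function mirrors what 'cmp $1, %rdx; sbb $-1, %rax'
computes after the divide: the quotient plus one when the remainder is
non-zero.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace illustration only; unsigned __int128 is a GCC/Clang extension. */
typedef unsigned __int128 u128;

/* Round up by adding d - 1 before the divide (the add/adc form). */
static uint64_t roundup_add_first(uint64_t a, uint64_t b, uint64_t d)
{
	assert(d != 0);
	return (uint64_t)(((u128)a * b + (d - 1)) / d);
}

/* Round up by bumping the quotient afterwards (the cmp/sbb form). */
static uint64_t roundup_fix_after(uint64_t a, uint64_t b, uint64_t d)
{
	u128 n = (u128)a * b;
	uint64_t q, r;

	assert(d != 0);
	q = (uint64_t)(n / d);
	r = (uint64_t)(n % d);	/* remainder, as left in %rdx by divq */
	return q + (r != 0);
}
```

Under those assumptions the two functions return the same value; the
difference discussed here is only in instruction count and latency.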

(Except on pre-Broadwell Intel CPUs, where adc/sbb are two clocks.
I've lost the full reference, but the initial changes fixed 'adc $0,x':
they generated the carry flag immediately and only delayed the result.
The doc said 'adc $0,reg' not 'adc $const,reg' - so maybe the sbb $-1
was two clocks for longer.)

	David