linux-kernel - Re: [PATCH 1/3] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64


Message-ID: <20250406103516.53a32bca@pumpkin>
Date: Sun, 6 Apr 2025 10:35:16 +0100
From: David Laight <david.laight.linux@...il.com>
To: Nicolas Pitre <npitre@...libre.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
 Uwe Kleine-König <u.kleine-koenig@...libre.com>, Oleg
 Nesterov <oleg@...hat.com>, Peter Zijlstra <peterz@...radead.org>, Biju Das
 <biju.das.jz@...renesas.com>
Subject: Re: [PATCH 1/3] lib: Add mul_u64_add_u64_div_u64() and
 mul_u64_u64_div_u64_roundup()

On Sat, 5 Apr 2025 21:46:25 -0400 (EDT)
Nicolas Pitre <npitre@...libre.com> wrote:

> On Sat, 5 Apr 2025, David Laight wrote:
> 
> > The existing mul_u64_u64_div_u64() rounds down, a 'rounding up'
> > variant needs 'divisor - 1' adding in between the multiply and
> > divide so cannot easily be done by a caller.
> > 
> > Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
> > and implement the 'round down' and 'round up' using it.
> > 
> > Update the x86-64 asm to optimise for 'c' being a constant zero.
> > 
> > For architectures that support u128 check for a 64bit product after
> > the multiply (will be cheap).
> > Leave in the early check for other architectures (mostly 32bit) when
> > 'c' is zero to avoid the multi-part multiply.
> > 
> > Note that the cost of the 128bit divide will dwarf the rest of the code.
> > This function is very slow on everything except x86-64 (very very slow
> > on 32bit).
> > 
> > Add kerndoc definitions for all three functions.
> > 
> > Signed-off-by: David Laight <david.laight.linux@...il.com>  
> 
> Reviewed-by: Nicolas Pitre <npitre@...libre.com>
> 
> Sidenote: The 128-bits division cost is proportional to the number of 
> bits in the final result. So if the result is 0x0080000000000000 then 
> the loop will execute only once and exit early.
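
To illustrate the sidenote, here is a minimal user-space sketch of a shift-and-subtract 128-by-64 bit divide whose loop runs once per set bit of the quotient. It is not the code in lib/math/div64.c; the names sketch_bitlen128() and sketch_div_u128_u64() and the iteration counter are made up for the example, it assumes a compiler that provides unsigned __int128 and __builtin_clzll (GCC or Clang on a 64bit target), and it reads "number of bits" as "number of set bits", which matches the 0x0080000000000000 example.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;	/* compiler extension, assumed available */

/* Number of significant bits in a non-zero 128bit value. */
static int sketch_bitlen128(u128 v)
{
	uint64_t hi = (uint64_t)(v >> 64), lo = (uint64_t)v;

	return hi ? 128 - __builtin_clzll(hi) : 64 - __builtin_clzll(lo);
}

/*
 * Hypothetical helper: n / d by shift-and-subtract.
 * Each pass finds the highest remaining quotient bit, so the loop
 * count equals the number of set bits in the quotient.
 * The quotient is assumed to fit in 64 bits.
 */
static uint64_t sketch_div_u128_u64(u128 n, uint64_t d, unsigned int *iters)
{
	uint64_t q = 0;

	assert(d != 0);
	*iters = 0;
	while (n >= d) {
		/* Largest s with (d << s) <= n. */
		int s = sketch_bitlen128(n) - sketch_bitlen128(d);

		if (((u128)d << s) > n)
			s--;
		assert(s < 64);		/* quotient overflow is not handled */
		n -= (u128)d << s;
		q |= (uint64_t)1 << s;
		(*iters)++;
	}
	return q;
}
```

With a dividend of 3 * 0x0080000000000000 and a divisor of 3 the loop body runs once; 100 / 7 = 14 (binary 1110) takes three passes.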

Some performance measurements for the test cases:
0: ok    50    25    19    19    19    19    19    19    19    19 mul_u64_u64_div_u64 
1: ok     2     0     0     0     0     0     0     0     0     0 mul_u64_u64_div_u64 
2: ok     4     4     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
3: ok     4     4     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
4: ok     4     4     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
5: ok    15     8     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
6: ok   275   225   223   223   223   223   223   224   224   223 mul_u64_u64_div_u64 
7: ok     9     6     4     4     4     4     5     5     4     4 mul_u64_u64_div_u64 
8: ok   241   192   187   187   187   187   187   188   187   187 mul_u64_u64_div_u64 
9: ok   212   172   169   169   169   169   169   169   169   169 mul_u64_u64_div_u64 
10: ok    12     6     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
11: ok     9     5     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
12: ok     6     4     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
13: ok     6     5     5     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
14: ok     4     4     5     5     4     4     4     4     4     5 mul_u64_u64_div_u64 
15: ok    18    12     8     8     8     8     8     8     8     8 mul_u64_u64_div_u64 
16: ok    18    11     6     6     6     6     6     6     6     6 mul_u64_u64_div_u64 
17: ok    22    16    11    11    11    11    11    11    11    11 mul_u64_u64_div_u64 
18: ok    25    18     9     9     9     9     9    10     9    10 mul_u64_u64_div_u64 
19: ok   272   231   227   227   226   227   227   227   227   226 mul_u64_u64_div_u64 
20: ok   198   188   185   185   185   186   185   185   186   186 mul_u64_u64_div_u64 
21: ok   207   198   196   196   196   196   196   196   196   196 mul_u64_u64_div_u64 
22: ok   201   189   190   189   190   189   190   189   190   189 mul_u64_u64_div_u64 
23: ok   224   184   181   181   181   181   180   180   181   181 mul_u64_u64_div_u64 
24: ok   238   185   179   179   179   179   179   179   179   179 mul_u64_u64_div_u64 
25: ok   208   178   177   177   177   177   177   177   177   177 mul_u64_u64_div_u64 
26: ok   170   146   139   139   139   139   139   139   139   139 mul_u64_u64_div_u64 
27: ok   256   204   196   196   196   196   196   196   196   196 mul_u64_u64_div_u64 
28: ok   227   195   194   195   194   195   194   195   194   195 mul_u64_u64_div_u64 

Entry 0 is an extra test that measures the test overhead, which is subtracted from the others.
Each column is clocks for one run of the test, but for this set I'm running
the actual test 16 times and later dividing the clock count by 16.
It doesn't make much difference though, so the cost of the
	mfence; rdpmc; mfence; nop_test; mfence; rdpmc; mfence
really is about 190 clocks (between the rdpmc results).

As soon as you hit a non-trivial case the number of clocks increases
dramatically.

This is on a zen5 in 64bit mode ignoring the u128 path.
(I don't have the packages installed to compile a 32bit binary).
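
For reference, the semantics from the patch description, (a * b + c)/d with 'round down' and 'round up' built on top, can be sketched in user space as below. This is only an illustration of the u128 path (the one the measurements above do not use), not the kernel implementation: the sketch_* names are hypothetical, it assumes a compiler with unsigned __int128, and it does not handle a quotient that overflows 64 bits.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;	/* compiler extension, assumed available */

/* Hypothetical sketch: (a * b + c) / d using a 128bit intermediate. */
static uint64_t sketch_mul_add_div(uint64_t a, uint64_t b, uint64_t c,
				   uint64_t d)
{
	/* a * b + c is at most 2^128 - 2^64, so it cannot wrap. */
	u128 n = (u128)a * b + c;

	assert(d != 0);
	/* Cheap case: the sum fits in 64 bits, one 64bit divide will do. */
	if ((n >> 64) == 0)
		return (uint64_t)n / d;
	/* Expensive case: 128bit divide, quotient assumed to fit in 64 bits. */
	return (uint64_t)(n / d);
}

/* Round down: c is a constant zero. */
static uint64_t sketch_mul_div(uint64_t a, uint64_t b, uint64_t d)
{
	return sketch_mul_add_div(a, b, 0, d);
}

/* Round up: add 'divisor - 1' between the multiply and the divide. */
static uint64_t sketch_mul_div_roundup(uint64_t a, uint64_t b, uint64_t d)
{
	return sketch_mul_add_div(a, b, d - 1, d);
}
```

For example 7 * 3 / 2 gives 10 rounded down and 11 rounded up, while 6 * 3 / 2 gives 9 either way. A caller cannot get the rounded-up value from the round-down function alone because the 'd - 1' has to be added to the full 128bit product.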

Maybe I can compile it for arm32, it'll need the mfence and rdpmc changing.
But maybe something simple will be ok on a pi-5.

(oh and yes, I didn't need to include autoconf.h)

	David