Message-ID: <05853n16-s64r-6976-q763-p9262p5o176n@syhkavp.arg>
Date: Mon, 19 Jan 2026 14:44:51 -0500 (EST)
From: Nicolas Pitre <nico@...xnic.net>
To: David Laight <david.laight.linux@...il.com>
cc: Andrew Morton <akpm@...ux-foundation.org>, 
    Eric Dumazet <edumazet@...gle.com>, 
    linux-kernel <linux-kernel@...r.kernel.org>, netdev@...r.kernel.org, 
    Jakub Kicinski <kuba@...nel.org>, Eric Dumazet <eric.dumazet@...il.com>, 
    Paolo Abeni <pabeni@...hat.com>
Subject: Re: [PATCH] compiler_types: Introduce inline_for_performance

On Mon, 19 Jan 2026, David Laight wrote:

> On Mon, 19 Jan 2026 10:47:51 -0500 (EST)
> Nicolas Pitre <nico@...xnic.net> wrote:
> 
> > On Sun, 18 Jan 2026, David Laight wrote:
> > 
> > > On 32bit you probably don't want to inline __arch_xprod_64(), but you do
> > > want to pass (bias ? m : 0) and may want separate functions for the
> > > 'no overflow' case (if it is common enough to worry about).  
> > 
> > You do want to inline it. Performance quickly degrades otherwise.
> 
> If it isn't inlined you want a real C function in div.c (or similar),
> not the compiler generating a separate body in the object file of each
> file that uses it.

Yes, you absolutely do want it inlined in this very particular case. This
relies on a long sequence of code that collapses to only a few assembly
instructions thanks to constant propagation. But most of the time gcc is
not smart enough to realize that on its own (strangely enough, it used to
be fine more than 10 years ago). The resulting out-of-line function is not
only slower but also produces bigger code because of the argument-passing
overhead.
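
To illustrate the kind of code that benefits from this, here is a minimal
sketch, not taken from the kernel sources. The names mulhi64_sketch() and
div3_sketch() are hypothetical, and the always_inline attribute is the
GCC/Clang spelling. The multiplier is a compile-time constant at the call
site, so once the helper is inlined the compiler can fold its two 32-bit
halves into immediates. If the helper is left out of line, both 64-bit
arguments have to be passed at run time.

```c
#include <stdint.h>

/*
 * Hypothetical helper: high 64 bits of the 128-bit product m * n,
 * computed from 32-bit pieces as a 32-bit target would have to.
 */
static inline __attribute__((always_inline))
uint64_t mulhi64_sketch(const uint64_t m, uint64_t n)
{
	uint32_t m_lo = (uint32_t)m, m_hi = (uint32_t)(m >> 32);
	uint32_t n_lo = (uint32_t)n, n_hi = (uint32_t)(n >> 32);

	uint64_t ll = (uint64_t)m_lo * n_lo;
	uint64_t lh = (uint64_t)m_lo * n_hi;
	uint64_t hl = (uint64_t)m_hi * n_lo;
	uint64_t hh = (uint64_t)m_hi * n_hi;

	/* Sum of three values below 2^32, so this cannot overflow 64 bits. */
	uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

	return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

/*
 * Hypothetical caller: n / 3 by reciprocal multiplication.
 * 0xAAAAAAAAAAAAAAAB is (2^65 + 1) / 3, so n / 3 == (m * n) >> 65.
 */
static inline uint64_t div3_sketch(uint64_t n)
{
	return mulhi64_sketch(0xAAAAAAAAAAAAAAABULL, n) >> 1;
}
```

Comparing the generated assembly for a 32-bit target with and without the
always_inline attribute is one way to see the difference being described.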

> > And __arch_xprod_64() exists only for 32bit btw.
> 
> I wonder how much of a mess gcc makes of that code.
> I added asm functions for u64 mul_add(u32 a, u32 b, u32 c) calculating
> a * b + c without explicit zero extending any of the 32 bit values.
> Without that gcc runs out of registers and starts spilling to stack
> instead of just generating 'mul; add; adc $0'.

Here this is different. Let me copy the definition:

* Prototype: uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
* Semantic:  retval = ((bias ? m : 0) + m * n) >> 64
* 
* The product is a 128-bit value, scaled down to 64 bits.
* Hoping for compile-time optimization of conditional code.
* Architectures may provide their own optimized assembly implementation.

ARM32 provides its own definition. Last time I checked, RV32 already 
produced optimal code from the default C implementation.
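
For readers following along, the semantic quoted above can be written
out in portable C roughly as below. This is only a sketch under my own
assumptions, not the default C implementation from the kernel: the name
xprod_64_sketch() is hypothetical, and the 128-bit sum is accumulated as
two 64-bit halves with explicit carries.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of: retval = ((bias ? m : 0) + m * n) >> 64
 * The full sum is at most (2^64 - 1) * 2^64, so it fits in 128 bits
 * and the high half below never overflows.
 */
static inline uint64_t xprod_64_sketch(const uint64_t m, uint64_t n, bool bias)
{
	uint32_t m_lo = (uint32_t)m, m_hi = (uint32_t)(m >> 32);
	uint32_t n_lo = (uint32_t)n, n_hi = (uint32_t)(n >> 32);

	uint64_t lh = (uint64_t)m_lo * n_hi;	/* weight 2^32 */
	uint64_t hl = (uint64_t)m_hi * n_lo;	/* weight 2^32 */
	uint64_t lo = (uint64_t)m_lo * n_lo;	/* low 64 bits of the sum */
	uint64_t hi = (uint64_t)m_hi * n_hi;	/* high 64 bits of the sum */
	uint64_t t;

	/* With a constant 'bias' this branch can vanish at compile time. */
	if (bias) {
		lo += m;
		hi += (lo < m);		/* carry out of the low half */
	}

	/* Add lh << 32 into the 128-bit accumulator. */
	t = lh << 32;
	lo += t;
	hi += (lo < t);
	hi += lh >> 32;

	/* Add hl << 32 into the 128-bit accumulator. */
	t = hl << 32;
	lo += t;
	hi += (lo < t);
	hi += hl >> 32;

	return hi;
}
```

With m and bias known at compile time, the conditional and some of the
partial products are candidates for folding once the function is inlined.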

> But 64bit systems without a 64x64=>128 multiply (ie without u128
> support) also need the 'multiply in 32bit chunks' code.

Again this is only for 32-bit systems. 64-bit systems use none of that.


Nicolas
