Message-ID: <05853n16-s64r-6976-q763-p9262p5o176n@syhkavp.arg>
Date: Mon, 19 Jan 2026 14:44:51 -0500 (EST)
From: Nicolas Pitre <nico@...xnic.net>
To: David Laight <david.laight.linux@...il.com>
cc: Andrew Morton <akpm@...ux-foundation.org>,
Eric Dumazet <edumazet@...gle.com>,
linux-kernel <linux-kernel@...r.kernel.org>, netdev@...r.kernel.org,
Jakub Kicinski <kuba@...nel.org>, Eric Dumazet <eric.dumazet@...il.com>,
Paolo Abeni <pabeni@...hat.com>
Subject: Re: [PATCH] compiler_types: Introduce inline_for_performance

On Mon, 19 Jan 2026, David Laight wrote:

> On Mon, 19 Jan 2026 10:47:51 -0500 (EST)
> Nicolas Pitre <nico@...xnic.net> wrote:
>
> > On Sun, 18 Jan 2026, David Laight wrote:
> >
> > > On 32bit you probably don't want to inline __arch_xprod_64(), but you do
> > > want to pass (bias ? m : 0) and may want separate functions for the
> > > 'no overflow' case (if it is common enough to worry about).
> >
> > You do want to inline it. Performance quickly degrades otherwise.
>
> If it isn't inlined you want a real C function in div.c (or similar),
> not the compiler generating a separate body in the object file of each
> file that uses it.

Yes, you absolutely do want it inlined in this very particular case. This
relies on a long sequence of code that collapses to only a few assembly
instructions through constant propagation. But most of the time gcc is
not smart enough to realize that on its own (strangely enough, it used to
be fine more than 10 years ago). The out-of-line function is not only
slower but actually produces bigger code because of the argument-passing
overhead.

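
To illustrate the constant-propagation point, here is a minimal sketch.
It is not the kernel's code: scale_const() and div10_by_mul() are
hypothetical names, and the always_inline attribute assumes GCC or Clang.
When every argument except x is a compile-time constant and the helper is
inlined, the compiler can drop the rounding branch and is left with a
multiply and a shift. An out-of-line copy would have to keep the branch
and take all four arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: multiply by a reciprocal, then shift.
 * With constant mul/shift/round and forced inlining, the "round"
 * branch can be resolved at compile time.
 */
static inline __attribute__((always_inline))
uint32_t scale_const(uint32_t x, uint32_t mul, unsigned int shift, bool round)
{
	uint64_t t = (uint64_t)x * mul;

	if (round && shift > 0)
		t += 1ULL << (shift - 1);
	return (uint32_t)(t >> shift);
}

/* x / 10 for any 32-bit x, using the well-known reciprocal 0xCCCCCCCD. */
static inline uint32_t div10_by_mul(uint32_t x)
{
	return scale_const(x, 0xCCCCCCCDu, 35, false);
}
```

Whether a given compiler actually inlines such a helper without the
attribute is exactly what cannot be taken for granted.
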
> > And __arch_xprod_64() exists only for 32bit btw.
>
> I wonder how much of a mess gcc makes of that code.
> I added asm functions for u64 mul_add(u32 a, u32 b, u32 c) calculating
> a * b + c without explicit zero extending any of the 32 bit values.
> Without that gcc runs out of registers and starts spilling to stack
> instead of just generating 'mul; add; adc $0'.

This case is different. Let me copy the definition:

* Prototype: uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
* Semantic: retval = ((bias ? m : 0) + m * n) >> 64
*
* The product is a 128-bit value, scaled down to 64 bits.
* Hoping for compile-time optimization of conditional code.
* Architectures may provide their own optimized assembly implementation.

ARM32 provides its own definition. Last time I checked, RV32 already
produced optimal code from the default C implementation.

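
For readers who want to check the semantic quoted above, here is a
sketch of two models of it. Neither is the kernel implementation, and
both function names are hypothetical. xprod_64_ref() uses the GCC/Clang
unsigned __int128 extension, so it assumes a 64-bit host.
xprod_64_chunks() computes the same value from 32-bit pieces, which is
roughly the kind of long sequence a 32-bit target has to work through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reference model: ((bias ? m : 0) + m * n) >> 64, done in 128 bits. */
static inline uint64_t xprod_64_ref(uint64_t m, uint64_t n, bool bias)
{
	unsigned __int128 prod = (unsigned __int128)m * n;

	if (bias)
		prod += m;	/* cannot overflow 128 bits */
	return (uint64_t)(prod >> 64);
}

/* Same value built from four 32x32->64 partial products. */
static inline uint64_t xprod_64_chunks(uint64_t m, uint64_t n, bool bias)
{
	uint32_t m_lo = (uint32_t)m, m_hi = (uint32_t)(m >> 32);
	uint32_t n_lo = (uint32_t)n, n_hi = (uint32_t)(n >> 32);

	uint64_t lo_lo = (uint64_t)m_lo * n_lo;
	uint64_t lo_hi = (uint64_t)m_lo * n_hi;
	uint64_t hi_lo = (uint64_t)m_hi * n_lo;
	uint64_t hi_hi = (uint64_t)m_hi * n_hi;

	/* Bits 0..31: only the carry out of this column matters. */
	uint64_t s0 = (uint64_t)(uint32_t)lo_lo + (bias ? m_lo : 0);

	/* Bits 32..63: sum the column and keep its carry. */
	uint64_t s1 = (s0 >> 32) + (lo_lo >> 32) +
		      (uint32_t)lo_hi + (uint32_t)hi_lo +
		      (bias ? m_hi : 0);

	/* Bits 64..127: the part that is returned. */
	return hi_hi + (lo_hi >> 32) + (hi_lo >> 32) + (s1 >> 32);
}
```

When m and bias are compile-time constants and the function is inlined,
several of these terms are known at build time, which is what the
"hoping for compile-time optimization" line in the comment refers to.
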
> But 64bit systems without a 64x64=>128 multiply (ie without u128
> support) also need the 'multiply in 32bit chunks' code.

Again, this is only for 32-bit systems. 64-bit systems use none of that.


Nicolas