Message-Id: <3adf7f7b-a46e-4368-a87a-a217a8a8f9d1@app.fastmail.com>
Date: Fri, 05 Jul 2024 10:17:18 +0200
From: "Arnd Bergmann" <arnd@...db.de>
To: "Nicolas Pitre" <nico@...xnic.net>, "Russell King" <linux@...linux.org.uk>
Cc: "Nicolas Pitre" <npitre@...libre.com>,
Linux-Arch <linux-arch@...r.kernel.org>, linux-kernel@...r.kernel.org,
llvm@...ts.linux.dev, "Nathan Chancellor" <nathan@...nel.org>
Subject: Re: [PATCH 2/2] asm-generic/div64: reimplement __arch_xprod64()

On Fri, Jul 5, 2024, at 04:20, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@...libre.com>
>
> Several years later I just realized that this code could be optimized
> and more importantly simplified even further. With some reordering, it
> is possible to dispense with overflow handling entirely and still have
> optimal code.
>
> There is also no longer a reason to have the possibility for
> architectures to override the generic version. Only ARM did it and these
> days the compiler does a better job than the hand-crafted assembly
> version anyway.
>
> Kernel binary gets slightly smaller as well. Using the ARM's
> versatile_defconfig plus CONFIG_TEST_DIV64=y:
>
> Before this patch:
>
>    text    data     bss      dec    hex filename
> 9644668 2743926  193424 12582018 bffc82 vmlinux
>
> With this patch:
>
>    text    data     bss      dec    hex filename
> 9643572 2743926  193424 12580922 bff83a vmlinux
>
> Signed-off-by: Nicolas Pitre <npitre@...libre.com>

This looks really nice, thanks for the work!

I've tried reproducing your finding to see what compiler
version started being good enough to benefit from the
new version. Looking at just the vmlinux size as you did
above, I can confirm that the generated code is noticeably
smaller in gcc-11 and above, slightly smaller in gcc-10
but larger in gcc-9 and below. With gcc-10 being 4 years
old now and already in Debian 'oldstable', that should be
good enough.

Unfortunately, I see that clang-19 still produces smaller
arm code with the old version, so clang is likely missing
some optimization that went into gcc. Specifically, these
are the numbers I see for an armv7 defconfig with many
drivers disabled for faster builds, comparing the current
upstream version with inline asm, the upstream C version
(patch 1/2 applied) and the new C version (both applied):

   text    data     bss     dec    hex filename
6332190 2577094  257664 9166948 8be064 vmlinux-old-asm
6334518 2577158  257664 9169340 8be9bc vmlinux-old-C
6333366 2577158  257664 9168188 8be53c vmlinux-new-C

The results for clang-14 are very similar. Adding Nathan
and the llvm linux mailing list to see if anyone there
thinks we need to dig deeper on whether llvm should handle
this better.

I also checked a few other 32-bit targets with gcc-14
and found that mips and powerpc get slightly worse with
your new version, while x86 doesn't use this code and
is unaffected.

With all this said, I think we still want your patch
or something very close to it because the new version
is so much more readable and better on the one 32-bit
config that users care about in practice (armv7 with
modern gcc), but it would be nice if we could find
a way to not make it worse for the other configurations.

     Arnd
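
For readers without the patch at hand: the helper under discussion
computes, as far as I understand it, the upper 64 bits of a 64x64-bit
product using only 32x32->64 multiplies. The block below is a minimal
sketch of that general technique, not the code from the patch. The name
mul64_hi() is hypothetical, the bias argument of __arch_xprod64() is left
out, and the ordering shown is just one way to keep every intermediate
sum inside 64 bits so that no separate overflow handling is needed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): return the upper 64 bits
 * of the 128-bit product m * n, using four 32x32->64 multiplies.
 */
uint64_t mul64_hi(uint64_t m, uint64_t n)
{
	uint32_t m_lo = (uint32_t)m, m_hi = (uint32_t)(m >> 32);
	uint32_t n_lo = (uint32_t)n, n_hi = (uint32_t)(n >> 32);
	uint64_t lo_lo, t, u;

	lo_lo = (uint64_t)m_lo * n_lo;

	/* (2^32-1)^2 + (2^32-1) = 2^64 - 2^32, so this cannot overflow. */
	t = (uint64_t)m_lo * n_hi + (lo_lo >> 32);

	/* Same bound: a product plus one 32-bit value fits in 64 bits. */
	u = (uint64_t)m_hi * n_lo + (uint32_t)t;

	/* (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, which still fits. */
	return (uint64_t)m_hi * n_hi + (t >> 32) + (u >> 32);
}

/* A few hand-checkable cases. */
void mul64_hi_examples(void)
{
	assert(mul64_hi(UINT64_C(1) << 32, UINT64_C(1) << 32) == 1);
	assert(mul64_hi(UINT64_C(0x8000000000000000), 2) == 1);
	assert(mul64_hi(UINT64_MAX, UINT64_MAX) == UINT64_C(0xfffffffffffffffe));
}
```

How well a compiler turns a sequence like this into 32-bit multiply and
multiply-accumulate instructions is the kind of thing the size
comparison above is measuring.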