Message-ID: <57srp3ps-n7p8-orqq-86rq-p04o2246pn7s@syhkavp.arg>
Date: Sun, 7 Jul 2024 21:21:18 -0400 (EDT)
From: Nicolas Pitre <nico@...xnic.net>
To: Arnd Bergmann <arnd@...db.de>
cc: Russell King <linux@...linux.org.uk>,
Linux-Arch <linux-arch@...r.kernel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 4/4] __arch_xprod64(): make __always_inline when
optimizing for performance
On Sun, 7 Jul 2024, Arnd Bergmann wrote:
> On Sun, Jul 7, 2024, at 21:14, Nicolas Pitre wrote:
> > On Sun, 7 Jul 2024, Arnd Bergmann wrote:
> >
> >> On Sun, Jul 7, 2024, at 19:17, Nicolas Pitre wrote:
> >> > From: Nicolas Pitre <npitre@...libre.com>
> >> >
> >> > Recent gcc versions no longer systematically inline __arch_xprod64()
> >> > and that has performance implications. Give the compiler the freedom to
> >> > decide only when optimizing for size.
> >> >
> >> > Signed-off-by: Nicolas Pitre <npitre@...libre.com>
> >>
> >> Seems reasonable. Just to make sure: do you know if the non-inline
> >> version of xprod_64 ends up producing a more efficient division
> >> result than the __do_div64() code path on arch/arm?
> >
> > __arch_xprod_64() is part of the __do_div64() code path. So I'm not sure
> > of your question.
> >
> > Obviously, having __arch_xprod_64() inlined is faster but it increases
> > binary size.
>
> I meant whether calling __div64_const32->__arch_xprod_64() is
> still faster for a constant base when the new __arch_xprod_64()
> is out of line, compared to the __div64_32->__do_div64()
> assembly code path we take for a non-constant base.

Oh, most likely yes. The non-constant base has to go through the whole
one-bit-at-a-time division loop, whereas the constant base with
__div64_const32 results in four 64-bit multiply-and-add operations. Moving
__arch_xprod_64() out of line adds argument-shuffling overhead, and
it can't skip overflow handling, but it should still come out ahead.
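
To make the difference between the two paths concrete, here is a rough
userspace sketch. It is not the kernel code: the function names are
hypothetical, the reciprocal is simply taken as UINT64_MAX / base, and a
single correction step replaces the bias handling that the real
__div64_const32 works out at compile time. It only shows why a 64-iteration
loop costs more than a handful of multiplies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: one-bit-at-a-time (restoring) division.
 * Always runs 64 iterations, whatever the operands are. */
static uint64_t div64_32_bitwise(uint64_t n, uint32_t base, uint32_t *rem)
{
	uint64_t q = 0, r = 0;

	assert(base != 0);
	for (int i = 63; i >= 0; i--) {
		r = (r << 1) | ((n >> i) & 1);	/* bring down the next bit */
		if (r >= base) {
			r -= base;
			q |= 1ULL << i;
		}
	}
	*rem = (uint32_t)r;
	return q;
}

/* High 64 bits of a 64x64 product, built from four 32x32->64 multiplies. */
static uint64_t mul64_hi(uint64_t a, uint64_t b)
{
	uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
	uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
	uint64_t ll = a_lo * b_lo;
	uint64_t lh = a_lo * b_hi;
	uint64_t hl = a_hi * b_lo;
	uint64_t hh = a_hi * b_hi;
	/* Middle column: cannot overflow, at most 3 * (2^32 - 1). */
	uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

	return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

/* Hypothetical name: division by a known base using a precomputed
 * reciprocal (assumed here to be UINT64_MAX / base). The estimate is
 * either exact or one too small, so one correction step is enough. */
static uint64_t div64_by_recip(uint64_t n, uint32_t base, uint64_t recip,
			       uint32_t *rem)
{
	uint64_t q = mul64_hi(n, recip);
	uint64_t r = n - q * base;

	if (r >= base) {
		q++;
		r -= base;
	}
	*rem = (uint32_t)r;
	return q;
}
```

When base is a compile-time constant, the reciprocal folds into a constant
as well, so only the multiplies and the correction remain at run time.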

Here are some numbers. With the latest patches using __always_inline:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.048285584s elapsed

Latest patches but with __always_inline left out:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.053023584s elapsed

Forcing both constant and non-constant bases through the same path:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.103263776s elapsed

It is worth noting that test_div64 already does half of the test with
non-constant divisors, so the actual impact is greater than what those
numbers show.

And for what it is worth, those numbers were obtained using QEMU. The
gcc version is 14.1.0.
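
For anyone who wants to try the inline versus out-of-line comparison
outside the kernel, the idea of forcing inlining only when optimizing for
performance can be sketched as below. This is an illustration, not the
patch itself: OPTIMIZE_FOR_PERFORMANCE, XPROD_INLINE and mul_add_sketch()
are hypothetical names standing in for the kernel config option and
helper, and the attribute is the GCC/Clang always_inline function
attribute.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical build switch: force inlining when optimizing for speed,
 * otherwise leave the decision to the compiler. */
#ifdef OPTIMIZE_FOR_PERFORMANCE
#define XPROD_INLINE static inline __attribute__((__always_inline__))
#else
#define XPROD_INLINE static inline
#endif

/* Hypothetical helper: one 32x32->64 multiply-and-add step. */
XPROD_INLINE uint64_t mul_add_sketch(uint32_t a, uint32_t b, uint64_t acc)
{
	/* The caller must make sure the sum cannot wrap. */
	assert(acc <= UINT64_MAX - (uint64_t)a * b);
	return (uint64_t)a * b + acc;
}
```

Building the same file with and without -DOPTIMIZE_FOR_PERFORMANCE, then
comparing the generated code, shows whether the compiler kept the helper
inline on its own.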
Nicolas