Message-ID: <57srp3ps-n7p8-orqq-86rq-p04o2246pn7s@syhkavp.arg>
Date: Sun, 7 Jul 2024 21:21:18 -0400 (EDT)
From: Nicolas Pitre <nico@...xnic.net>
To: Arnd Bergmann <arnd@...db.de>
cc: Russell King <linux@...linux.org.uk>, 
    Linux-Arch <linux-arch@...r.kernel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 4/4] __arch_xprod_64(): make __always_inline when
 optimizing for performance

On Sun, 7 Jul 2024, Arnd Bergmann wrote:

> On Sun, Jul 7, 2024, at 21:14, Nicolas Pitre wrote:
> > On Sun, 7 Jul 2024, Arnd Bergmann wrote:
> >
> >> On Sun, Jul 7, 2024, at 19:17, Nicolas Pitre wrote:
> >> > From: Nicolas Pitre <npitre@...libre.com>
> >> >
> >> > Recent gcc versions no longer systematically inline __arch_xprod_64(),
> >> > and that has performance implications. Give the compiler the freedom
> >> > to decide only when optimizing for size.
> >> >
> >> > Signed-off-by: Nicolas Pitre <npitre@...libre.com>
> >> 
> >> Seems reasonable. Just to make sure: do you know if the non-inline
> >> version of __arch_xprod_64() ends up producing a more efficient
> >> division result than the __do_div64() code path on arch/arm?
> >
> > __arch_xprod_64() is part of the __do_div64() code path, so I'm not
> > sure what you're asking.
> >
> > Obviously, having __arch_xprod_64() inlined is faster but it increases 
> > binary size.
> 
> I meant whether calling __div64_const32->__arch_xprod_64() is
> still faster for a constant base when the new __arch_xprod_64()
> is out of line, compared to the __div64_32->__do_div64()
> assembly code path we take for a non-constant base.

Oh, most likely yes. The non-constant base has to go through the whole 
one-bit-at-a-time division loop, whereas a constant base going through 
__div64_const32 boils down to 4 64-bit multiply-and-add operations. 
Moving __arch_xprod_64() out of line adds argument-shuffling overhead 
and means the overflow handling can no longer be skipped, but it still 
comes out well ahead.
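
To illustrate the cross-product in question, here is a portable sketch 
(my own illustration, not the kernel's exact code, and with the bias 
handling omitted): the high 64 bits of a 64x64->128 multiply, built 
from four 32x32->64 multiplies:

    #include <stdint.h>

    /* Sketch of the cross-product at the heart of __arch_xprod_64():
     * the high 64 bits of m * n, computed from four 32x32->64
     * multiplies (bias/rounding handling omitted). */
    static uint64_t xprod64_high(uint64_t m, uint64_t n)
    {
            uint32_t m_lo = (uint32_t)m, m_hi = (uint32_t)(m >> 32);
            uint32_t n_lo = (uint32_t)n, n_hi = (uint32_t)(n >> 32);
            uint64_t lo   = (uint64_t)m_lo * n_lo;
            uint64_t mid1 = (uint64_t)m_lo * n_hi;
            uint64_t mid2 = (uint64_t)m_hi * n_lo;
            uint64_t hi   = (uint64_t)m_hi * n_hi;
            /* fold the middle partial products together with the
             * carry out of the low 64 bits */
            uint64_t carry = (lo >> 32) + (uint32_t)mid1 + (uint32_t)mid2;
            return hi + (mid1 >> 32) + (mid2 >> 32) + (carry >> 32);
    }

When this is inlined with a compile-time-constant m, the compiler can 
fold several of those steps away; out of line, everything runs 
unconditionally, which is the overhead mentioned above.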

Here are some numbers. With the latest patches using __always_inline:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.048285584s elapsed

The same patches, but with __always_inline left out:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.053023584s elapsed

Forcing both constant and non-constant base through the same path:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.103263776s elapsed

It is worth noting that test_div64 already performs half of its testing 
with non-constant divisors, so the impact on the constant-divisor path 
is greater than those overall numbers suggest.

And for what it is worth, those numbers were obtained using QEMU. The 
gcc version is 14.1.0.
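
For reference, the shape of the change being discussed is roughly the 
following (a sketch of the pattern only; see the actual patch for the 
exact code):

    /* Keep __arch_xprod_64() always inlined unless the kernel is
     * built to optimize for size, in which case the compiler is
     * free to decide. */
    #ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
    static __always_inline
    #else
    static inline
    #endif
    uint64_t __arch_xprod_64(const uint64_t m, uint64_t n, bool bias)
    {
            /* ... the 32x32->64 cross-product shown earlier ... */
    }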


Nicolas
