lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 5 Oct 2018 17:16:08 +0200
From:   Ard Biesheuvel <ard.biesheuvel@...aro.org>
To:     "Jason A. Donenfeld" <Jason@...c4.com>,
        Ard Biesheuvel <ard.biesheuvel@...aro.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        "<netdev@...r.kernel.org>" <netdev@...r.kernel.org>,
        "open list:HARDWARE RANDOM NUMBER GENERATOR CORE" 
        <linux-crypto@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Samuel Neves <sneves@....uc.pt>,
        Andy Lutomirski <luto@...nel.org>,
        Jean-Philippe Aumasson <jeanphilippe.aumasson@...il.com>,
        Russell King <linux@...linux.org.uk>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        peter@...ptojedi.org
Subject: Re: [PATCH net-next v6 19/23] zinc: Curve25519 ARM implementation

On 5 October 2018 at 17:05, D. J. Bernstein <djb@...yp.to> wrote:
> For the in-order ARM Cortex-A8 (the target for this code), adjacent
> multiply-add instructions forward summands quickly. A simple in-order
> dot-product computation has no latency problems, while interleaving
> computations, as suggested in this thread, creates problems. Also, on
> this microarchitecture, occasional ARM instructions run in parallel with
> NEON, so trying to manually eliminate ARM instructions through global
> pointer tracking wouldn't gain speed; it would simply create unnecessary
> code-maintenance problems.
>
> See https://cr.yp.to/papers.html#neoncrypto for analysis of the
> performance of---and remaining bottlenecks in---this code. Further
> speedups should be possible on this microarchitecture, but, for anyone
> interested in this, I recommend focusing on building a cycle-accurate
> simulator (e.g., fixing inaccuracies in the Sobole simulator) first.
>
> Of course, there are other ARM microarchitectures, and there are many
> cases where different microarchitectures prefer different optimizations.
> The kernel already has boot-time benchmarks for different optimizations
> for raid6, and should do the same for crypto code, so that implementors
> can focus on each microarchitecture separately rather than living in the
> barbaric world of having to choose which CPUs to favor.
>

Thanks Dan for the insight.

We have already established in a separate discussion that Cortex-A7,
which is main optimization target for future development, does not
have the microarchitectural peculiarity that you are referring to that
ARM instructions are essentially free when interleaved with NEON code.

But I take your point re benchmarking (as I already indicated in my
reply to Jason): if we optimize towards speed, we should ideally reuse
the existing benchmarking infrastructure we have to select the fastest
implementation at runtime. For instance, it turns out that scalar
ChaCha20 is almost as fast as NEON (or even faster?) on A7, and using
NEON in the kernel has some issues of its own.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ