Date:   Wed, 12 Jan 2022 21:27:40 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     "'Jason A. Donenfeld'" <Jason@...c4.com>,
        Eric Biggers <ebiggers@...nel.org>
CC:     Linux Crypto Mailing List <linux-crypto@...r.kernel.org>,
        Netdev <netdev@...r.kernel.org>,
        WireGuard mailing list <wireguard@...ts.zx2c4.com>,
        LKML <linux-kernel@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        "Geert Uytterhoeven" <geert@...ux-m68k.org>,
        Theodore Ts'o <tytso@....edu>,
        "Greg Kroah-Hartman" <gregkh@...uxfoundation.org>,
        Jean-Philippe Aumasson <jeanphilippe.aumasson@...il.com>,
        Ard Biesheuvel <ardb@...nel.org>,
        "Herbert Xu" <herbert@...dor.apana.org.au>
Subject: RE: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size
 on small systems

From: Jason A. Donenfeld
> Sent: 12 January 2022 18:51
> 
> On Wed, Jan 12, 2022 at 7:32 PM Eric Biggers <ebiggers@...nel.org> wrote:
> > How about unrolling the inner loop but not the outer one?  Wouldn't that give
> > most of the benefit, without hurting performance as much?
> >
> > If you stay with this approach and don't unroll either loop, can you use 'r' and
> > 'i' instead of 'i' and 'j', to match the naming in G()?
> 
> All this might work, sure. But as mentioned earlier, I've abandoned
> this entirely, as I don't think this patch is necessary. See the v3
> patchset instead:
> 
> https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@zx2c4.com/

I think you mentioned in another thread that the buffers (e.g. for IPv6
addresses) are often quite short.

For short buffers the 'rolled-up' loop may be of similar performance
to the unrolled one because of the time taken to read all the instructions
into the I-cache and decode them.
If the loop ends up small enough it will fit into the 'decoded loop
buffer' of modern Intel x86 CPUs and won't even need decoding on
each iteration.

I suspect that the heavily unrolled loop is only really fast
for big buffers and/or when it is already in the I-cache.
In real life I wonder how often that actually happens,
especially for the uses the kernel is making of the code.

You need to benchmark single executions of the function
(doable on x86 with the performance monitor cycle counter)
to get typical/best clocks/byte figures rather than a
big average for repeated operation on a long buffer.
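
As a rough illustration of timing single executions, here is a minimal
userspace sketch. Everything in it is an assumption made for illustration:
toy_hash() is a hypothetical stand-in (FNV-1a, not BLAKE2s), read_ticks()
and time_single_runs() are made-up helper names, and it reads the TSC via
__rdtsc() rather than the performance monitor cycle counter, so on x86 it
counts reference ticks rather than core clocks. On other architectures it
falls back to clock_gettime() nanoseconds.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Hypothetical stand-in for the function under test (FNV-1a, NOT BLAKE2s). */
static uint32_t toy_hash(const uint8_t *buf, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 16777619u;
	}
	return h;
}

/* TSC ticks on x86; nanoseconds elsewhere. Assumption: not the PMU counter. */
static inline uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
	return __rdtsc();
#else
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

struct single_run_stats {
	uint64_t first;	/* first call: code and data possibly not yet cached */
	uint64_t best;	/* minimum over all individually timed calls */
};

/* Time each call on its own instead of averaging one long loop. */
static struct single_run_stats time_single_runs(const uint8_t *buf,
						size_t len, unsigned int runs)
{
	struct single_run_stats s = { 0, UINT64_MAX };
	volatile uint32_t sink = 0;	/* keeps the calls from being optimised away */

	assert(runs > 0);
	for (unsigned int r = 0; r < runs; r++) {
		uint64_t t0 = read_ticks();
		sink = toy_hash(buf, len);
		uint64_t dt = read_ticks() - t0;

		if (r == 0)
			s.first = dt;
		if (dt < s.best)
			s.best = dt;
	}
	(void)sink;
	return s;
}
```

Calling time_single_runs() on a 16-byte buffer (the size of an IPv6
address) and comparing 'first' against 'best' gives a crude feel for the
gap between a not-yet-cached call and the best case. The first call is only
an approximation of a cold I-cache, and the timer reads themselves add
overhead that matters at this scale.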

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
