Date: Thu, 23 Apr 2015 13:05:20 +0300
From: Solar Designer
Subject: Re: [PHC] enhancing Argon2 (was: Competition process)


On Tue, Apr 21, 2015 at 05:33:18AM +0300, Solar Designer wrote:
> On Mon, Apr 20, 2015 at 11:08:42PM +0200, Dmitry Khovratovich wrote:
> > Regarding your benchmarks, the impact of extra operations on 8 threads
> > is so low, I guess, because the bottleneck is the memory bandwidth
> > rather than the CPU.
> Right.  I expect the impact at low m_cost (where all or a significant
> portion of memory fits in a cache) to be much worse.  We need to run
> such benchmarks as well, and then decide on the un-pwxform rounds count.
> I actually thought of having it vary by m_cost in yescrypt, but decided
> against that so far because it would be non-intuitive.  So I think a
> single default should be chosen for Argon2 as well.

I think that besides memory bandwidth, there's another reason why the
performance impact of the MAXFORM chain (shall we call it that?) is
lower when running more threads:

When running 2 threads per Intel's core or AMD's module, the scalar and
SIMD instructions are mixed within that core's or module's reorder
buffer more evenly.  While gcc does a pretty good job at instruction
scheduling, and modern CPUs' reorder buffers are pretty large, there's
probably room for improvement in the single-thread case.

You might have noticed that I experimented with relevant gcc flags in
the Makefile, leaving it with "-fsched2-use-superblocks
-fsched-stalled-insns=0" in the patch that I posted.  Those two options
provide an improvement of less than 1% for me, so you should probably not
include them.  However, you may look into mixing the MAXFORM rounds with
BLAKE2 ones at a lower level, perhaps introducing them into a revised
BLAKE2_ROUND macro (which you'd use only in ComputeBlock).  It currently
includes G1, G2, DIAGONALIZE, G1, G2, UNDIAGONALIZE - giving you up to 6
convenient opportunities to insert uses of MAXFORM_ROUND in there.  This
will probably result in a more even mix of instructions in the compiled
code, which should help single-thread performance as well as in-order
CPUs.
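
A minimal scalar sketch of that interleaving follows.  It is not the
posted patch and not Argon2's actual SIMD code: the 4x4 layout of 64-bit
words, the helper names, and in particular the body of MAXFORM_ROUND (a
32x32->64 multiply-add on an independent scalar chain) are assumptions
made only for illustration.  It shows one MAXFORM_ROUND placed after each
of the 6 steps of a BLAKE2-style round, and checks that, as long as the
two chains do not share data, the interleaved order gives the same result
as running them back to back.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}

/* Message-less BLAKE2b half-G applied to all four columns of a 4x4 state:
 * rows are a = v[0..3], b = v[4..7], c = v[8..11], d = v[12..15]. */
static void g_half(uint64_t *v, unsigned r1, unsigned r2)
{
    for (int i = 0; i < 4; i++) {
        v[i] += v[4 + i];
        v[12 + i] = rotr64(v[12 + i] ^ v[i], r1);
        v[8 + i] += v[12 + i];
        v[4 + i] = rotr64(v[4 + i] ^ v[8 + i], r2);
    }
}
#define G1(v) g_half((v), 32, 24)
#define G2(v) g_half((v), 16, 63)

/* Rotate one 4-word row left by n positions. */
static void rotate_row(uint64_t *row, int n)
{
    uint64_t t[4];
    for (int i = 0; i < 4; i++)
        t[i] = row[(i + n) & 3];
    memcpy(row, t, sizeof(t));
}
#define DIAGONALIZE(v) \
    (rotate_row((v) + 4, 1), rotate_row((v) + 8, 2), rotate_row((v) + 12, 3))
#define UNDIAGONALIZE(v) \
    (rotate_row((v) + 4, 3), rotate_row((v) + 8, 2), rotate_row((v) + 12, 1))

/* HYPOTHETICAL stand-in for MAXFORM_ROUND: low half times high half, plus y.
 * The real definition is not given in this message. */
#define MAXFORM_ROUND(x, y) \
    ((x) = (uint64_t)(uint32_t)(x) * (uint32_t)((x) >> 32) + (y))

/* Baseline: the whole BLAKE2-style round, then all 6 scalar rounds. */
static void round_sequential(uint64_t *v, uint64_t *x, const uint64_t *y)
{
    G1(v); G2(v); DIAGONALIZE(v); G1(v); G2(v); UNDIAGONALIZE(v);
    for (int i = 0; i < 6; i++)
        MAXFORM_ROUND(*x, y[i]);
}

/* Interleaved: one scalar round after each of the 6 steps. */
static void round_interleaved(uint64_t *v, uint64_t *x, const uint64_t *y)
{
    G1(v);            MAXFORM_ROUND(*x, y[0]);
    G2(v);            MAXFORM_ROUND(*x, y[1]);
    DIAGONALIZE(v);   MAXFORM_ROUND(*x, y[2]);
    G1(v);            MAXFORM_ROUND(*x, y[3]);
    G2(v);            MAXFORM_ROUND(*x, y[4]);
    UNDIAGONALIZE(v); MAXFORM_ROUND(*x, y[5]);
}

/* Returns 0 if both orderings produce identical state (asserts otherwise). */
int interleave_self_check(void)
{
    uint64_t v1[16], v2[16], y[6];
    uint64_t x1 = 0x0123456789abcdefULL, x2 = x1;

    for (int i = 0; i < 16; i++)
        v1[i] = v2[i] = 0x9e3779b97f4a7c15ULL * (uint64_t)(i + 1);
    for (int i = 0; i < 6; i++)
        y[i] = 0xc2b2ae3d27d4eb4fULL + (uint64_t)i;

    round_sequential(v1, &x1, y);
    round_interleaved(v2, &x2, y);

    assert(memcmp(v1, v2, sizeof(v1)) == 0);
    assert(x1 == x2);
    return 0;
}
```

Calling interleave_self_check() from a test program only confirms that
the reordering is safe when the scalar chain is independent of the block
state.  Whether it actually yields a better instruction mix depends on
the compiler and CPU and would have to be measured, ideally with the
steps written as macros or intrinsics so that they end up in one basic
block.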

