[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150430041307.GB20660@openwall.com>
Date: Thu, 30 Apr 2015 07:13:07 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Fastest algorithm shootout with threads...
On Wed, Apr 29, 2015 at 03:47:35PM -0700, Bill Cox wrote:
> Just for fun, I added -p support for TwoCats. I still win the speed
> competition :-)
Cool! How many sequential MULs does TwoCats have per 1 KiB in that test?
> In short, here's the best runtimes, when choosing num threads for highest
> speed at hashing 1GiB:
>
> TwoCats 103 ms
> Argon2d-sse 143 ms
> Yescrypt-2pw-sse 147 ms
> Lyra2-sse 212 ms
The problem is most apps and users won't run additional benchmarks to
choose the optimal number of threads, and even if they do they'd likely
need to be computing the same derived key on an upgraded machine later,
where the old thread count would no longer be the most optimal one.
> Yescrypt does 33% more I/O operations than TwoCats, and uses OpenMP rather
> than pthreads, and uses a 1KiB block size compared to TwoCat's 16KiB block
> size. This explains the difference in speed. Lyra2 does a lot more I/O
> operations than TwoCats, explaining it's bandwidth-limited speed.
> Argon2d-sse also uses a 1KiB block size, and in addition, it calls 16
> reduced Blake2b rounds rather than 12 like Lyra2. I think this explains
> why it's running slower.
Unless I misunderstand something, Argon2d at lowest t_cost only goes
through memory once (when filling it), without the extra 33% you
mentioned for yescrypt. So it could have reached TwoCats' speed, but
somehow did not. I guess the intermediate writes to state[] between the
two groups of 8 BLAKE2b rounds might hurt even with write-back caches.
Are those writes occasionally getting to higher-level caches or RAM?
> Should both Yescrypt and Argon2d to use my distance-cubed distribution?
> Argon2d uses a uniform distribution, and Yescrypt uses a
> sliding-power-of-two distribution.
It's confusing to say that Argon2d uses a uniform distribution. Both
Argon2d and yescrypt use uniform distribution within their target range
(all of memory written so far or the current window, respectively) at
each given moment, but as we know this results in different non-uniform
distributions across the entire memory filling loop's running time.
> This would enhance Argon2d's TMTO
> defense and Yescrypt could probably not run the second loop at all when
> t_cost == 0. That would make Yescrypt-2pw-sse nearly as fast as TwoCats
> when running with enough threads.
Yes, perhaps. It's a pity that TwoCats' TMTO wasn't studied by the
Argon team. Maybe we should have made TwoCats a finalist just for that
privilege. Then we would have greater confidence about your approach.
> These runs filter out some noise using the -r10 (run 10 times) flag. I
> still ran it 3 times and picked the lowest.
>
> Argon2d-sse
> -----------------
>
> $ ./tst-argon2d-sse -p6 -m1045000 -r10 -t0
> 8 32 1045000 0 173 1041464
>
> $ ./tst-argon2d-sse -p12 -m1045000 -r10 -t0
> 8 32 1045000 0 143 1041372
>
> Yescrypt-2pw-sse
> ------------------------
>
> $ ./tst-yescrypt-2pw-sse -p6 -m17 -r10 -t0
> 8 32 17 0 175 1044532
>
> $ ./tst-yescrypt-2pw-sse -p12 -m17 -r10 -t0
> 8 32 17 0 147 1044208
>
> Lyra2-sse
> -------------
>
> $ ./tst-lyra2-sse -p6 -m43700 -t1 -r10
> 8 32 43700 1 212 1044536
>
> $ ./tst-lyra2-sse -p12 -m43700 -t1 -r10
> 8 32 43700 1 236 1044544
>
> TwoCats
> ------------
>
> $ ./tst-twocats -p6 -m20 -t0 -r 10
> 8 32 20 0 103 1044860
>
> $ ./tst-twocats -p12 -m20 -t0 -r 10
> 8 32 20 0 117 1044908
Oh, this is what I wanted to see. I think it's actually a drawback that
Lyra2 and TwoCats exhibit a performance reduction at more threads.
OTOH, this implies they're probably not susceptible to faster CPU
attacks (by interleaving multiple instances) on this machine at this
m_cost. (Argon2d and yescrypt might not be either, but it's not seen
from these specific benchmarks.)
Alexander
Powered by blists - more mailing lists