lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 29 Aug 2014 15:12:47 -0400
From: Bill Cox <>
Subject: Re: [PHC] A review per day - TwoCats

Hash: SHA1

On 08/28/2014 09:27 PM, Solar Designer wrote:
> On Thu, Aug 28, 2014 at 07:38:42PM -0400, Bill Cox wrote:
>> One thing I screwed up was how to use t_cost.
> You can probably fix that during the tweaks period, if TwoCats
> becomes a PHC finalist.

I just changed my local version of TwoCats to iterate the last
unpredictable slice hashing 2^t_cost times.  It did not increase the
complexity of the reference version, and it simplified the SIMD
optimized version.  Running out of L1 cache, hashing 16KiB and
repeating the last 4KiB 2^20 times on one thread, 4 lanes, with 4KiB
block size, but a 16 byte sub-block size takes 0.895s.

twocats> time twocats -m 4 -t20 -P1 -l4 -b4096 -B16 -M0
hash:blake2s memCost:4 timeCost:20 multiplies:0 lanes:4 parallelism:1
algorithm:twocats-extended password:password salt:salt blockSize:4096

79 58 86 5c a1 e3 b4 ed
3a 08 74 0f 0f 6d 5c 09
35 00 3f b9 4b 02 4b 69
b7 d7 7a c8 62 38 65 69      32 (octets)

real	0m0.888s
user	0m0.887s
sys	0m0.002s

I also just tested it on your Haswell machine.  Thanks for access!

twocats> time twocats -m 5 -t20 -l8 -b1024 -B32 -M0 -P8
hash:blake2s memCost:5 timeCost:20 multiplies:0 lanes:8 parallelism:8
algorithm:twocats-extended password:password salt:salt blockSize:1024

c8 39 15 ac f8 22 3b 9b
e9 af 65 98 b8 86 d3 11
93 70 ce 2f 10 f1 d9 03
db 4c b6 b1 21 94 2a 22      32 (octets)

real	0m0.579s
user	0m4.416s
sys	0m0.000s

This is a total of 356GiB/s hashed.  Not too bad :-)  If TwoCats
survives to the next round I'll ask if I can submit the update.

>> That's all I have for now!  Please feel free to dump on TwoCats!
>> It's my turn :-)
> OK, you asked for it.  I don't recall all of my complaints about
> TwoCats anymore, but I think its small random lookups, which are
> meant to be bcrypt-like, might not actually provide as much GPU
> resistance as bcrypt's did.  I think TwoCats makes them 256-bit
> only, meaning that on pre-AVX2 CPUs they might not be frequent
> enough, and you might not be making enough of them per byte
> read/written to the large arena.

This was a dumb oversight on my part that was easy to fix.  I simply
need to allow subBlock size to be any multiple of the number of
lanes*4, rather than forcing it to a multiple of 64.  For Sandy Bridge
and Ivy Bridge, there is no reason I can't allow sub-block sizes of 16
bytes, and for older mmx processors with only 64-bit SIMD registers,
allowing block sizes of 8 bytes makes sense.  For architectures with
no SIMD unit at all, there is no reason I can't allow 4 byte sub-blocks.

This was a 1-line change in my local version of TwoCats in
twocats-common.c, in the verifyParameters function.  All I had to do
was insure subBlockSize was at least lanes*4, rather than 64.  The
benchmarks above make use of this fixed sub-block size.

> Have you actually checked TwoCats vs. defense-optimized bcrypt
> running on the same CPU when exhausting the machine's capacity
> (e.g., 8 independent concurrent hash computations on your i7-3770),
> in terms of frequency of those small random lookups?

I haven't benchmarked bcrypt yet.  I don't know enough about the code
to know if I'm running the latest SSE optimized version, etc.
However, from the first run above, I'm doing 2.8B 16-byte random reads
per second on Ivy Bridge.

> And then you need to factor in how soon their results are needed
> (the available parallelism, compare it against bcrypt's four S-box
> lookups) and the total size of your equivalent of bcrypt S-boxes
> (if 4x larger, then you probably have 4x more room for parallelism
> while staying the same as bcrypt in terms of GPU resistance).

I just looked at bcrypt again (your OpenWall version).  If I'm reading
this accurately, it does 64 pseudo-random reads of 4 bytes each in
each call to BF_ENCRYPT.  I don't see any writes to the S-boxes during
the compute intensive phase.  Is that right?  I think a round should
be doable in a clock cycle without any memory accesses using an async
SRAM, or possibly even synthesizing the S-boxes to gates.

How fast does bcrypt do random reads?

> As I had mentioned, the results of such testing were one of my
> reasons to keep yescrypt's default pwxform rounds count no lower
> than 6, or alternatively I'd need to reduce instruction-level
> parallelism.  I think you did not include such criteria when
> deciding on the amount of work to do along with the main arena
> reads/writes in TwoCats.  You simply wanted to maximize your memory
> bandwidth usage, well and the number of sequential MULs, which is
> related, but not exactly it.

I did simply maximize bandwidth.  I did not worry about keeping the
SIMD unit's pipeline full.  The penalty for CPUs without a SIMD unit
would be too high, IMO.

SinnyCat, which is always single-threaded, hashes 1GiB in 0.23s, using
no SIMD instructions at all (-m64 -mtune=generic).  TwoCats should run
well on low-end CPUs.

Version: GnuPG v1


Powered by blists - more mailing lists