lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date: Tue, 11 Mar 2014 05:24:26 -0400
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply latency reduction via table lookups

On Tue, Mar 11, 2014 at 12:34 AM, Solar Designer <solar@...nwall.com> wrote:
> Is the "writing back" part essential?  What if the writes are many times
> or even orders of magnitude less frequent than reads?  Isn't it
> sufficient to ensure the data is dependent on the password and salt (so
> must be stored in a RAM rather than a ROM)?  Frequent writes would hurt
> performance on CPUs with write-through L1 cache, including Bulldozer.
> Another drawback of frequent writes is that some of them may eventually
> propagate to main memory even on systems with write-back caches, which
> would waste some memory bandwidth.

I don't think writes are needed very often.  In my repeat loop, I read
the same 2 blocks and hash them over and over, but only write it in
the last iteration.

>> It would have to be done carefully, since we also want to write to as
>> much memory as possible to maximize time*memory cost for an attacker,
>> so somehow we'd need to figure out a strategy for hammering L1 cache
>> without cache miss penalties while at the same time filling external
>> DRAM at the bandwidth it can support.  The multiplications seem to
>> happen in parallel on the scalar unit, while memory hashing can happen
>> in the SIMD unit.  Is it possible to do something similar with cache
>> vs external DRAM hashing?
>
> Yes, escrypt tries to do that.  I thought TigerKDF did too.
>
>> I don't know how to do both at the same time... Alexander?  This seems
>> like your kind of thing.
>
> The L1 cache reads may be made from the sub-block processing function.
> This may be happening while further sub-blocks are being prefetched (the
> prefetch instructions are issued when processing of a new block starts).
> This is what escrypt does.

I've played with prefetch instructions but never got any speed boost
from them.  However, I never tried it while processing the current
block more than once.  Also, I don't know the next "from" block
address in the second loop until the end of hashing the current block.

I could try using a value generated near, but not at the end, of block
hashing for finding the next block to prefetch.

> BTW, with blocks as large as what you use (multiple KiB rather than up
> to 1 KiB as I normally test escrypt with), the number of pending
> prefetches would definitely exceed what's supported in hardware.
> I don't know what exactly happens then (are some prefetch instructions
> simply ignored? or are there CPUs where they'd block?)  I tested this
> briefly, and somehow didn't run into any significant performance
> penalty, but I think there must be room for optimization there (this may
> be a reason why I am not getting much of a speedup for larger r) - e.g.,
> for r > 8, we could be doing the prefetches in groups of 16 (or even
> fewer?) 64-byte sub-blocks, just a few sub-blocks before they're
> processed.  I haven't tried that yet, focusing on r <= 8 for other
> reasons I mentioned (in other postings in here).
>
> Alexander

This could help explain why prefetch doesn't seem to help my speed at
all.  I sometime test with blocks smaller than 1KiB, but tigerphs
performance tanks so rapidly as I decrease block size, I quickly
increase back to 4KiB or larger.  It sounds like prefect makes a
bigger difference for small blocks.

I should play around with this a bit.

Bill

Powered by blists - more mailing lists