Message-ID: <CAOLP8p7vkZNexRdO6HK0sjCsKJ6sE-FAm+Sshxox_hK88unaPA@mail.gmail.com>
Date: Tue, 11 Mar 2014 05:24:26 -0400
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply latency reduction via table lookups
On Tue, Mar 11, 2014 at 12:34 AM, Solar Designer <solar@...nwall.com> wrote:
> Is the "writing back" part essential? What if the writes are many times
> or even orders of magnitude less frequent than reads? Isn't it
> sufficient to ensure the data is dependent on the password and salt (so
> must be stored in a RAM rather than a ROM)? Frequent writes would hurt
> performance on CPUs with write-through L1 cache, including Bulldozer.
> Another drawback of frequent writes is that some of them may eventually
> propagate to main memory even on systems with write-back caches, which
> would waste some memory bandwidth.
I don't think writes are needed very often. In my repeat loop, I read
the same 2 blocks and hash them over and over, but only write the
result back on the last iteration.
>> It would have to be done carefully, since we also want to write to as
>> much memory as possible to maximize time*memory cost for an attacker,
>> so somehow we'd need to figure out a strategy for hammering L1 cache
>> without cache miss penalties while at the same time filling external
>> DRAM at the bandwidth it can support. The multiplications seem to
>> happen in parallel on the scalar unit, while memory hashing can happen
>> in the SIMD unit. Is it possible to do something similar with cache
>> vs external DRAM hashing?
>
> Yes, escrypt tries to do that. I thought TigerKDF did too.
>
>> I don't know how to do both at the same time... Alexander? This seems
>> like your kind of thing.
>
> The L1 cache reads may be made from the sub-block processing function.
> This may be happening while further sub-blocks are being prefetched (the
> prefetch instructions are issued when processing of a new block starts).
> This is what escrypt does.
I've played with prefetch instructions but never got any speed boost
from them. However, I never tried them while processing the current
block more than once. Also, I don't know the next "from" block
address in the second loop until the end of hashing the current block.
I could try using a value generated near, but not at, the end of
block hashing to pick the next block to prefetch.
> BTW, with blocks as large as what you use (multiple KiB rather than up
> to 1 KiB as I normally test escrypt with), the number of pending
> prefetches would definitely exceed what's supported in hardware.
> I don't know what exactly happens then (are some prefetch instructions
> simply ignored? or are there CPUs where they'd block?) I tested this
> briefly, and somehow didn't run into any significant performance
> penalty, but I think there must be room for optimization there (this may
> be a reason why I am not getting much of a speedup for larger r) - e.g.,
> for r > 8, we could be doing the prefetches in groups of 16 (or even
> fewer?) 64-byte sub-blocks, just a few sub-blocks before they're
> processed. I haven't tried that yet, focusing on r <= 8 for other
> reasons I mentioned (in other postings in here).
>
> Alexander
This could help explain why prefetch doesn't seem to help my speed at
all. I sometimes test with blocks smaller than 1 KiB, but tigerphs
performance tanks so rapidly as I decrease block size that I quickly
go back to 4 KiB or larger. It sounds like prefetch makes a
bigger difference for small blocks.
I should play around with this a bit.
Bill