phc-discussions - Re: [PHC] TwoCats multiplication chain

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOLP8p5U9jc6js5OEcKfkgA9JRC9kb5eXEbNwDpKBMnbJAAuOw@mail.gmail.com>
Date: Tue, 25 Mar 2014 15:16:49 -0400
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] TwoCats multiplication chain

On Tue, Mar 25, 2014 at 1:01 PM, Solar Designer <solar@...nwall.com> wrote:
>> There is a multiplication poisoning issue, but with the 4 variables, I
>> have not seen it occur after a billion iterations, so the percentage
>> of blocks that would have poisoned chains is tiny.  The poisoning
>> would occur if a and c both become 0 or 1 at the same time.
>> Otherwise, if only a or c is 0 or 1, changes in b and d should kick
>> them back into higher values.
>
> Is your use of four variables intended specifically to reduce the
> frequency of such poisoning, or were there other reasons as well?

It's mainly for the poisoning issue.  With just two variables, one of
them would become 1 pretty often.  It also gave me a chance to throw
addition into the mix.

>> Will this do?
>
> Probably.
>
> Here's another thought I had, but didn't mention yet: what if the
> attacker makes their multiplier circuit such that it's not constant
> time, but rather that it may complete a cycle earlier in case the final
> carry chain is interrupted (has 0's) in places allowing for such
> speedup?  If sums of partial products for some of the output bits are
> zero, then higher output bits don't depend on sums of partial products
> for any bits lower than that.  This means that e.g. bit 63, which
> potentially depends on everything due to this final carry, may in fact
> be computed very quickly most of the time.

I'm not following this logic.  About half of the final output bits
(the upper 32 - I discard the lower 32) will be zero, and the higher
output bits still depend on all the inputs.  In a carry-save
multiplier, the carries propagate in all the partial products, and can
influence the next column in any of them.

There are a ton of ways to speed up a multiplier.  I know about Booth
encoding because I used it once, but I don't know what's used today in
an Intel processor.  Mathematically, there are tons of opportunities
to reduce far below the 2X savings we get with Booth encoding, but
having an efficient layout is important to reduce delays, so most of
the opportunities are not used, IIRC.

> Higher 32 bits are probably in fact "harder" to speed up in hardware
> than the lower 32, as you wrote, but once someone went for the effort of
> this variable-time multiplier circuit, maybe it'd actually run faster
> than it could otherwise?

I'm concerned there is still an opportunity to speed up a chain of
multiplications and XOR by up to a 2X-ish factor, but not knowing much
about modern multiplier design, I'm not sure.

For example, if we built a carry-save multiplier (which no one does),
we could pipeline along the diagonal, rather than between multipliers.
 Normally, the critical path in a carry-save multiplier is 64
carry-save units long.  If we pipeline from the upper left to lower
right through the middle of the multiplier, then the distance between
flops seems to be 32 rather than 64.

However, other optimization techniques work against this speedup.  For
example, Booth encoding reduces it to roughly a 16x33 array of carry
save adder cells, saving almost 2X in area, and the adders are often
carry-look-ahead or similar.  In this case, there's not much to be
gained by pipelining within the multiplier.

> I've been thinking that maybe the middle 32 bits (16 to 47) are actually
> the slowest to compute, even in a variable-time optimized circuit, but I
> could be wrong.  This is more of a gut feeling than careful analysis.
> And on 32-bit archs this double-register shift may be costly.  On 32-bit
> x86, right now you simply pick EDX instead of EAX, but if going for bits
> 16 to 47 you'd have a SHRD in there, wasting a cycle and thereby
> probably killing any advantage this would provide.  On ARM, I think it'd
> be even worse: LSR and OR-with-shift, meaning two cycles wasted.  (But
> I'm not familiar with ARM.  I merely looked these up in arch reference.
> Maybe there's a quicker way I overlooked.)  It's on 64-bit and on SIMD
> with 64-bit vector elements that we incur a similar shift anyway, and
> can adjust the shift count arbitrarily (be it 32 or 16).

I'm pretty confident the upper bits are the slowest to compute.
However, my multiplication function has asymmetrical timing on the
inputs.  One input is valid long before the other, and an attacker
could design a multiplier optimized to be faster for one input than
the other.  However, I'm not sure how to do that once past the first
partial product, because all the outputs toggle based on the late
arriving input.

> Another consideration in deciding between >> 32 vs. >> 16 vs. >> 20 (or
> 21?) is that shifts by up to 20 (or 21?) only use multiplication result
> bits computable in double-precision floating-point, making this more
> JavaScript-friendly and maybe friendlier to some GPUs (such as high-end
> NVIDIAs).  This has both pros and cons.  Since I do currently have an
> arbitrary shift in a 64-bit value anyway (unlike you), this change is
> within consideration for me (whereas it'd cost 1+ cycle of overhead for
> you, so probably not worth it given that it has both pros and cons).
> Maybe this shift count may be a parameter (choice between 32 and 20?)
>
> I think you shouldn't make further changes to this now unless a major
> problem is found.  I'm just rambling.  I'd be interested in your
> comments, though, as this might affect my work on escrypt.

I agree.  It's time to finalize the algorithms.

Bill