Date: Sun, 22 Jun 2014 01:06:54 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] TwoCats multiplication chain

Bill, all -

Here's one more idea I'd like to have recorded here, and it's about
multiplication latency hardening in general, not TwoCats specifically.

An attack which I think we haven't mentioned yet is speculative
computation of the next MUL for multiple possible input values before
the previous MUL completes.

For a simplified example, suppose there's exactly one MUL output bit
that limits the overall speed (it arrives late and/or is needed early,
more so than the rest of the bits).  We can trivially eliminate this
bottleneck, exposing whatever next bottleneck there may be, by starting
to compute the next MUL for both possible values of this one bit before
its actual value is known.  Ditto for the next bottleneck bit, and so
on.  To do it for e.g. 10 bits at once, we need 1024 multipliers - this
is likely still affordable in terms of die area (it may be less than
the die area we're forcing the attacker to spend on memory), but it
might or might not be affordable in terms of routing and extra MUXes
(which introduce delays, too).

So this technique is of limited use, and is mitigated by having more
high-latency output bits and early-needed input bits than can be
efficiently eliminated with speculative computation.

The existence of this technique, however, might affect which MUL output
bits are optimally fed (passing through another operation first) into
which next MUL input bits.

I commented on which MUL output bits to use in the message quoted
below, and the above is one more factor to consider... making that
determination even harder to make with any sort of confidence.  For
now, all we have are educated guesses, and our gut feeling is that the
possible speedup is limited to ~2x, as you say.  (That's not counting
the additional speedup from not needing to spend a cycle on XOR, shift,
etc., which we do spend on CPU.  Maybe a future crypto-friendlier CPU
should provide a suitable primitive that would execute in exactly as
many clock cycles as a MUL alone does.)
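
For reference, the shape of chain we're discussing is roughly this - a
generic sketch, not TwoCats' or escrypt's actual round function.  Each
iteration pays the MUL latency plus a cycle or two for the XOR, shift
and add, which is exactly the overhead a fused primitive (or a
dedicated circuit) would avoid:

#include <stddef.h>
#include <stdint.h>

/* Generic multiplication latency chain (illustrative only).  Each
 * step's MUL inputs depend on the previous step's full result, so MUL
 * i+1 cannot start before MUL i completes - except speculatively. */
uint64_t mul_chain(const uint32_t *data, size_t len)
{
    uint64_t x = 0x0123456789abcdefULL;
    for (size_t i = 0; i < len; i++) {
        uint64_t p = (uint64_t)(uint32_t)x * (uint32_t)(x >> 32);
        x = (p ^ data[i]) + (x >> 32); /* the non-MUL cycles we pay */
    }
    return x;
}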

Alexander

I don't normally bottom-quote, but this time I did it on purpose:

On Tue, Mar 25, 2014 at 03:16:49PM -0400, Bill Cox wrote:
> On Tue, Mar 25, 2014 at 1:01 PM, Solar Designer <solar@...nwall.com> wrote:
> >> There is a multiplication poisoning issue, but with the 4 variables, I
> >> have not seen it occur after a billion iterations, so the percentage
> >> of blocks that would have poisoned chains is tiny.  The poisoning
> >> would occur if a and c both become 0 or 1 at the same time.
> >> Otherwise, if only a or c is 0 or 1, changes in b and d should kick
> >> them back into higher values.
> >
> > Is your use of four variables intended specifically to reduce the
> > frequency of such poisoning, or were there other reasons as well?
> 
> It's mainly for the poisoning issue.  With just two variables, one of
> them would become 1 pretty often.  It also gave me a chance to throw
> addition into the mix.
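
For illustration, a chain of this general shape - hypothetical, not
TwoCats' actual round function - might look like:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-variable chain of the kind Bill describes (NOT
 * TwoCats' actual code).  a and c carry the multiplication chains,
 * b and d mix in addition and fresh data.  A step is "poisoned" when
 * a and c are both 0 or 1, making the MUL trivial; otherwise the
 * additions of b and d soon kick a stuck variable back up. */
void chain4(const uint32_t *data, size_t len, uint32_t v[4])
{
    uint32_t a = v[0], b = v[1], c = v[2], d = v[3];
    for (size_t i = 0; i + 1 < len; i += 2) {
        a = a * c + b;  /* trivial MUL only if a and c are both 0 or 1 */
        b += data[i];
        c = c * a + d;
        d += data[i + 1];
    }
    v[0] = a; v[1] = b; v[2] = c; v[3] = d;
}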
> 
> >> Will this do?
> >
> > Probably.
> >
> > Here's another thought I had, but didn't mention yet: what if the
> > attacker makes their multiplier circuit such that it's not constant
> > time, but rather that it may complete a cycle earlier in case the final
> > carry chain is interrupted (has 0's) in places allowing for such
> > speedup?  If sums of partial products for some of the output bits are
> > zero, then higher output bits don't depend on sums of partial products
> > for any bits lower than that.  This means that e.g. bit 63, which
> > potentially depends on everything due to this final carry, may in fact
> > be computed very quickly most of the time.
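
To ground the terminology: a 32x32 multiplier conceptually sums one
shifted partial product per multiplicand bit, and the carries across
those sums are what the circuit-level tricks reorganize.  A trivial
software model (schoolbook - not how hardware is actually laid out):

#include <stdint.h>

/* Schoolbook 32x32->64 multiply as an explicit sum of partial
 * products.  Hardware sums these with carry-save adder trees rather
 * than sequentially, but the data dependencies are the same: an output
 * bit is known once all partial-product sums below it have settled. */
uint64_t mul_partial_products(uint32_t x, uint32_t y)
{
    uint64_t acc = 0;
    for (int i = 0; i < 32; i++)
        if ((y >> i) & 1)
            acc += (uint64_t)x << i; /* partial product for bit i of y */
    return acc;
}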
> 
> I'm not following this logic.  About half of the final output bits
> (the upper 32 - I discard the lower 32) will be zero, and the higher
> output bits still depend on all the inputs.  In a carry-save
> multiplier, the carries propagate in all the partial products, and can
> influence the next column in any of them.
> 
> There are a ton of ways to speed up a multiplier.  I know about Booth
> encoding because I used it once, but I don't know what's used today in
> an Intel processor.  Mathematically, there are tons of opportunities
> to reduce the work far beyond the 2X savings we get with Booth
> encoding, but having an efficient layout is important for reducing
> delays, so most of these opportunities are not used, IIRC.
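
As a point of reference, here's a software model of radix-4 Booth
recoding - just the arithmetic identity, not a hardware layout:

#include <stdint.h>

/* Radix-4 Booth recoding of a 32x32->64 multiply: 17 signed partial
 * products with digits in {-2,-1,0,+1,+2} instead of 32 unsigned ones
 * - the roughly 2X reduction mentioned above.  Real multipliers sum
 * these with carry-save adder trees. */
uint64_t booth_mul(uint32_t x, uint32_t y)
{
    uint64_t acc = 0;
    int prev = 0; /* y[-1] = 0 */
    for (int i = 0; i <= 32; i += 2) {
        int b0 = (int)((uint64_t)y >> i & 1);
        int b1 = (int)((uint64_t)y >> (i + 1) & 1);
        int digit = prev + b0 - 2 * b1;  /* Booth digit, -2..+2 */
        int64_t pp = (int64_t)digit * x; /* signed partial product */
        acc += (uint64_t)pp << i;        /* sum mod 2^64 is exact here */
        prev = b1;
    }
    return acc;
}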
> 
> > Higher 32 bits are probably in fact "harder" to speed up in hardware
> > than the lower 32, as you wrote, but once someone went for the effort of
> > this variable-time multiplier circuit, maybe it'd actually run faster
> > than it could otherwise?
> 
> I'm concerned there is still an opportunity to speed up a chain of
> multiplications and XOR by up to a 2X-ish factor, but not knowing much
> about modern multiplier design, I'm not sure.
> 
> For example, if we built a carry-save multiplier (which no one does),
> we could pipeline along the diagonal, rather than between multipliers.
> Normally, the critical path in a carry-save multiplier is 64
> carry-save units long.  If we pipeline from the upper left to lower
> right through the middle of the multiplier, then the distance between
> flops seems to be 32 rather than 64.
> 
> However, other optimization techniques work against this speedup.  For
> example, Booth encoding reduces it to roughly a 16x33 array of carry
> save adder cells, saving almost 2X in area, and the adders are often
> carry-look-ahead or similar.  In this case, there's not much to be
> gained by pipelining within the multiplier.
> 
> > I've been thinking that maybe the middle 32 bits (16 to 47) are actually
> > the slowest to compute, even in a variable-time optimized circuit, but I
> > could be wrong.  This is more of a gut feeling than careful analysis.
> > And on 32-bit archs this double-register shift may be costly.  On 32-bit
> > x86, right now you simply pick EDX instead of EAX, but if going for bits
> > 16 to 47 you'd have a SHRD in there, wasting a cycle and thereby
> > probably killing any advantage this would provide.  On ARM, I think it'd
> > be even worse: LSR and OR-with-shift, meaning two cycles wasted.  (But
> > I'm not familiar with ARM.  I merely looked these up in arch reference.
> > Maybe there's a quicker way I overlooked.)  It's on 64-bit and on SIMD
> > with 64-bit vector elements that we incur a similar shift anyway, and
> > can adjust the shift count arbitrarily (be it 32 or 16).
> 
> I'm pretty confident the upper bits are the slowest to compute.
> However, my multiplication function has asymmetrical timing on the
> inputs.  One input is valid long before the other, and an attacker
> could design a multiplier optimized to be faster for one input than
> the other.  However, I'm not sure how to do that once past the first
> partial product, because all the outputs toggle based on the late
> arriving input.
> 
> > Another consideration in deciding between >> 32 vs. >> 16 vs. >> 20 (or
> > 21?) is that shifts by up to 20 (or 21?) only use multiplication result
> > bits computable in double-precision floating-point, making this more
> > JavaScript-friendly and maybe friendlier to some GPUs (such as high-end
> > NVIDIAs).  This has both pros and cons.  Since I do currently have an
> > arbitrary shift in a 64-bit value anyway (unlike you), this change is
> > within consideration for me (whereas it'd cost 1+ cycle of overhead for
> > you, so probably not worth it given that it has both pros and cons).
> > Maybe this shift count could be a parameter (choice between 32 and 20?)
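
In code, the candidate windows look like this (a sketch; the register
comments refer to 32-bit x86 as discussed above):

#include <stdint.h>

/* The candidate 32-bit windows of a 64-bit product.  On 32-bit x86 the
 * product arrives in EDX:EAX, so >> 32 is free (just take EDX), while
 * >> 16 costs a SHRD.  >> 20 keeps the window within bits 20..51, all
 * below the 53-bit double-precision mantissa limit, hence
 * JavaScript-friendly. */
void product_windows(uint32_t a, uint32_t b,
    uint32_t *hi, uint32_t *mid, uint32_t *fp)
{
    uint64_t p = (uint64_t)a * b;
    *hi = (uint32_t)(p >> 32);
    *mid = (uint32_t)(p >> 16);
    *fp = (uint32_t)(p >> 20);
}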
> >
> > I think you shouldn't make further changes to this now unless a major
> > problem is found.  I'm just rambling.  I'd be interested in your
> > comments, though, as this might affect my work on escrypt.
> 
> I agree.  It's time to finalize the algorithms.
> 
> Bill
