lists.openwall.net  lists / announce owlusers owldev johnusers johndev passwdqcusers yescrypt popa3dusers / osssecurity kernelhardening musl sabotage tlsify passwords / cryptdev xvendor / Bugtraq FullDisclosure linuxkernel linuxnetdev linuxext4 linuxhardening linuxcveannounce PHC  
Open Source and information security mailing list archives
 

MessageID: <20140621210654.GA23087@openwall.com> Date: Sun, 22 Jun 2014 01:06:54 +0400 From: Solar Designer <solar@...nwall.com> To: discussions@...swordhashing.net Subject: Re: [PHC] TwoCats multiplication chain Bill, all  Here's one more idea I'd like to have recorded in here, and it's about multiplication latency hardening in general, not TwoCats specifically. An attack which I think we haven't mentioned yet is speculative computation of multiple possibilities of the next MUL before a previous MUL completes. For a simplified example, suppose there's exactly one MUL output bit that limits the overall speed (it arrives late and/or is needed early, overall more so than the rest of bits do/are). We can trivially eliminate this bottleneck, exposing whatever next bottleneck there may be, by starting to compute the next MUL for both possible values of this one bit before its actual value is known. Ditto for the next bottleneck bit, and so on. To do it for e.g. 10 bits at once, we need 1024 multipliers  this is likely still affordable in terms of die area (it may be less than the die area we're forcing the attacker to spend on memory), but it might or might not be affordable in terms of routing and extra MUXes (which introduce delays, too). So this technique is of limited use, and is mitigated by having more highlatency output and earlyneeded input bits than are efficiently eliminated with speculative computation. The existence of this technique, however, might affect which MUL output bits are optimally fed (passing through another operation first) into which next MUL input bits. I commented on which MUL output bits to use in the message quoted below, and the above is one more factor to consider... making the determination even more difficult to make with any sort of confidence. For now, all we have are educated guesses, and we have our gut feeling that a possible speedup is limited to ~2x as you say. (That's not including additional speedup from not needing to spend a cycle on XOR, shift, etc., which we do spend on CPU. Maybe a future cryptofriendlier CPU should deliver a suitable primitive that would execute in exactly as many clock cycles as a MUL alone does.) Alexander I don't normally bottomquote, but this time I did it on purpose: On Tue, Mar 25, 2014 at 03:16:49PM 0400, Bill Cox wrote: > On Tue, Mar 25, 2014 at 1:01 PM, Solar Designer <solar@...nwall.com> wrote: > >> There is a multiplication poisoning issue, but with the 4 variables, I > >> have not seen it occur after a billion iterations, so the percentage > >> of blocks that would have poisoned chains is tiny. The poisoning > >> would occur if a and c both become 0 or 1 at the same time. > >> Otherwise, if only a or c is 0 or 1, changes in b and d should kick > >> them back into higher values. > > > > Is your use of four variables intended specifically to reduce the > > frequency of such poisoning, or were there other reasons as well? > > It's mainly for the poisoning issue. With just two variables, one of > them would become 1 pretty often. It also gave me a chance to throw > addition into the mix. > > >> Will this do? > > > > Probably. > > > > Here's another thought I had, but didn't mention yet: what if the > > attacker makes their multiplier circuit such that it's not constant > > time, but rather that it may complete a cycle earlier in case the final > > carry chain is interrupted (has 0's) in places allowing for such > > speedup? If sums of partial products for some of the output bits are > > zero, then higher output bits don't depend on sums of partial products > > for any bits lower than that. This means that e.g. bit 63, which > > potentially depends on everything due to this final carry, may in fact > > be computed very quickly most of the time. > > I'm not following this logic. About half of the final output bits > (the upper 32  I discard the lower 32) will be zero, and the higher > output bits still depend on all the inputs. In a carrysave > multiplier, the carries propagate in all the partial products, and can > influence the next column in any of them. > > There are a ton of ways to speed up a multiplier. I know about Booth > encoding because I used it once, but I don't know what's used today in > an Intel processor. Mathematically, there are tons of opportunities > to reduce far below the 2X savings we get with Booth encoding, but > having an efficient layout is important to reduce delays, so most of > the opportunities are not used, IIRC. > > > Higher 32 bits are probably in fact "harder" to speed up in hardware > > than the lower 32, as you wrote, but once someone went for the effort of > > this variabletime multiplier circuit, maybe it'd actually run faster > > than it could otherwise? > > I'm concerned there is still an opportunity to speed up a chain of > multiplications and XOR by up to a 2Xish factor, but not knowing much > about modern multiplier design, I'm not sure. > > For example, if we built a carrysave multiplier (which no one does), > we could pipeline along the diagonal, rather than between multipliers. > Normally, the critical path in a carrysave multiplier is 64 > carrysave units long. If we pipeline from the upper left to lower > right through the middle of the multiplier, then the distance between > flops seems to be 32 rather than 64. > > However, other optimization techniques work against this speedup. For > example, Booth encoding reduces it to roughly a 16x33 array of carry > save adder cells, saving almost 2X in area, and the adders are often > carrylookahead or similar. In this case, there's not much to be > gained by pipelining within the multiplier. > > > I've been thinking that maybe the middle 32 bits (16 to 47) are actually > > the slowest to compute, even in a variabletime optimized circuit, but I > > could be wrong. This is more of a gut feeling than careful analysis. > > And on 32bit archs this doubleregister shift may be costly. On 32bit > > x86, right now you simply pick EDX instead of EAX, but if going for bits > > 16 to 47 you'd have a SHRD in there, wasting a cycle and thereby > > probably killing any advantage this would provide. On ARM, I think it'd > > be even worse: LSR and ORwithshift, meaning two cycles wasted. (But > > I'm not familiar with ARM. I merely looked these up in arch reference. > > Maybe there's a quicker way I overlooked.) It's on 64bit and on SIMD > > with 64bit vector elements that we incur a similar shift anyway, and > > can adjust the shift count arbitrarily (be it 32 or 16). > > I'm pretty confident the upper bits are the slowest to compute. > However, my multiplication function has asymmetrical timing on the > inputs. One input is valid long before the other, and an attacker > could design a multiplier optimized to be faster for one input than > the other. However, I'm not sure how to do that once past the first > partial product, because all the outputs toggle based on the late > arriving input. > > > Another consideration in deciding between >> 32 vs. >> 16 vs. >> 20 (or > > 21?) is that shifts by up to 20 (or 21?) only use multiplication result > > bits computable in doubleprecision floatingpoint, making this more > > JavaScriptfriendly and maybe friendlier to some GPUs (such as highend > > NVIDIAs). This has both pros and cons. Since I do currently have an > > arbitrary shift in a 64bit value anyway (unlike you), this change is > > within consideration for me (whereas it'd cost 1+ cycle of overhead for > > you, so probably not worth it given that it has both pros and cons). > > Maybe this shift count may be a parameter (choice between 32 and 20?) > > > > I think you shouldn't make further changes to this now unless a major > > problem is found. I'm just rambling. I'd be interested in your > > comments, though, as this might affect my work on escrypt. > > I agree. It's time to finalize the algorithms. > > Bill
Powered by blists  more mailing lists