Message-ID: <5522AFC0.3040502@larc.usp.br>
Date: Mon, 06 Apr 2015 13:09:36 -0300
From: Marcos Simplicio <mjunior@...c.usp.br>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Compute time hardness (pwxform,blake,blamka)

Hi again. Apologies for the delay: I had almost no Internet access this weekend.

On 03-Apr-15 08:12, Solar Designer wrote:
> Hi Marcos,
>
> Thank you for running these benchmarks. They're very helpful.
>
> On Thu, Apr 02, 2015 at 06:22:38PM -0300, Marcos Simplicio wrote:
>> - For 1 thread ("p1"), when both yescrypt and Lyra2 have the exact same
>> number of passes through memory (Lyra2 with T=1 and yescrypt with T=2)
>> and use an underlying hash with similar compute-time hardening (12 MULs,
>> 12 ADDs and 12 XORs given by 1.5 BlaMka and pwxform), then the lines
>> almost overlap; so I believe we are near to a draw.
>
> This is nice to know. It's kind of a sanity check for both designs.
>
> There's a difference, though: I guess yescrypt's pwxform S-box lookups
> are much more frequent than Lyra2's equivalent (is there one?)

No, there is no table look-up or anything of the sort in any Blake variant I'm aware of.

> If so,
> yescrypt actually performs more attack-discouraging work in the same
> amount of time here. Can you compare Lyra2 and yescrypt in this respect?

Well, I must confess I did not study pwxform in detail, so I'm not sure I can provide a fair comparison (everything I said about pwxform so far was based on what you said about it in terms of number of MULs :) ).

OTOH, what I can say is actually a question: the S-box lookups are attack-discouraging when you look at GPUs, right? At least that is what I grasped from previous discussions on pwxform:

- Multiplications are there to limit speed-ups on ASICs, since CPUs have highly optimized hardware for this task (I'm not sure about GPUs).

- Look-up tables (LUTs) are GPU-unfriendly operations, at least in theory, but are in principle very fast in hardware.

So far we were discussing hardware-oriented protection, so LUTs would not count for much in the comparison. Of course, I may be missing something!
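For reference, since the MUL counts keep coming up in this comparison: BlaMka's compute-time hardening comes from replacing each plain word addition in Blake2b's G function with a multiplication-hardened one. A minimal sketch of that single operation (illustrative C, not the reference code):

    #include <stdint.h>

    /* BlaMka's latency-hardened addition: "a + b" becomes
     * "a + b + 2 * lsw(a) * lsw(b)", where lsw() is the least
     * significant 32-bit word of each operand.  The 32x32->64-bit
     * multiplication is the part that is hard to shortcut in
     * custom hardware. */
    static inline uint64_t blamka_add(uint64_t a, uint64_t b)
    {
        uint64_t lo_a = a & 0xFFFFFFFFULL;
        uint64_t lo_b = b & 0xFFFFFFFFULL;
        return a + b + 2 * (lo_a * lo_b);
    }

Chained through G's column and diagonal steps, these multiplications are essentially what the MUL counts above (e.g., the "12 MULs" for 1.5 BlaMka rounds) refer to.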
BTW: we are still trying to verify the above assumptions on GPU- and FPGA-oriented unfriendliness with actual experiments. The GPU graphs should be ready shortly (we have a GPU specialist working on that), while FPGAs should take a little longer (a fellow professor allocated a student to work on that).

> Also, I never understood why you chose two full passes through memory as
> the minimum for Lyra2. Is this arbitrary? For yescrypt, the 4/3 can be
> shown as being the optimal stopping point assuming no AT-reducing TMTO,
> so it's not arbitrary at all.

OK, that is our fault: it should be clearer in our documentation... The rationale is a little long, so please bear with me.

In regular executions with large amounts of memory, memory latency and bandwidth are expected to (and, experimentally, they do) play an important role. The result is that using much faster computation hardware in attacks may end up being of little use: it will not accelerate the attack by much, just as increasing the number of rounds 10 times on a legitimate platform does not slow down the execution 10 times either... Now, attackers could combine faster computation hardware with faster memory technology, thus taking full advantage of the former to accelerate attacks, in which case computation hardening indeed becomes an important requirement.

However, using fast memory technology for a PHS with high memory usage (e.g., 1 GB of high-speed registers) is probably unrealistic, as it would become very expensive to test a single password, let alone several passwords in parallel. It would probably be more reasonable to buy more (less expensive) hardware and get a higher password-testing throughput simply with parallel tests.

On the other hand, using fast memory is very reasonable for accelerating operations that involve only a small amount of memory. This is exactly what a legitimate user's cache is for, so maximizing cache usage on the legitimate platform is (at least from this perspective) a good approach: attackers would need at least the same amount of cache to get the same speed-ups, while having much more cache only helps if the data to be processed is in that part of memory (e.g., if the required data can be prefetched, something reasonable in deterministic, cache-timing-resistant schemes/passes, or, again, if the whole memory is a huge cache).

There is, however, another situation that involves only small amounts of memory and in which attackers can thus benefit from faster memory technology: recomputations during a TMTO attack. After all, whether the scheme/pass is deterministic or password-dependent, recomputations become password-independent as long as the attacker stores the sequence of rows required after seeing them in the original computation; those rows can thus be prefetched and stored in cache-like memory.

As a result, while the original computation of a (discarded) block V_i from a set of blocks \phi(V_i) took, say, 20\sigma (where \sigma accounts for both memory- and processing-related costs), a hardware acceleration of 10x combined with a 10x faster memory could bring the cost of recomputation down to approximately 1\sigma if \sigma is 50% memory and 50% processing; parallelism and pipelining, if possible, may allow recomputations to occur even faster. The lesson here is that, even though at first sight the time*memory cost would appear to be around 20, in practice it might be much lower than that.
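Spelling out that example under one reading (an illustration only; it assumes the recomputation's now password-independent fetches can be prefetched and overlapped with the computation):

    original cost:            20\sigma = 10\sigma (memory) + 10\sigma (processing)
    10x faster memory:        10\sigma / 10 = 1\sigma
    10x faster computation:   10\sigma / 10 = 1\sigma
    with prefetch/overlap:    total ~= max(1\sigma, 1\sigma) = 1\sigma   (vs. the nominal 20\sigma)

Even without the overlap, the recomputation would cost about 2\sigma, an order of magnitude below the naive 20\sigma figure.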
That is why computing the actual time*memory cost is tricky, especially without experiments to verify it, and why I believe TMTO resistance should involve a reasonable security margin (something that Argon and Lyra2, for example, try to achieve with their minimum settings). It is more or less like Rijndael adopting 10 rounds even though the Wide Trail Strategy upon which it was built would suggest 8 rounds as enough (given that the 2 extra rounds may end up adding 5 active S-boxes to attacks, only 10% of the 50 provided by 8 rounds).

This desire for a security margin is what motivated Lyra2's t_min = 1, which can be considered "arbitrary" but not "random": it was based on the security analysis we tried to provide together with the algorithm's documentation. Based on what we have seen as effective attacks, I believe that (and, again, this is open to debate!):

1) Running Lyra2 with a 1/3 pass after the memory is filled (i.e., with "T=1/3") is probably not as TMTO-resistant as yescrypt with similar settings (i.e., with "T=0"). The reason is that Lyra2's first pass (the Setup phase) is cache-timing resistant and, thus, more vulnerable to TMTO attacks, while yescrypt's first pass trades cache-timing resistance for better TMTO resistance. The only caveat here, and the reason why I prefer to say "probably not", is that the extra writes done by Lyra2 during Setup may to some extent compensate for this lower TMTO resistance. Whether that means that Lyra2 with "T=1/3" is comparable to yescrypt with "T=0", however, I cannot tell without further studies (and, preferably, experimental results!).

2) If Lyra2 adopted a first pass that was not cache-timing resistant (i.e., if the "row^0" indices were picked in a password-dependent manner, just like in the Wandering phase), then it would probably be able to run with a lower value of T than yescrypt and still be as TMTO-resistant. The reason is that Lyra2's extra writes and reads are more likely to lead to recomputations in a TMTO attack. For example, if the attacker keeps 1/2 of all rows required (assuming they are all updated): there is only a 25% chance that both row^0 and row^1 will be in memory, so recomputations will occur 75% of the time; if recomputing any row takes from 0 to \alpha (or \alpha/2 on average), recomputing two rows (i.e., 25% of the time) takes 3\alpha/2 on average; and since at least 2 rows are updated per call to the underlying sponge, "keeping 1/2 of all rows required (assuming they are all updated)" does not mean an actual memory usage of 1/2, but that plus intermediary states. Yescrypt does have similar features, but it does not try to be as TMTO-resistant. Hence, a "Lyra2d" (borrowing Argon2's notation to indicate a non-cache-timing-resistant variant) would in principle achieve TMTO resistance faster than yescrypt.
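To make the access pattern behind items 1) and 2) concrete, here is a toy sketch of the per-call traffic being discussed (illustrative C only; the real Wandering phase, its sponge, row sizes and index derivation are considerably more involved):

    #include <stdint.h>

    #define R 1024                /* number of rows in this toy memory */

    static uint64_t M[R];         /* one word per "row", just to show the pattern */

    /* Each iteration picks two password-dependent rows, reads them plus
     * the two rows touched in the previous iteration (expected to still
     * be cached), and writes two rows back -- so an attacker who
     * discarded either picked row must recompute it before the call can
     * proceed. */
    static void wandering_toy(uint64_t state, unsigned iterations)
    {
        unsigned prev0 = 0, prev1 = 1;
        for (unsigned i = 0; i < iterations; i++) {
            unsigned row0 = (unsigned)(state % R);          /* password-dependent */
            unsigned row1 = (unsigned)((state >> 32) % R);  /* password-dependent */
            state ^= M[row0] + M[row1] + M[prev0] + M[prev1];   /* 4 reads */
            state = state * 0x9E3779B97F4A7C15ULL + 1;          /* stand-in for the sponge */
            M[row0] ^= state;                                   /* 2 writes */
            M[row1] ^= state;
            prev0 = row0;
            prev1 = row1;
        }
    }

Here M, the mixing constant and the index derivation are placeholders; only the counts of reads and writes per call are meant to match the discussion.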
Then, it *might* be fair to compare, say, Lyra2 with 7/6 passes through memory ("T=1/6") versus yescrypt with 4/3 passes through memory ("T=0"). Again, that does not necessarily tip the scale toward Lyra2, as it may very well be wasting cycles on memory-related operations trying to provide TMTO resistance, when it could be less TMTO-resistant per hashing operation and simply do more hashes. Nevertheless, this design choice was made after we observed that Lyra2 could achieve high speed even with good TMTO resistance, which also means exploiting a resource that attackers are less likely than legitimate users to have in a better form (e.g., with faster technology): memory. Without further experimentation, however, I do not feel comfortable speculating on the potential drawbacks of each approach. I can only say that we do pay a price in terms of performance for trying to provide some cache-timing resistance, while it might be better to go with Argon2's approach of separating things into "d" and "i" variants (explaining the difference to users, however, may not be very easy).

Anyhow, I do not think t_min is critical for any scheme during the competition itself: a better understanding of attacks may motivate authors to set it higher or lower, and the final decision on the minimum is probably something that needs to be discussed and sanctioned by the PHC panel together with the authors and the community in general. SHA-3's capacity is an example of such an after-competition effort: although it is probably a bad example due to the debatable choices NIST tried to push, the idea of "fixing some knobs" after a scheme is selected is not bad per se. And that is also why comparisons with a similar number of passes through memory are useful, even if not definitive to say which scheme is "the best".

> I currently recommend that yescrypt be used with t=0 unless more time
> can be spent and more memory can't be in a given case.

That is reasonable. I'm just not sure what the "security margin" is in practice (again, because computing the time*memory cost is technology-dependent). Just to avoid misunderstandings: no FUD intended here! Unless someone can come up with an effective attack, there is no reason to think yescrypt's assumptions are wrong, so t=0 may be perfectly safe.

>> - For 1 thread ("p1") with different passes through memory: say,
>> yescrypt with T=0 does 4/3 passes though memory with 12 MUL+ADD+XOR,
>> while Lyra2 with T=1 and BlaMka makes 2 passes through memory with 8
>> MUL+ADD+XOR are comparable (?) in terms of compute-time latency (4/3*12
>> = 2*8 = 16). Lyra2 in this case is slightly (3%) slower. Given the low
>> difference, the reason is probably related to bandwidth usage, because
>> Lyra2 purposely uses more bandwidth in this scenario: namely, it makes 2
>> writes per call of the underlying hash instead of 1, and 4 reads (2 from
>> cache and 2 from memory) instead of 2. I'm not sure if that is good or
>> bad, but improvements on memory bandwidth are usually slower than on
>> computation power, so I would rather say it is good in terms of attack
>> resistance, as it limits an attacker's parallelism ability.
>
> This makes sense, but you also need to factor in yescrypt's S-box lookups.

I agree this factor is likely to play a role. However, Bill reported a long time ago speed-ups of ~20% if Lyra2 reduced its number of memory writes when no parallelism is employed. We preferred not to do so in the 1-threaded version, however, because we _wanted_ attackers to run into memory bounds as soon as possible (again, because users and attackers are likely to have access to the same memory technology).

Nevertheless, that discussion and further tests motivated a bandwidth reduction for higher p, which explains the better results for p > 1: the extra cores provide both the higher bandwidth usage we wanted and similar TMTO security margins. The side effects are, however:

1) It does become memory-bound with degrees of parallelism higher than ~4.

2) More parallelism allows better performance of attacks using GPUs (e.g., the highest throughput of GPU-based attacks reported in the Reference Guide is for p = 4, the maximum p we tested).

All in all, it is quite possible that Lyra2 is not as good at using the whole budget of processing cores as schemes with lower bandwidth usage, unless further tweaking is done (outside the competition, of course).

>> - For 2 threads ("p2"), Lyra2 is faster for similar compute-time
>> hardening and also for higher compute-time hardening. That (and a very
>> old e-mail by Bill) is why we think bandwidth is likely to be at play
>> for p=1: we reduce the number of writes to 1 per thread during the
>> Wandering phase when more threads are active.
>>
>> - For 4 threads ("p4"), Lyra2 is again faster for similar compute-time
>> hardening, and also faster for higher compute-time hardening.
>
> OK. I expect yescrypt's S-box lookups come into play here. In fact,
> they also affect the 1 thread case, so I'm lucky yescrypt didn't appear
> to run slower there.
>
> What machine is this, for these benchmarks?

Sorry, I forgot to mention: it is an Intel Xeon E5-2430 with 48 GB of DRAM, running Ubuntu 14.04 LTS 64-bit. (Note: this is the same computer used for the benchmarks in our Reference Guide. The number of physical cores is 6, however, not 12 as reported there: 12 is the number of threads.)

> Would you also benchmark Lyra2 vs. yescrypt for the maximum number of
> threads supported in hardware on your machine, please?

Sure! Attached are the graphs for 1-14 threads (more than 12, just to see what happens). As mentioned above, Lyra2 gets memory-bound after ~4 cores, so there is little difference in performance between Blake2 and BlaMka with different numbers of rounds. Yescrypt appears to scale better with more cores, apparently because it does not get (much) memory-bound. We notice that the performance gains seem to be more pronounced up to the point where the maximum number of physical cores is reached, although that may be a coincidence.

One question here is: does yescrypt's design try to make attackers memory-bound too? I mean, it clearly takes better advantage of the "processing cores" budget, but cores are something attackers are likely to have easier access to than defenders, while "better memory technology" is not. This logic may be completely wrong (especially without experimental results to back it up), but, for better or for worse, it is something we considered when designing Lyra2.

> Are both Lyra2 and yescrypt
> built for the same SIMD intrinsics (and which)?

Not really: Lyra2 uses SSE2 (the "-msse2" flag), while yescrypt was built with the default makefile, with its "-march=native" flag. So, in the words of my student who ran the benchmarks, "yescrypt is running SSE4.1 and/or AVX". I will have to check in more detail to avoid saying anything wrong, but it appears we are running yescrypt with more help from SIMD instructions than Lyra2...
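To avoid guessing, a small check along these lines (illustrative; compile it once with "-msse2" and once with the default makefile's "-march=native" on the benchmark machine) shows which instruction sets each build is actually allowed to use:

    #include <stdio.h>

    /* Prints the SIMD instruction sets enabled for this compilation,
     * based on the compiler's predefined macros. */
    int main(void)
    {
    #ifdef __AVX2__
        puts("AVX2");
    #endif
    #ifdef __AVX__
        puts("AVX");
    #endif
    #ifdef __SSE4_1__
        puts("SSE4.1");
    #endif
    #ifdef __SSE2__
        puts("SSE2");
    #endif
        return 0;
    }

This only reports what the compiler is allowed to emit for that build, not which intrinsics each implementation actually uses, so it is a sanity check rather than a definitive answer.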
BTW, we have not yet used the suggestion by Samuel Neves regarding NORX tricks ("We used this trick (replacing + with ^, * with &) in NORX (pg. 23 of https://norx.io/data/norx.pdf); it does noticeably improve latency."). We will try it and see what happens.

> There's significant room for improvement for yescrypt with AVX2, which I
> haven't added yet. In fact, on pre-AVX2 CPUs the compute hardening can
> be doubled by reducing the parallelism from 512-bit to 256-bit, with
> only a moderate slowdown, so the hardening per defensive running time
> would likely be improved by 50% or so - but I preferred the defaults to
> be balanced, with newer and near future CPUs in consideration as well,
> and also with CPU attacks considered (where an attacker could bring the
> extra parallelism in from having multiple candidate passwords to test).

That sounds reasonable. One caveat, though, is that if we really are going to take into account more limited platforms (e.g., 8- or 16-bit microcontrollers) as a possible use case, then fixing pwxform as the underlying hash may not be the best approach (possibly a "pwxformLite" would be necessary).

> Thanks again,
>
> Alexander

Download attachment "Lyra2xYescrypt_p6.png" of type "image/png" (106645 bytes)
Download attachment "Lyra2xYescrypt_p8.png" of type "image/png" (97051 bytes)
Download attachment "Lyra2xYescrypt_p10.png" of type "image/png" (97303 bytes)
Download attachment "Lyra2xYescrypt_p12.png" of type "image/png" (92039 bytes)
Download attachment "Lyra2xYescrypt_p14.png" of type "image/png" (104917 bytes)
Download attachment "Lyra2xYescrypt_p1.png" of type "image/png" (153311 bytes)
Download attachment "Lyra2xYescrypt_p2.png" of type "image/png" (127943 bytes)
Download attachment "Lyra2xYescrypt_p4.png" of type "image/png" (115293 bytes)