lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <20131116061744.GC6367@openwall.com> Date: Sat, 16 Nov 2013 10:17:44 +0400 From: Solar Designer <solar@...nwall.com> To: Daniel Franke <dfoxfranke@...il.com> Cc: discussions@...sword-hashing.net Subject: Re: [PHC] The EARWORM password hash Daniel, On Sun, Aug 11, 2013 at 11:09:40PM -0400, Daniel Franke wrote: > EARWORM is a password hash designed to be bound by memory bandwidth. > It is inspired somewhat by Solar Designer's concept of "ROM-port > hardness", and has in common the benefit that it can demand an almost > arbitrarily large amount of (read-only) memory for an almost > arbitrarily small amount of time. This should make it an appealing > choice for busy web servers that need to process a high volume of > login requests at low latency. I am happy to hear that you're experimenting with this. We're going to make a PHC submission built around this concept and some other ideas as well (yes, its complexity is a drawback, so we'll need cut-down special-purpose versions as well). > EARWORM has some drawbacks that will likely make it unappealing for > most other use cases. Using EARWORM only makes sense if its ROM is > kept resident in memory most or all of the time, because for most > reasonable choices of time and space parameters, loading the ROM from > disk will take a lot longer than the hash computation itself. Hogging > a big chunk of RAM 24x7 is fine if it's actually in use for most of > that time, but it's an undue burden if you're using it twice a day to > hash your laptop password. Right. How do you propose the ROM be initialized? Is this within scope of EARWORM itself or totally external to it? This ROM initialization step for our password hashing scheme involves some non-trivial tradeoffs. One approach is to fill a disk file with random data (and discard/wipe the CSPRNG seed, if one existed) and have it read into memory each time the authentication service starts up. Unfortunately, at 100+ GB sizes (and this is what our current target is, given modern servers' affordable RAM sizes vs. botnet nodes' RAM sizes), reading this from disk will be taking a while. Even with SSDs this is several minutes, and it may be preferable to avoid the cost of SSDs in those servers - better to have more CPU and RAM - unless there's an even larger ROM-on-SSD, as I had proposed too. Another approach is to fill the would-be-ROM with data generated from a seed (a local parameter) on authentication service start up. While this can be done quicker than reading from disk, the resulting data is susceptible to efficient TMTO attacks. For a trivial example, if we use the exact same algorithm as scrypt SMix's first loop does, then the same kind of TMTO would apply to the ROM (an attacker with less memory would store every Nth element and would recompute the rest on the fly). In fact, this scrypt-like TMTO not only lets an attacker use less memory efficiently, but it also reduces the area-time cost for ASICs (also relevant e.g. for GPUs), albeit by up to a small constant factor. What's worse, the TMTO is even more relevant for memory bandwidth bound schemes than for scrypt, because it frees up memory bandwidth, thereby allowing for more parallel cores to access the same ROM. That's smaller ROM _and_ more parallel cores - it's a clear win for the attacker. Thus, we need a TMTO-discouraging way to initialize the ROM, yet we need it to run reasonably quickly. This is a security vs. service startup time tradeoff. One of its aspects is parallelism available during ROM initialization - or lack thereof. If we split the ROM into smaller chunks, we can have multiple threads work on the initialization independently, but this may also benefit the kind of attacker who recomputes/discards portions of the ROM in a smaller RAM (such as on a botnet node) whenever needed (during a password hash computation). Another case where initializing the ROM by (fairly large) chunks is helpful is being able to upgrade to a larger ROM size (after server upgrade) for newly set passwords while preserving the initial portion of the ROM (so that it can be used for computing both old and new hashes). This may similarly benefit some attackers too, so is a tradeoff. One approach to partially mitigate this drawback is to have newer chunks (with higher offsets) depend on data in older chunks - but unfortunately not vice versa since the newer chunks couldn't(?) have existed when the older ones were computed and put to use before the server upgrade. Or could they? Sort of yes: by using a TMTO during ROM initialization on the smaller server, in anticipation of a possible upgrade. These are some of the challenges we're going through for Openwall's upcoming PHC submission - although I guess we could just keep this aspect beyond scope of PHC (the reference implementation interface does not allow/require the initialization logic to be included). ;-) > EARWORM is designed to take advantage of Intel AES-NI instructions, > and takes a steep performance hit if they're not available. This makes > it a poor choice for use on mobile devices, or in sandboxed > environments (e.g. web browsers) that do not allow native code > execution. Have you measured how steep this performance hit is? How are you handling this case in your source code? For example, do you have drop-in replacements for AES-NI intrinsics for use on machines without such instructions? We're considering AES-NI too. A drawback it has is lack of scalability beyond 128-bit vector width. It's essentially 128-bit scalar. Yes, EARWORM has lots of parallelism around AES-NI (which is OK, unlike in scrypt, because EARWORM is memory bandwidth bound rather than memory-hard) - but near-future CPUs with AVX-512 are not expected to offer any SIMD form of AES-NI, so EARWORM won't run on them any faster, whereas a SIMD-scalable alternative could run 4x faster (unless, of course, it'd hit the [also upgraded] memory bandwidth sooner). > In spite of these drawbacks, I think EARWORM will fill its chosen > niche more effectively than anything else proposed to date. "To date", yes. We've been working to change that. ;-) EARWORM is focused only on memory bandwidth (the scarce resource) and high parallelism (the threats model permits to use lots of parallelism). We're trying to get the best of both worlds, referring to scrypt-like sequential memory-hardness and ROM port hardness: http://www.openwall.com/presentations/ZeroNights2012-New-In-Password-Hashing/mgp00009.html So we'll have this hybrid, and it'll likely exploit (for defense) other properties of the memory subsystem as well (beyond what was mentioned in the ZeroNights presentation). The drawback is complexity. I think that for the places where this is to be deployed with the full functionality (including the ROM), this level of complexity is OK - but it is great that there will be the (likely simpler) EARWORM as well. (For some other places, such as crypt(3) to be put into a Unix system's libc or libcrypt, something simpler is desirable - but EARWORM is not suitable for that anyway, so not comparable.) I'll over-quote a little since this thread I am replying to is rather old: > Anyway, without further ado, here is a pseudocode description of > EARWORM. > > || denotes concatenation. > > The function > > prf(secret,salt,outlen) > > is PBKDF2-SHA256-HMAC with an iteration count of 1. outlen is in bytes. > > The function > > aesround(block,roundkey) > > is the round function for AES encryption. The Intel AESENC instruction > computes precisely this function. It is curious that Colin chose to use Salsa20 with 8 rounds in essentially the same place, whereas you're using just one round of AES. What risks (or other drawbacks) can there be if this transform is too simple? > CHUNK_LENGTH, CHUNK_WIDTH, and WORKUNIT_DEPTH are integer constants, > provisionally defined to be 128, 4, and 256 respectively. > > The function > > workunit(secret,salt,m_cost,arena,outlen) > > takes the following inputs: > > * 'secret' is the password being hashed, considered as an octet > string of variable length. > > * 'salt' is another octet string of variable length. > > * 'm_cost' is the memory cost parameter. > > * 'arena' is a read-only three-dimensional array of 128-bit words, > with dimensions (2**m_cost) by CHUNK_LENGTH by CHUNK_WIDTH. > > * 'outlen' is the desired output length. > > The arena's entire contents are random, initialized once by the site > administrator and then reuseable indefinitely. > > workunit() maintains the following state variables: > > * 'scratchpad' is a one-dimensional array of CHUNK_WIDTH 128-bit words. > > * 'index_a' and 'index_b' are integers. > > Its body is as follows: > > begin workunit(secret,salt,m_cost,arena,outlen) > > index_a || index_b := prf(secret, 0x00 || salt, 32) > index_a := mod(index_a, 2**m_cost) > index_b := mod(index_b, 2**m_cost) > > scratchpad[0 .. (CHUNK_WIDTH - 1)] := > prf(secret, 0x01 || salt, 16 * CHUNK_WIDTH) > > for d from 0 to (WORKUNIT_DEPTH / 2) > > for l from 0 to (CHUNK_LENGTH - 1) > for w from 0 to (CHUNK_WIDTH - 1) > scratchpad[w] := aesround(scratchpad[w], arena[index_a][l][w]) > end for > end for > > index_a := mod(scratchpad[0], 2**m_cost) > > for l from 0 to (CHUNK_LENGTH - 1) > for w from 0 to (CHUNK_WIDTH - 1) > scratchpad[w] := aesround(scratchpad[w], arena[index_b][l][w]) > end for > end for > > index_b := mod(scratchpad[0], 2**m_cost) > > end for > > return prf(scratchpad[0 .. (CHUNK_WIDTH - 1)], 0x02 || salt, outlen) > end workunit The above has what we'd consider a flaw under our threat model (attackers with many smaller memory machines), but you might not (you mentioned using this with only 256 MiB anyway). scratchpad[] elements beyond index 0 are only needed for the final prf() computation, so instead of working on the full arena[] under each workunit, an implementation could efficiently work on portions of the (re-structured) arena[] on a machine with a smaller amount of fast memory (and swap those portions outside of the "for d from 0 to (WORKUNIT_DEPTH / 2)" loop). > The EARWORM hash function is then computed by calling workunit() > t_cost times with distinct salts, and XORing together the > outputs. (Hence, workunits can be computed in parallel). > > The structure of the workunit function provides opportunities for some > nice optimizations. Presuming that the length/width/depth constants > are known at compile-time, all loops can be unrolled, and all the > function's mutable state (scratchpad, index_a, and index_b) can be > held in registers. Therefore, the 'l' and 'w' loops can be optimized > down to nothing but a long series of AESENC instructions. Since > index_a and index_b can be computed right after one 'l' loop but then > are not used until the after the completion of the next one, chunks > of ROM can be prefetched and be warm in L1 cache by the time they're > needed. Have you experimented with such prefetching (using the corresponding CPU instructions)? Similarly, have you tried removing the second index variable (and thus the ability to prefetch this much in advance) - does it affect performance on current CPUs or does it not matter (yet)? > Experimentation on my Opteron 6376 workstation shows that computing a > single workunit takes about 350 microseconds, and that two CPU cores > are sufficient to saturate my memory bandwidth (PC3-12800 ECC, > quad-channel). I'd rather avoid saturating the memory bandwidth with only two threads, because this means you're not making full use of your CPU's execution units (even those few that you do make use of at all). Yes, this is OK per your threats model (unlike per scrypt's), yet if we can do a little bit better, why not do it. OK, maybe you're trying to ensure you'd also almost saturate the memory bandwidth with the same settings on a machine without AES-NI? On a related note, you're invited to use our dev box for your testing and benchmarks: http://openwall.info/wiki/HPC/Village (I am now experimenting with a 112 GB "ROM" on this 128 GB RAM machine). Thanks, Alexander
Powered by blists - more mailing lists