[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAELGc4VWiwk3v0FTG_uQMakPreqEbk_c9d+TLyhdQ7iNcLsN6A@mail.gmail.com>
Date: Fri, 17 Oct 2014 19:03:28 +0800
From: Hongjun Wu <wuhongjun@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] POMELO
Hi Alexander,
Thank you for the analysis and comments!
1)  In your mail, "POMELO claims to be resistant to SIMD attacks. First, I
think we need to clarify: this may only apply if the hardware lacks
efficient gather"
      Be more accurate, POMELO is resistant to SIMD attacks if the hardware
lacks efficient gather and scatter.
      Fortunately (or unfortunately), currently there is no efficient
gather/scatter for random data accesses (according to what I know. Maybe I
am wrong here).   On the GPUs with dozens of parallel data processor, the
gather/scatter for random data access has very long latency.  Even on the
latest Haswell CPU, it seems that the gather/scatter is implemented in
sequential rather than in parallel (taking around 10 cycles for gather the
data in L2 cache).   It is reasonable since parallel gather/scatter for
random data access requires much more extra circuits for accessing memory.
2)  In your email, "I think POMELO partially fails at anti-SIMD, despite of
claims in its specification document.  It defeats SIMD for defense (if we
don't consider hashing multiple passwords within one instance, which is
cumbersome and is a risk), but not as much for attack. "
     You are right here.  This partial failure affects the security.  But I
think that the effect on security is also partial.  The random data
read/write in H remains the bottleneck, and it cannot be easily bypassed in
SIMD, especially on GPU.
3)  I think that your tradeoff (one table lookup in every step) is
interesting.  That is my old design before optimization.  As you have
noticed, this tradeoff has some advantage, also some disadvantage.
     The rationale for the current POMELO version (one table lookup in 4
steps) is to achieve certain balance between table lookup and the
computation of F when the data is not in L3 cache (for the current desktop
CPUs).
      (The best optimization is to set different parameters for different
table sizes.  But the design would become too complicated, and inconvenient
to the users.)
Best Regards,
Hongjun
On Fri, Oct 17, 2014 at 3:17 PM, Solar Designer <solar@...nwall.com> wrote:
> Hi,
>
> I first brought this up as part of a broader discussion of PHC
> candidates on the panel list earlier this month, and I think it's my
> duty to post it to the public discussions list.  So here goes.
> (I am reusing pieces from two of my postings to the panel list, and
> adding some more content/clarifications here.)
>
> POMELO claims to be resistant to SIMD attacks.  First, I think we need
> to clarify: this may only apply if the hardware lacks efficient gather
> loads - a detail which may be obvious, but is somehow missing from the
> author's analysis.  A lot of current and planned general-purpose chips
> (AVX2 and AVX-512 capable CPUs, Xeon Phi, many GPUs) do have gather
> loads (of varying efficiency), so I think this must not be omitted.
> In the below paragraph, I assume that hardware does lack efficient
> gather loads:
>
> I think POMELO partially fails at anti-SIMD, despite of claims in its
> specification document.  It defeats SIMD for defense (if we don't
> consider hashing multiple passwords within one instance, which is
> cumbersome and is a risk), but not as much for attack.  In fact, I think
> there's a tradeoff here as it relates to possible POMELO tweaks (if
> limited to the currently hard-coded parameters): it can defeat SIMD more
> fully, but become even less suitable (slower) for large memory hashing.
>
> Samuel Neves confirmed the understanding that "POMELO does seem to allow
> some vectorization" (_not_ relying on gather loads), and asked for more
> detail on the tweak I had in mind.  Here it is:
>
> I meant simply making those data-dependent accesses in H more frequent,
> by removing the "i mod 4 == 3" condition or making this a tunable
> parameter ("i mod freq == (freq - 1)" or "i mod freq == 0").  Maybe G
> should have the same tweak applied, or maybe not (or maybe it should
> have its access frequency tunable separately).
>
> Comparison of POMELO tweaked as above against bcrypt and other schemes,
> not limited to the SIMD aspect (will also apply for MIMD, as long as
> global memory is accessed and behaves similar to how it does on current
> CPUs and GPUs):
>
> Since F takes roughly as much time as a Blowfish round does (but F lacks
> data-dependent accesses), doing one data-dependent access per H
> invocation (rather than one per four H invocations, as it is done now)
> means that step 6 will perform these accesses "only" ~4 times less
> frequently than bcrypt does (as opposed to ~16 times less frequently, as
> is the case now), but at the same amount of parallelism that bcrypt has
> (since the four Blowfish S-box lookups can be performed in parallel).
> As a result, step 6 will become about as local memory attack resistant
> as bcrypt is, but still ~4x less resistant (than bcrypt) to global
> memory attacks (as long as the memory usage per instance is low enough
> that we don't bump into total global memory size).  Since a half of the
> total running time will be spent in step 4, overall POMELO with this
> tweak and at bcrypt-like defensive memory usage might be ~2x weaker than
> bcrypt at local memory attacks and ~8x at global memory attacks.  Of
> course, it is meant to be used at higher m_cost - in fact, m_cost = 0
> corresponds to 8 KiB, which is twice the bcrypt's size.  8 KiB should
> actually work fine on typical defenders' systems, but once we exceed L1
> cache size, POMELO with this tweak should see performance impact similar
> to what we see with Pufferfish - well, maybe up to ~4x lower impact,
> though, because the data-dependent lookups are ~4x less frequent.  It
> will take some benchmarking to see at what m_cost settings such tweaked
> POMELO is fast enough compared to other PHC candidates.  Not tweaking G
> or otherwise making its access frequency much lower than H's may help
> here (making step 4 take considerably less time than step 6 does).
>
> One aspect I have not considered in the analysis above is that POMELO's
> data-dependent accesses in H are read-writes, as opposed to bcrypt's
> pure reads.  This may make POMELO up to 2x more resistant to some of the
> attacks above (where it'd mean twice more accesses over the memory bus),
> but it may also mean that it'd run somewhat slower defensively than it
> would with pure reads.  Yet I think these read-writes are a good idea
> at least when POMELO is used at low m_cost (corresponding to L1 cache).
>
> Overall, I like POMELO's simplicity and it trying to work well over a
> wide range of possible m_cost settings, yet I think this combination (of
> simplicity and wide range) makes it sub-optimal at most settings.
>
> Alexander
>
Content of type "text/html" skipped
Powered by blists - more mailing lists
 
