phc-discussions - Re: [PHC] Combining sliding window and bit-reversal

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20140130013231.GA12613@openwall.com>
Date: Thu, 30 Jan 2014 05:32:31 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Combining sliding window and bit-reversal

On Tue, Jan 28, 2014 at 04:05:29AM +0400, Solar Designer wrote:
> On Mon, Jan 27, 2014 at 11:49:37PM +0000, Samuel Neves wrote:
> > You can get decent bit reversal speed using PSHUFB, by reversing up to
> > 32 4-bit chunks per cycle.
> 
> Yes, but unfortunately this adds overhead of going from scalar to vector
> and back (we need to bit-reverse a single 32-bit or 64-bit value), and
> SIMD instructions generally have lower throughput than scalar ones.  Yet
> it might be the fastest approach.  Will need to benchmark.

I haven't benchmarked yet, but I realized that it's actually OK that
this works in vector form, since we readily know what further indices
needing reversal will be.  We can simply perform the vectorized bit
reversal operation every 4 loop iterations, or whatever matches the
vector size.  In fact, a smart (but overly complicated?) implementation
could split the full vector width into fields of arbitrary length in
bytes, so e.g. while i <= 0xff, we'll only do one SIMD bit reversal
every 16 loop iterations (then extract bytes from the resulting vector,
one byte per loop iteration).  For i <= 0xffff, it's one bit reversal
every 8 iterations, then for i <= 0xffffff it's one every 5, and so on.
(Perhaps including the optimization for the case of i <= 0xff is not
worth it, as that's too few bit reversal operations to matter anyway.
Maybe none of this optimization is worth it, though.)

> > Considering other architectures have native bit reversal instructions,

I just recalled that AMD XOP's VPPERM, which is an extension of PSHUFB,
can be directly used to perform bit reversal.  Like PSHUFB, it operates
on bytes rather than bits, but it has more special modes, including bit
reversal (within the bytes, which can also be reversed by the same
instruction).  (It can even do bit reversal and inversion at once, but
we probably don't need that.)

> > it may be an acceptable solution compared to an ad hoc permutation.
> 
> I agree.

I've attached a program I wrote, testing bit reversal with VPPERM and
PSHUFB.  It seems to work fine.  Can the SSSE3 case be optimized further?

$ gcc bitrev.c -o bitrev -s -O2 -mxop; size bitrev; ./bitrev
   text    data     bss     dec     hex filename
   1299     504      16    1819     71b bitrev
a5a55a5a 2480137f 1e6a2c48 ff33ccaa
$ gcc bitrev.c -o bitrev -s -O2 -mavx; size bitrev; ./bitrev
   text    data     bss     dec     hex filename
   1427     504      16    1947     79b bitrev
a5a55a5a 2480137f 1e6a2c48 ff33ccaa
$ gcc bitrev.c -o bitrev -s -O2 -mssse3; size bitrev; ./bitrev
   text    data     bss     dec     hex filename
   1443     504      16    1963     7ab bitrev
a5a55a5a 2480137f 1e6a2c48 ff33ccaa

BTW, reusing the value of "sel" that I define for VPPERM also for PSHUFB
works fine - those extra bits are simply ignored, and the instruction
appears to be defined to ignore them (so it's probably behavior we can
rely upon) - but I don't do that in this program just in case (and to
avoid possibly including a comment on that).  The only savings would be
in source code, not in compiled code (nor data).

Alexander

View attachment "bitrev.c" of type "text/x-c" (1055 bytes)