[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140130013231.GA12613@openwall.com>
Date: Thu, 30 Jan 2014 05:32:31 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Combining sliding window and bit-reversal
On Tue, Jan 28, 2014 at 04:05:29AM +0400, Solar Designer wrote:
> On Mon, Jan 27, 2014 at 11:49:37PM +0000, Samuel Neves wrote:
> > You can get decent bit reversal speed using PSHUFB, by reversing up to
> > 32 4-bit chunks per cycle.
>
> Yes, but unfortunately this adds overhead of going from scalar to vector
> and back (we need to bit-reverse a single 32-bit or 64-bit value), and
> SIMD instructions generally have lower throughput than scalar ones. Yet
> it might be the fastest approach. Will need to benchmark.
I haven't benchmarked yet, but I realized that it's actually OK that
this works in vector form, since we readily know what further indices
needing reversal will be. We can simply perform the vectorized bit
reversal operation every 4 loop iterations, or whatever matches the
vector size. In fact, a smart (but overly complicated?) implementation
could split the full vector width into fields of arbitrary length in
bytes, so e.g. while i <= 0xff, we'll only do one SIMD bit reversal
every 16 loop iterations (then extract bytes from the resulting vector,
one byte per loop iteration). For i <= 0xffff, it's one bit reversal
every 8 iterations, then for i <= 0xffffff it's one every 5, and so on.
(Perhaps including the optimization for the case of i <= 0xff is not
worth it, as that's too few bit reversal operations to matter anyway.
Maybe none of this optimization is worth it, though.)
> > Considering other architectures have native bit reversal instructions,
I just recalled that AMD XOP's VPPERM, which is an extension of PSHUFB,
can be directly used to perform bit reversal. Like PSHUFB, it operates
on bytes rather than bits, but it has more special modes, including bit
reversal (within the bytes, which can also be reversed by the same
instruction). (It can even do bit reversal and inversion at once, but
we probably don't need that.)
> > it may be an acceptable solution compared to an ad hoc permutation.
>
> I agree.
I've attached a program I wrote, testing bit reversal with VPPERM and
PSHUFB. It seems to work fine. Can the SSSE3 case be optimized further?
$ gcc bitrev.c -o bitrev -s -O2 -mxop; size bitrev; ./bitrev
text data bss dec hex filename
1299 504 16 1819 71b bitrev
a5a55a5a 2480137f 1e6a2c48 ff33ccaa
$ gcc bitrev.c -o bitrev -s -O2 -mavx; size bitrev; ./bitrev
text data bss dec hex filename
1427 504 16 1947 79b bitrev
a5a55a5a 2480137f 1e6a2c48 ff33ccaa
$ gcc bitrev.c -o bitrev -s -O2 -mssse3; size bitrev; ./bitrev
text data bss dec hex filename
1443 504 16 1963 7ab bitrev
a5a55a5a 2480137f 1e6a2c48 ff33ccaa
BTW, reusing the value of "sel" that I define for VPPERM also for PSHUFB
works fine - those extra bits are simply ignored, and the instruction
appears to be defined to ignore them (so it's probably behavior we can
rely upon) - but I don't do that in this program just in case (and to
avoid possibly including a comment on that). The only savings would be
in source code, not in compiled code (nor data).
Alexander
View attachment "bitrev.c" of type "text/x-c" (1055 bytes)
Powered by blists - more mailing lists