Message-ID: <20140111173014.GA6636@openwall.com>
Date: Sat, 11 Jan 2014 21:30:14 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] escrypt memory access speed (Re: [PHC] Reworked KDF available on github for feedback: NOELKDF)

Bill,

On Sat, Jan 04, 2014 at 10:07:58AM +0400, Solar Designer wrote:
> In escrypt running in this mode, SMix first loop has sequential writes
> and random reads, and second loop has random reads only. So it's 3*N
> memory accesses (not counting accesses to data that is presumed to be in
> L1 cache).

BTW, that data "presumed to be in L1 cache", such as the r-sized X and Y
buffers used in the second loop, may occasionally be evicted and written
back to memory, thereby wasting memory bandwidth.  It is unfortunate
that we only have caches and no explicitly addressable local memory.

On Sat, Jan 11, 2014 at 09:01:20PM +0400, Solar Designer wrote:
> 2*3*10*2^30/10^9/5.362 = ~12 GB/s
>
> I suspect that some of the memory bandwidth might be wasted on reading
> from to-be-written-to memory locations into cache, before the
> corresponding cache lines are finally complete with the newly written
> data and are written out back to memory.

With these two undesired effects combined, about one half of the memory
bandwidth might be wasted.
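
One way to avoid that read-for-ownership traffic, which might be worth
testing here, is to use non-temporal (streaming) stores: they write
whole cache lines to memory without first fetching them into cache.
Here's a minimal sketch using SSE2 intrinsics; store_block_nt and its
arguments are names I made up for illustration, not escrypt's actual
code:

#include <emmintrin.h> /* SSE2: __m128i, _mm_stream_si128, _mm_sfence */

/*
 * Write one 64-byte block with non-temporal stores.  Since these bypass
 * the cache hierarchy, the CPU does not fetch the destination cache
 * line first (no read-for-ownership), so a pure write costs only write
 * bandwidth.  dst must be 16-byte aligned, and the four stores together
 * cover exactly one cache line.
 */
static void store_block_nt(__m128i *dst, const __m128i x[4])
{
	_mm_stream_si128(&dst[0], x[0]);
	_mm_stream_si128(&dst[1], x[1]);
	_mm_stream_si128(&dst[2], x[2]);
	_mm_stream_si128(&dst[3], x[3]);
	/* _mm_sfence() is only needed before another thread reads the
	 * data; same-thread reads are fine without it. */
}

The catch is that streaming stores also keep the freshly written data
out of cache, which should be fine for V elements that will next be
read much later (and randomly), but would hurt for anything reused
soon, so this needs benchmarking rather than blind use.
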
Here are two examples of how gcc places the memory write instructions:
3a7: c5 f1 fe ed vpaddd %xmm5,%xmm1,%xmm5
3ab: c5 e1 fe e2 vpaddd %xmm2,%xmm3,%xmm4
3af: c5 f9 70 db 39 vpshufd $0x39,%xmm3,%xmm3
--> 3b4: c5 f8 29 6e 30 vmovaps %xmm5,0x30(%rsi)
3b9: c5 f9 70 d2 4e vpshufd $0x4e,%xmm2,%xmm2
3be: 8f e8 78 c2 e4 12 vprotd $0x12,%xmm4,%xmm4
3c4: c5 39 ef c4 vpxor %xmm4,%xmm8,%xmm8
3c8: c5 e1 fe ff vpaddd %xmm7,%xmm3,%xmm7
3cc: c5 e9 fe f6 vpaddd %xmm6,%xmm2,%xmm6
3d0: c5 f8 28 e5 vmovaps %xmm5,%xmm4
3d4: c5 b9 fe c0 vpaddd %xmm0,%xmm8,%xmm0
3d8: c5 78 28 cf vmovaps %xmm7,%xmm9
--> 3dc: c5 f8 29 7e 10 vmovaps %xmm7,0x10(%rsi)
3e1: c5 78 28 c6 vmovaps %xmm6,%xmm8
3e5: c5 78 28 d8 vmovaps %xmm0,%xmm11
--> 3e9: c5 f8 29 06 vmovaps %xmm0,(%rsi)
--> 3ed: c5 f8 29 76 20 vmovaps %xmm6,0x20(%rsi)
3f2: 0f 84 58 05 00 00 je 950 <blockmix_salsa8+0x8d0>

2221: c5 f1 fe ed vpaddd %xmm5,%xmm1,%xmm5
2225: c5 e1 fe e2 vpaddd %xmm2,%xmm3,%xmm4
2229: c5 f9 70 db 39 vpshufd $0x39,%xmm3,%xmm3
--> 222e: c5 f8 29 6c 02 70 vmovaps %xmm5,0x70(%rdx,%rax,1)
2234: c5 f9 70 d2 4e vpshufd $0x4e,%xmm2,%xmm2
2239: 8f e8 78 c2 e4 12 vprotd $0x12,%xmm4,%xmm4
223f: c5 39 ef c4 vpxor %xmm4,%xmm8,%xmm8
2243: c5 e1 fe ff vpaddd %xmm7,%xmm3,%xmm7
2247: c5 e9 fe f6 vpaddd %xmm6,%xmm2,%xmm6
224b: c5 f8 28 e5 vmovaps %xmm5,%xmm4
224f: c5 b9 fe c0 vpaddd %xmm0,%xmm8,%xmm0
--> 2253: c5 f8 29 7c 02 50 vmovaps %xmm7,0x50(%rdx,%rax,1)
2259: c5 f8 28 df vmovaps %xmm7,%xmm3
--> 225d: c5 f8 29 44 02 40 vmovaps %xmm0,0x40(%rdx,%rax,1)
2263: c5 78 28 d0 vmovaps %xmm0,%xmm10
2267: c5 f8 28 d6 vmovaps %xmm6,%xmm2
--> 226b: c5 f8 29 74 02 60 vmovaps %xmm6,0x60(%rdx,%rax,1)
2271: 48 83 c0 40 add $0x40,%rax
2275: 4c 39 f9 cmp %r15,%rcx
2278: 0f 87 22 fa ff ff ja 1ca0 <blockmix_salsa8_xor_save+0x400>

As you can see, these are not very close together, and the first of the
four writes in each of these two examples happens to be to the end of
the cache line, not the beginning, even though in the source code these
writes are adjacent and the one to the lowest address comes first.
Clearly, gcc rescheduled them to better accommodate other instructions'
latencies (so as not to attempt a write before its source operand is
out of the pipeline).

Now, I could pack them closer together with asm code, but then the CPU
would probably similarly re-order them based on source operand
readiness.

Maybe replacing or following Salsa20 with something that produces the
outputs in the order that matches the memory write order would do the
trick.  Wait, we can simply write the outputs in whatever order matches
Salsa20 (or whatever) best.  This should be easy to test.  From the
gcc-generated code above, it appears the optimal order would be 3,1,0,2.
Of course, this will break compatibility with scrypt, so it is only
usable outside of classic scrypt compatibility mode.
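
For illustration, here's a rough sketch of what such a permuted store
order could look like; the function and variable names are made up, and
the 3,1,0,2 order is merely what the disassembly above suggests:

#include <emmintrin.h> /* __m128i */

/*
 * Hypothetical sketch, not escrypt's actual code: x[0..3] are the four
 * 16-byte quarters of a 64-byte Salsa20 output block, indexed as in
 * the classic scrypt layout.  Per the gcc-generated code above they
 * become ready in the order 3,1,0,2, so store them to increasing
 * addresses in that order.  Whatever reads the block back must use the
 * same permuted layout (or undo it), which is what breaks scrypt
 * compatibility.
 */
static void store_block_permuted(__m128i *dst, const __m128i x[4])
{
	dst[0] = x[3]; /* ready first */
	dst[1] = x[1];
	dst[2] = x[0];
	dst[3] = x[2]; /* ready last */
}

Whether the compiler then actually emits these stores front-to-back
within the cache line would still need to be verified in the generated
code.
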
Alexander