Message-ID: <20140111173014.GA6636@openwall.com>
Date: Sat, 11 Jan 2014 21:30:14 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] escrypt memory access speed (Re: [PHC] Reworked KDF available on github for feedback: NOELKDF)

Bill,

On Sat, Jan 04, 2014 at 10:07:58AM +0400, Solar Designer wrote:
> In escrypt running in this mode, SMix first loop has sequential writes
> and random reads, and second loop has random reads only. So it's 3*N
> memory accesses (not counting accesses to data that is presumed to be in
> L1 cache).

BTW, that data "presumed to be in L1 cache", such as the r-sized X and Y
buffers used in the second loop, may occasionally be evicted and written
back to memory, thereby wasting memory bandwidth.  It is unfortunate
that we only have caches and no explicitly addressable local memory.

On Sat, Jan 11, 2014 at 09:01:20PM +0400, Solar Designer wrote:
> 2*3*10*2^30/10^9/5.362 = ~12 GB/s
>
> I suspect that some of the memory bandwidth might be wasted on reading
> from to-be-written-to memory locations into cache, before the
> corresponding cache lines are finally complete with the newly written
> data and are written out back to memory.

With these two undesired effects combined, about one half of the memory
bandwidth might be wasted.
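
One way to avoid that read-for-ownership traffic, which might be worth
testing here, is to use non-temporal (streaming) stores: they write
whole cache lines to memory without first fetching them into cache.
Here's a minimal sketch using SSE2 intrinsics; store_block_nt and its
arguments are names I made up for illustration, not escrypt's actual
code:

#include <emmintrin.h> /* SSE2: __m128i, _mm_stream_si128, _mm_sfence */

/*
 * Write one 64-byte block with non-temporal stores.  Since these bypass
 * the cache hierarchy, the CPU does not fetch the destination cache
 * line first (no read-for-ownership), so a pure write costs only write
 * bandwidth.  dst must be 16-byte aligned, and the four stores together
 * cover exactly one cache line.
 */
static void store_block_nt(__m128i *dst, const __m128i x[4])
{
	_mm_stream_si128(&dst[0], x[0]);
	_mm_stream_si128(&dst[1], x[1]);
	_mm_stream_si128(&dst[2], x[2]);
	_mm_stream_si128(&dst[3], x[3]);
	/* _mm_sfence() is only needed before another thread reads the
	 * data; same-thread reads are fine without it. */
}

The catch is that streaming stores also keep the freshly written data
out of cache, which should be fine for V elements that will next be
read much later (and randomly), but would hurt for anything reused
soon, so this needs benchmarking rather than blind use.
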
Here are two examples of how gcc places the memory write instructions:
3a7: c5 f1 fe ed vpaddd %xmm5,%xmm1,%xmm5
3ab: c5 e1 fe e2 vpaddd %xmm2,%xmm3,%xmm4
3af: c5 f9 70 db 39 vpshufd $0x39,%xmm3,%xmm3
--> 3b4: c5 f8 29 6e 30 vmovaps %xmm5,0x30(%rsi)
3b9: c5 f9 70 d2 4e vpshufd $0x4e,%xmm2,%xmm2
3be: 8f e8 78 c2 e4 12 vprotd $0x12,%xmm4,%xmm4
3c4: c5 39 ef c4 vpxor %xmm4,%xmm8,%xmm8
3c8: c5 e1 fe ff vpaddd %xmm7,%xmm3,%xmm7
3cc: c5 e9 fe f6 vpaddd %xmm6,%xmm2,%xmm6
3d0: c5 f8 28 e5 vmovaps %xmm5,%xmm4
3d4: c5 b9 fe c0 vpaddd %xmm0,%xmm8,%xmm0
3d8: c5 78 28 cf vmovaps %xmm7,%xmm9
--> 3dc: c5 f8 29 7e 10 vmovaps %xmm7,0x10(%rsi)
3e1: c5 78 28 c6 vmovaps %xmm6,%xmm8
3e5: c5 78 28 d8 vmovaps %xmm0,%xmm11
--> 3e9: c5 f8 29 06 vmovaps %xmm0,(%rsi)
--> 3ed: c5 f8 29 76 20 vmovaps %xmm6,0x20(%rsi)
3f2: 0f 84 58 05 00 00 je 950 <blockmix_salsa8+0x8d0>

2221: c5 f1 fe ed vpaddd %xmm5,%xmm1,%xmm5
2225: c5 e1 fe e2 vpaddd %xmm2,%xmm3,%xmm4
2229: c5 f9 70 db 39 vpshufd $0x39,%xmm3,%xmm3
--> 222e: c5 f8 29 6c 02 70 vmovaps %xmm5,0x70(%rdx,%rax,1)
2234: c5 f9 70 d2 4e vpshufd $0x4e,%xmm2,%xmm2
2239: 8f e8 78 c2 e4 12 vprotd $0x12,%xmm4,%xmm4
223f: c5 39 ef c4 vpxor %xmm4,%xmm8,%xmm8
2243: c5 e1 fe ff vpaddd %xmm7,%xmm3,%xmm7
2247: c5 e9 fe f6 vpaddd %xmm6,%xmm2,%xmm6
224b: c5 f8 28 e5 vmovaps %xmm5,%xmm4
224f: c5 b9 fe c0 vpaddd %xmm0,%xmm8,%xmm0
--> 2253: c5 f8 29 7c 02 50 vmovaps %xmm7,0x50(%rdx,%rax,1)
2259: c5 f8 28 df vmovaps %xmm7,%xmm3
--> 225d: c5 f8 29 44 02 40 vmovaps %xmm0,0x40(%rdx,%rax,1)
2263: c5 78 28 d0 vmovaps %xmm0,%xmm10
2267: c5 f8 28 d6 vmovaps %xmm6,%xmm2
--> 226b: c5 f8 29 74 02 60 vmovaps %xmm6,0x60(%rdx,%rax,1)
2271: 48 83 c0 40 add $0x40,%rax
2275: 4c 39 f9 cmp %r15,%rcx
2278: 0f 87 22 fa ff ff ja 1ca0 <blockmix_salsa8_xor_save+0x400>

As you can see, these are not very close together, and the first of the
four writes in each of these two examples happens to be to the end of
the cache line, not the beginning, even though in the source code these
writes are adjacent and the one to the lowest address comes first.
Clearly, gcc rescheduled them to better accommodate other instructions'
latencies (so as not to attempt a write before its source operand is
out of the pipeline).

Now, I could pack them closer together with asm code, but then the CPU
would probably similarly re-order them based on source operand
readiness.

Maybe replacing or following Salsa20 with something that produces the
outputs in the order that matches the memory write order would do the
trick.  Wait, we can simply write the outputs in whatever order matches
Salsa20 (or whatever) best.  This should be easy to test.  From the
gcc-generated code above, it appears the optimal order would be 3,1,0,2.
Of course, this will break compatibility with scrypt, so it is only
usable outside of classic scrypt compatibility mode.
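
For illustration, here's a rough sketch of what such a permuted store
order could look like; the function and variable names are made up, and
the 3,1,0,2 order is merely what the disassembly above suggests:

#include <emmintrin.h> /* __m128i */

/*
 * Hypothetical sketch, not escrypt's actual code: x[0..3] are the four
 * 16-byte quarters of a 64-byte Salsa20 output block, indexed as in
 * the classic scrypt layout.  Per the gcc-generated code above they
 * become ready in the order 3,1,0,2, so store them to increasing
 * addresses in that order.  Whatever reads the block back must use the
 * same permuted layout (or undo it), which is what breaks scrypt
 * compatibility.
 */
static void store_block_permuted(__m128i *dst, const __m128i x[4])
{
	dst[0] = x[3]; /* ready first */
	dst[1] = x[1];
	dst[2] = x[0];
	dst[3] = x[2]; /* ready last */
}

Whether the compiler then actually emits these stores front-to-back
within the cache line would still need to be verified in the generated
code.
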
Alexander