Message-ID: <20140419013346.GC18678@openwall.com>
Date: Sat, 19 Apr 2014 05:33:46 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Non-temporal writes and uninitialized memory
On Fri, Apr 18, 2014 at 11:53:33AM -0400, Bill Cox wrote:
> I've been banging my head against a crazy problem for some time. Using
> temporal writes, I should be able to speed up TwoCats. Nope! Nothing
> worked, and I tried many combinations.
>
> Here's what I think is going on. When I write hash data to a block of
> uninitialized memory that I allocated with malloc (or posix_memalign),
> somehow the CPU knows this, and therefore it does not bother to read the
> cache line, modify it, and write it, like it normally does. Instead, it
> just buffers writes until a cache line is full, and then it writes that
> cache line to cache.
No, that's not it. Here's my understanding:

When you write to newly allocated memory, you incur page faults, which
result in physical memory pages getting mapped to those addresses. The
kernel zeroes out old data on those pages at this time, in order to
avoid leaking potentially sensitive info (kernel's own and other
processes') to userspace. As a side-effect, the most recently mapped
page is already in cache by the time control returns to our userspace
code. With 4 KiB pages, it's L1 cache. With 2 MiB pages, it's L2+L3.
This is what makes read-modify-write faster than it would have been on
previously used memory (by the same process).
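A minimal sketch of how one might observe this first-touch behavior, assuming Linux, an anonymous mmap() and 4 KiB pages. The helper names (minor_faults, fresh_region, touch_pages) are made up for illustration. It only counts minor page faults and does not measure cache residency:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Minor page faults taken by this process so far. */
static long minor_faults(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    return ru.ru_minflt;
}

/* Fresh anonymous mapping; physical pages are only mapped on first touch. */
static unsigned char *fresh_region(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (unsigned char *)p;
}

/* Write one byte at the start of every 4 KiB page (assumed page size)
 * and return the number of minor faults incurred while doing so. */
static long touch_pages(unsigned char *p, size_t len)
{
    volatile unsigned char *vp = p;     /* keep the stores from being elided */
    long before = minor_faults();
    for (size_t i = 0; i < len; i += 4096)
        vp[i] = 1;
    return minor_faults() - before;
}
```

Calling touch_pages() twice on the same region should show the faults on the first pass and few or none on the second. Bytes that were never written read back as zero, which is the kernel's zeroing at work. With transparent huge pages the fault count per pass will be much lower.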
Now, whether read-modify-write actually occurs on write-only accesses is
not certain. Write combining, store buffers, and line fill buffers may
help avoid those unneeded reads when the line is filled by the writes
quickly enough (before the store buffer or LFB would need to be reused?).
I failed to find reliable info on this, though. The implied reads are
definitely avoided for memory regions explicitly configured as write
combining, which is primarily used for accessing graphics card memory
(where reads would be extremely slow), but that's not our case here.

A similarly curious question: does the zeroing of memory pages by the
kernel incur unneeded reads? I hope not.
> Temporal loads for some reason never help at all.
I think you mean non-temporal, _mm_stream_load_si128().
I only experimented with them after having added prefetches, so in my
case non-temporal load instructions were redundant with the previously
used _MM_HINT_NTA (only used on the ROM, when one is being used).
_MM_HINT_NTA does help on Bulldozer. I think non-temporal load
instructions would similarly help on Bulldozer if I were not already
using _MM_HINT_NTA where appropriate.
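For illustration only, here is a toy sketch of prefetching with _MM_HINT_NTA ahead of lookups in a read-only table. It is not the actual yescrypt code: the function name, the LCG used to pick indices, and the one-lookup-ahead scheme are all assumptions. It assumes x86 with SSE and a table size that is a power of two:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>          /* _mm_prefetch, _MM_HINT_NTA */

/* Sum 64-bit words of a read-only table at pseudo-random indices.
 * When use_prefetch is nonzero, the next word to be read is prefetched
 * with the non-temporal hint while the current one is being consumed. */
static uint64_t sum_lookups(const uint64_t *rom, size_t n_words,
                            size_t lookups, int use_prefetch)
{
    assert(n_words != 0 && (n_words & (n_words - 1)) == 0);
    const size_t mask = n_words - 1;
    uint64_t state = 1, acc = 0;

    /* Toy index generator (an LCG), independent of the data read. */
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    size_t cur = (size_t)(state >> 32) & mask;

    for (size_t i = 0; i < lookups; i++) {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        size_t next = (size_t)(state >> 32) & mask;
        if (use_prefetch)
            _mm_prefetch((const char *)&rom[next], _MM_HINT_NTA);
        acc += rom[cur];
        cur = next;
    }
    return acc;
}
```

The prefetch is only a hint, so the result is the same with or without it. Whether it helps is something to measure per CPU.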
> Here's the temporal
> write instruction I use to speed up writing to previously initialized
> memory:
>
> _mm_stream_si128(p++, value);
Yeah, I had tried that too. No luck.
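For reference, a minimal sketch of the two store variants one might compare. It assumes x86 with SSE2 and a 16-byte aligned destination, and the function names are hypothetical. It shows only the mechanics and makes no claim about which is faster:

```c
#include <assert.h>
#include <emmintrin.h>          /* SSE2: _mm_store_si128, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>

/* Ordinary (temporal) aligned stores: lines are brought into cache. */
static void fill_cached(__m128i *dst, size_t n, __m128i value)
{
    assert(((uintptr_t)dst & 15) == 0);
    for (size_t i = 0; i < n; i++)
        _mm_store_si128(dst + i, value);
}

/* Non-temporal stores: hint that the data should bypass the caches. */
static void fill_streaming(__m128i *dst, size_t n, __m128i value)
{
    assert(((uintptr_t)dst & 15) == 0);
    for (size_t i = 0; i < n; i++)
        _mm_stream_si128(dst + i, value);
    /* Order the weakly-ordered streaming stores before later stores. */
    _mm_sfence();
}
```

Both functions leave identical contents in memory. Any difference would be in timing and in what ends up in cache, which has to be benchmarked on the target CPU and on both fresh and previously touched memory.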
> TwoCats currently has no method for writing to previously initialized
> memory, so it's no help to me. Some of the other entries, like Yescript
> and Lyra2 should be able to benefit from it, but only in the second loop,
> not in the first.
When YESCRYPT_RW is set, yescrypt's second loop writes only to the same
V_j that has just been read, so it's already in cache. When YESCRYPT_RW
is not set, yescrypt's second loop only reads.
In the first loop, each page being written to has just been zeroed by
the kernel, so it's in cache.
Alexander