Date: Mon, 12 Dec 2016 06:48:27 +0100
From: "Jason A. Donenfeld" <Jason@...c4.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: "kernel-hardening@...ts.openwall.com"
<kernel-hardening@...ts.openwall.com>,
LKML <linux-kernel@...r.kernel.org>,
Linux Crypto Mailing List <linux-crypto@...r.kernel.org>,
George Spelvin <linux@...izon.com>,
Scott Bauer <sbauer@....utah.edu>,
Andi Kleen <ak@...ux.intel.com>,
Andy Lutomirski <luto@...capital.net>,
Greg KH <gregkh@...uxfoundation.org>,
Jean-Philippe Aumasson <jeanphilippe.aumasson@...il.com>,
"Daniel J . Bernstein" <djb@...yp.to>
Subject: Re: [PATCH v2] siphash: add cryptographically secure hashtable function
Hey Linus,
On Mon, Dec 12, 2016 at 5:01 AM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
> The above is extremely inefficient. Considering that most kernel data
> would be expected to be smallish, that matters (ie the usual benchmark
> would not be about hashing megabytes of data, but instead millions of
> hashes of small data).
>
> I think this could be rewritten (at least for 64-bit architectures) as
>
> #ifdef CONFIG_DCACHE_WORD_ACCESS
>
>	if (left)
>		b |= le64_to_cpu(load_unaligned_zeropad(data) &
>				 bytemask_from_count(left));
>
> #else
>
>	.. do the duff's device thing with the switch() ..
>
> #endif
>
> which should give you basically perfect code generation (ie a single
> 64-bit load and a byte mask).
I modified the test to hash data of sizes 0 through 7 repeatedly,
100000000 times, and benchmarked that a few times on a Skylake laptop.
The `load_unaligned_zeropad & bytemask_from_count` version was
consistently 7% slower.

I then modified it again to simply hash a 4-byte constant repeatedly
1000000000 times. The `load_unaligned_zeropad & bytemask_from_count`
version was around 6% faster. I tried again with a 7-byte constant and
got more or less the same result.

Then I tried with a 1-byte constant, and found that the
`load_unaligned_zeropad & bytemask_from_count` version was slower.
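A minimal userspace loop of the same general shape might look like the
following. This is a from-memory sketch rather than the actual test
code; tail_bytes() is a hypothetical stand-in for the tail-handling
routines being compared, and the XOR accumulator just keeps the
compiler from optimizing the loop away:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Trivial byte-at-a-time tail fold standing in for the code under
 * test; the numbers above came from the real siphash variants. */
static uint64_t tail_bytes(const uint8_t *data, size_t left)
{
	uint64_t b = 0;
	for (size_t i = 0; i < left; i++)
		b |= (uint64_t)data[i] << (8 * i);
	return b;
}

int main(void)
{
	static const uint8_t buf[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
	uint64_t acc = 0;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < 100000000L; i++)
		acc ^= tail_bytes(buf, (size_t)(i & 7)); /* sizes 0..7 */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) +
		      (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.3fs (acc=%llu)\n", secs, (unsigned long long)acc);
	return 0;
}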
So it would seem that the `if (left)` and the `switch (left)` versions
contain the same number of branches. But for small values of `left`,
the Duff's-device switch just has simpler arithmetic, whereas for large
values of `left`, the `load_unaligned_zeropad` approach prevails. If
micro-optimization is really appealing, one could imagine a hybrid of
the two:
switch (left) {
case 7:
case 6:
case 5:
case 4:
	b |= le64_to_cpu(load_unaligned_zeropad(data) &
			 bytemask_from_count(left));
	break;
case 3: b |= ((u64)data[2]) << 16; /* fall through */
case 2: b |= ((u64)data[1]) << 8;  /* fall through */
case 1: b |= ((u64)data[0]); break;
case 0: break;
}
But I'm not sure this complication is worth it, and it may well be that
the left-over size is 4 bytes most of the time anyway, so we should
just use your trick on platforms that support it.
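For comparison, the plain byte-at-a-time switch referred to above would
look something like the following sketch (inferred from the hybrid's
lower cases rather than copied from the patch; little-endian layout,
and the wrapper function and its name are just for illustration):

#include <stdint.h>
#include <stddef.h>

typedef uint64_t u64;

/* Byte-at-a-time tail fold; the fall-through between cases is
 * intentional, so that `left` bytes accumulate into b. */
static u64 fold_tail_switch(u64 b, const uint8_t *data, size_t left)
{
	switch (left) {
	case 7: b |= ((u64)data[6]) << 48; /* fall through */
	case 6: b |= ((u64)data[5]) << 40; /* fall through */
	case 5: b |= ((u64)data[4]) << 32; /* fall through */
	case 4: b |= ((u64)data[3]) << 24; /* fall through */
	case 3: b |= ((u64)data[2]) << 16; /* fall through */
	case 2: b |= ((u64)data[1]) << 8;  /* fall through */
	case 1: b |= ((u64)data[0]); break;
	case 0: break;
	}
	return b;
}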
Jason