Message-ID: <20161222035549.10827.qmail@ns.sciencehorizons.net>
Date: 21 Dec 2016 22:55:49 -0500
From: "George Spelvin" <linux@...encehorizons.net>
To: ak@...ux.intel.com, davem@...emloft.net, David.Laight@...lab.com,
djb@...yp.to, ebiggers3@...il.com, eric.dumazet@...il.com,
hannes@...essinduktion.org, Jason@...c4.com,
jeanphilippe.aumasson@...il.com,
kernel-hardening@...ts.openwall.com, linux-crypto@...r.kernel.org,
linux-kernel@...r.kernel.org, linux@...encehorizons.net,
luto@...capital.net, netdev@...r.kernel.org, tom@...bertland.com,
torvalds@...ux-foundation.org, tytso@....edu,
vegard.nossum@...il.com
Subject: Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
> Plus the benchmark was bogus anyway, and when I built a more specific
> harness -- actually comparing the TCP sequence number functions --
> SipHash was faster than MD5, even on register starved x86. So I think
> we're fine and this chapter of the discussion can come to a close, in
> order to move on to more interesting things.
Do we have to go through this? No, the benchmark was *not* bogus.
Here are my results from *your* benchmark. I can't reboot some of my
test machines, so I took net/core/secure_seq.c, lib/siphash.c, lib/md5.c
and include/linux/siphash.h straight out of your test tree.
Then I replaced the kernel #includes with the necessary typedefs
and #defines to make it compile in user-space. (Voluminous but
straightforward.) E.g.
#define __aligned(x) __attribute__((__aligned__(x)))
#define ____cacheline_aligned __aligned(64)
#define CONFIG_INET 1
#define IS_ENABLED(x) 1
#define ktime_get_real_ns() 0
#define sysctl_tcp_timestamps 0
... etc.
Then I modified your benchmark code into the appended code. The
differences are:
* I didn't iterate 100K times, I timed the functions *once*.
* I saved the times in a buffer and printed them all at the end
so printf() wouldn't pollute the caches.
* Before every even-numbered iteration, I flushed the I-cache
of everything from _init to _fini (i.e. all the non-library code).
This cold-cache case is what is going to happen in the kernel.
In the results below, note that I did *not* re-flush between phases
of the test.  The effects of caching are clearly apparent in the tcpv4
results, where the tcpv6 code loaded the cache.
You can also see that the SipHash code benefits more from caching when
entered with a cold cache, as it iterates over the input words, while
the MD5 code is one big unrolled blob.
Order of computation is down the columns first, across second.
The P4 results were:
tcpv6 md5 cold: 4084 3488 3584 3584 3568
tcpv4 md5 cold: 1052 996 996 1060 996
tcpv6 siphash cold: 4080 3296 3312 3296 3312
tcpv4 siphash cold: 2968 2748 2972 2716 2716
tcpv6 md5 hot: 900 712 712 712 712
tcpv4 md5 hot: 632 672 672 672 672
tcpv6 siphash hot: 2484 2292 2340 2340 2340
tcpv4 siphash hot: 1660 1560 1564 2340 1564
SipHash actually wins slightly in the cold-cache case, because
it iterates more. In the hot-cache case, it loses horribly.
Core 2 duo:
tcpv6 md5 cold: 3396 2868 2964 3012 2832
tcpv4 md5 cold: 1368 1044 1320 1332 1308
tcpv6 siphash cold: 2940 2952 2916 2448 2604
tcpv4 siphash cold: 3192 2988 3576 3504 3624
tcpv6 md5 hot: 1116 1032 996 1008 1008
tcpv4 md5 hot: 936 936 936 936 936
tcpv6 siphash hot: 1200 1236 1236 1188 1188
tcpv4 siphash hot: 936 804 804 804 804
Pretty much a tie, honestly.
Ivy Bridge:
tcpv6 md5 cold: 6086 6136 6962 6358 6060
tcpv4 md5 cold: 816 732 1046 1054 1012
tcpv6 siphash cold: 3756 1886 2152 2390 2566
tcpv4 siphash cold: 3264 2108 3026 3120 3526
tcpv6 md5 hot: 1062 808 824 824 832
tcpv4 md5 hot: 730 730 740 748 748
tcpv6 siphash hot: 960 952 936 1112 926
tcpv4 siphash hot: 638 544 562 552 560
Modern processors *hate* cold caches. But notice how md5 is *faster*
than SipHash on hot-cache IPv6.
Ivy Bridge, -m64:
tcpv6 md5 cold: 4680 3672 3956 3616 3525
tcpv4 md5 cold: 1066 1416 1179 1179 1134
tcpv6 siphash cold: 940 1258 1995 1609 2255
tcpv4 siphash cold: 1440 1269 1292 1870 1621
tcpv6 md5 hot: 1372 1111 1122 1088 1088
tcpv4 md5 hot: 997 997 997 997 998
tcpv6 siphash hot: 340 340 340 352 340
tcpv4 siphash hot: 227 238 238 238 238
Of course, when you compile -m64, SipHash is unbeatable.
Here's the modified benchmark() code. The entire package is
a bit voluminous for the mailing list, but anyone is welcome to it.
static void clflush(void)
{
	extern char const _init, _fini;
	char const *p = &_init;

	while (p < &_fini) {
		asm("clflush %0" : : "m" (*p));
		p += 64;
	}
}
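(One caveat, as an aside: CLFLUSH is only ordered by MFENCE, so some
flushes may still be in flight when the timed region starts.  A stricter
variant, sketched here over an arbitrary buffer rather than the
_init.._fini range, fences after the loop:)

```c
#include <stddef.h>

/* Flush [buf, buf+len) from all cache levels, assuming 64-byte lines.
 * CLFLUSH is only ordered by MFENCE, so fence afterward to ensure the
 * flushes have completed before the caller starts timing. */
static void clflush_range(char const *buf, size_t len)
{
	char const *p;

	for (p = buf; p < buf + len; p += 64)
		__asm__ volatile("clflush %0" : : "m" (*p));
	__asm__ volatile("mfence" ::: "memory");
}
```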
typedef uint32_t cycles_t;

static cycles_t get_cycles(void)
{
	uint32_t eax, edx;

	asm volatile("rdtsc" : "=a" (eax), "=d" (edx));
	return eax;
}
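(Another aside, not part of the test as run: plain RDTSC is not a
serializing instruction, so the CPU can reorder it relative to the code
being timed.  A common variant pairs it with LFENCE and keeps the full
64-bit EDX:EAX count; a sketch:)

```c
#include <stdint.h>

/* Serialized 64-bit cycle read: the LFENCE keeps RDTSC from being
 * hoisted above earlier loads, and returning the full EDX:EAX count
 * means back-to-back reads never appear to go backwards on a 32-bit
 * wraparound. */
static uint64_t get_cycles64(void)
{
	uint32_t eax, edx;

	__asm__ volatile("lfence\n\trdtsc" : "=a" (eax), "=d" (edx));
	return ((uint64_t)edx << 32) | eax;
}
```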
static int benchmark(void)
{
	cycles_t start, finish;
	int i;
	u32 seq_number = 0;
	__be32 saddr6[4] = { 1, 4, 182, 393 }, daddr6[4] = { 9192, 18288, 2222222, 0xffffff10 };
	__be32 saddr4 = 28888, daddr4 = 182112;
	__be16 sport = 22, dport = 41992;
	u32 tsoff;
	cycles_t result[4];

	printf("seq num benchmark\n");
	for (i = 0; i < 10; i++) {
		if ((i & 1) == 0)
			clflush();
		start = get_cycles();
		seq_number += secure_tcpv6_sequence_number_md5(saddr6, daddr6, sport, dport, &tsoff);
		finish = get_cycles();
		result[0] = finish - start;

		start = get_cycles();
		seq_number += secure_tcp_sequence_number_md5(saddr4, daddr4, sport, dport, &tsoff);
		finish = get_cycles();
		result[1] = finish - start;

		start = get_cycles();
		seq_number += secure_tcpv6_sequence_number(saddr6, daddr6, sport, dport, &tsoff);
		finish = get_cycles();
		result[2] = finish - start;

		start = get_cycles();
		seq_number += secure_tcp_sequence_number(saddr4, daddr4, sport, dport, &tsoff);
		finish = get_cycles();
		result[3] = finish - start;

		printf("* Iteration %d results:\n", i);
		printf("secure_tcpv6_sequence_number_md5# cycles: %u\n", result[0]);
		printf("secure_tcp_sequence_number_md5# cycles: %u\n", result[1]);
		printf("secure_tcpv6_sequence_number_siphash# cycles: %u\n", result[2]);
		printf("secure_tcp_sequence_number_siphash# cycles: %u\n", result[3]);
		printf("benchmark result: %u\n", seq_number);
	}
	printf("benchmark result: %u\n", seq_number);
	return 0;
}
//device_initcall(benchmark);
int
main(void)
{
	memset(net_secret, 0xff, sizeof net_secret);
	memset(net_secret_md5, 0xff, sizeof net_secret_md5);
	return benchmark();
}