[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <466B46D5.1020004@cs.cmu.edu>
Date: Sat, 09 Jun 2007 20:33:25 -0400
From: Benjamin Gilbert <bgilbert@...cmu.edu>
To: Jeff Garzik <jeff@...zik.org>
CC: Matt Mackall <mpm@...enic.com>, akpm@...ux-foundation.org,
herbert@...dor.apana.org.au, linux-crypto@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
Jeff Garzik wrote:
> Matt Mackall wrote:
>> Have you benchmarked this against lib/sha1.c? Please post the results.
>> Until then, I'm frankly skeptical that your unrolled version is faster
>> because when I introduced lib/sha1.c the rolled version therein won by
>> a significant margin and had 1/10th the cache footprint.
See the benchmark tables in patch 0 at the head of this thread.
Performance improved by at least 25% in every test, and 40-60% was more
common for the 32-bit version (on a Pentium IV).
It's not just the loop unrolling; it's the register allocation and
spilling. For comparison, I built SHATransform() from the
drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and
SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty
close to what you tested back then. The resulting code is 49% MOV
instructions, and 80% of *those* involve memory. gcc4 is somewhat
better, but it still spills a whole lot, both for the 2.6.11 unrolled
code and for the current lib/sha1.c.
In contrast, the assembly implementation in this patch only has to go to
memory for data and workspace (with one small exception in the F3
rounds), and the workspace has a fifth of the cache footprint of the
default implementation.
> Yes. And it also depends on the CPU as well. Testing on a server-class
> x86 CPU (often with bigger L2, and perhaps even L1, cache) will produce
> different result than from popular but less-capable "value" CPUs.
Good point. I benchmarked the 32-bit assembly code on a couple more boxes:
=== AMD Duron, average of 5 trials ===
Test# Bytes/ Bytes/ Cyc/B Cyc/B Change
block update (C) (asm)
0 16 16 104 72 31%
1 64 16 52 36 31%
2 64 64 45 29 36%
3 256 16 33 23 30%
4 256 64 27 17 37%
5 256 256 24 14 42%
6 1024 16 29 20 31%
7 1024 256 20 11 45%
8 1024 1024 19 11 42%
9 2048 16 28 20 29%
10 2048 256 19 11 42%
11 2048 1024 18 10 44%
12 2048 2048 18 10 44%
13 4096 16 28 19 32%
14 4096 256 18 10 44%
15 4096 1024 18 10 44%
16 4096 4096 18 10 44%
17 8192 16 27 19 30%
18 8192 256 18 10 44%
19 8192 1024 18 10 44%
20 8192 4096 17 10 41%
21 8192 8192 17 10 41%
=== Classic Pentium, average of 5 trials ===
Test# Bytes/ Bytes/ Cyc/B Cyc/B Change
block update (C) (asm)
0 16 16 145 144 1%
1 64 16 72 61 15%
2 64 64 65 52 20%
3 256 16 46 39 15%
4 256 64 39 32 18%
5 256 256 36 29 19%
6 1024 16 40 33 18%
7 1024 256 30 23 23%
8 1024 1024 29 23 21%
9 2048 16 39 32 18%
10 2048 256 29 22 24%
11 2048 1024 28 22 21%
12 2048 2048 28 22 21%
13 4096 16 38 32 16%
14 4096 256 28 22 21%
15 4096 1024 28 21 25%
16 4096 4096 27 21 22%
17 8192 16 38 32 16%
18 8192 256 28 22 21%
19 8192 1024 28 21 25%
20 8192 4096 27 21 22%
21 8192 8192 27 21 22%
The improvement isn't as good, but it's still noticeable.
--Benjamin Gilbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists