Message-ID: <20110814095910.GA18809@liondog.tnic>
Date: Sun, 14 Aug 2011 11:59:10 +0200
From: Borislav Petkov <bp@...en8.de>
To: Ingo Molnar <mingo@...e.hu>
Cc: melwyn lobo <linux.melwyn@...il.com>, linux-kernel@...r.kernel.org,
"H. Peter Anvin" <hpa@...or.com>,
Thomas Gleixner <tglx@...utronix.de>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
borislav.petkov@....com
Subject: Re: x86 memcpy performance
On Fri, Aug 12, 2011 at 09:52:20PM +0200, Ingo Molnar wrote:
> Sounds very interesting - it would be nice to see 'perf record' +
> 'perf report' profiles done on that workload, before and after your
> patches.
FWIW, I've been playing with an SSE memcpy version for the kernel recently
too; here's what I have so far:
First of all, I did a trace of all the memcpy buffer sizes used while
building a kernel; see the attached kernel_build.sizes.
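(In case somebody wants to collect a similar histogram: one possible way,
sketched below and not necessarily how this trace was actually done, is a
throwaway kprobes module on the out-of-line memcpy. Note that it only
catches calls the compiler doesn't inline, and the handler name is made up
here.)

#include <linux/module.h>
#include <linux/kprobes.h>

static struct kprobe kp = {
	.symbol_name	= "memcpy",
};

/* on x86-64, the len argument of memcpy(dst, src, len) arrives in %rdx */
static int memcpy_len_logger(struct kprobe *p, struct pt_regs *regs)
{
	trace_printk("memcpy len=%lu\n", regs->dx);
	return 0;
}

static int __init kp_init(void)
{
	kp.pre_handler = memcpy_len_logger;
	return register_kprobe(&kp);
}

static void __exit kp_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(kp_init);
module_exit(kp_exit);
MODULE_LICENSE("GPL");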
On the one hand, there is a large number of small chunks copied (1.1M of
1.2M calls total) and, on the other, a relatively small number of larger
copies in the 256-2048 byte range, about 100K in total, which nevertheless
account for the bulk of the data moved: 138MB of 175MB total. So, if the
copied buffer is big enough, the FPU context save/restore cost might be
something we're willing to pay.
I first implemented the SSE memcpy in userspace to measure the speedup vs.
the memcpy_64 version we have right now.
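The harness looked roughly like this (a simplified sketch: rdtsc-based
timing, one random size and misalignment per run; __sse_memcpy is the
userspace build of the routine below and the exact details may differ from
the real thing):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* the userspace build of the SSE routine below */
void *__sse_memcpy(void *to, const void *from, size_t len);

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	enum { ITER = 10000, MAX = 64 * 1024 };
	char *src = malloc(MAX + 16);
	char *dst = malloc(MAX + 16);
	size_t size = rand() % MAX;	/* buffer size under test */
	size_t soff = rand() % 16;	/* random src misalignment */
	size_t doff = rand() % 16;	/* random dst misalignment */
	uint64_t t, mm = 0, xm = 0;
	int i;

	memset(src, 0x5a, MAX + 16);

	for (i = 0; i < ITER; i++) {
		t = rdtsc();
		memcpy(dst + doff, src + soff, size);
		mm += rdtsc() - t;

		t = rdtsc();
		__sse_memcpy(dst + doff, src + soff, size);
		xm += rdtsc() - t;
	}

	/* size, SSE avg, plain avg, speedup = MM/XM */
	printf("%zu %.3f %.3f %.10g\n", size,
	       (double)xm / ITER, (double)mm / ITER, (double)mm / xm);
	return 0;
}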
Benchmarking with 10000 iterations, average results (sizes in bytes; XM:
the SSE version, MM: plain memcpy; speedup = MM/XM):
size XM MM speedup
119 540.58 449.491 0.8314969419
189 296.318 263.507 0.8892692985
206 297.949 271.399 0.9108923485
224 255.565 235.38 0.9210161798
221 299.383 276.628 0.9239941159
245 299.806 279.432 0.9320430545
369 314.774 316.89 1.006721324
425 327.536 330.475 1.00897153
439 330.847 334.532 1.01113687
458 333.159 340.124 1.020904708
503 334.44 352.166 1.053003229
767 375.612 429.949 1.144661625
870 358.888 312.572 0.8709465025
882 394.297 454.977 1.153893229
925 403.82 472.56 1.170222413
1009 407.147 490.171 1.203915735
1525 512.059 660.133 1.289174911
1737 556.85 725.552 1.302958536
1778 533.839 711.59 1.332965994
1864 558.06 745.317 1.335549882
2039 585.915 813.806 1.388949687
3068 766.462 1105.56 1.442422252
3471 883.983 1239.99 1.40272883
3570 895.822 1266.74 1.414057295
3748 906.832 1302.4 1.436212771
4086 957.649 1486.93 1.552686041
6130 1238.45 1996.42 1.612023046
6961 1413.11 2201.55 1.557939181
7162 1385.5 2216.49 1.59977178
7499 1440.87 2330.12 1.617158856
8182 1610.74 2720.45 1.688950194
12273 2307.86 4042.88 1.751787902
13924 2431.8 4224.48 1.737184756
14335 2469.4 4218.82 1.708440514
15018 2675.67 1904.07 0.711622886
16374 2989.75 5296.26 1.771470902
24564 4262.15 7696.86 1.805863077
27852 4362.53 3347.72 0.7673805572
28672 5122.8 7113.14 1.388524413
30033 4874.62 8740.04 1.792967931
32768 6014.78 7564.2 1.257603505
49142 14464.2 21114.2 1.459757233
55702 16055 23496.8 1.463523623
57339 16725.7 24553.8 1.46803388
60073 17451.5 24407.3 1.398579162
Each size is tested with randomly generated misalignment to exercise the
implementation.
I've implemented the SSE memcpy similarly to arch/x86/lib/mmx_32.c and did
some kernel build measurements:
with SSE memcpy
===============
Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):
3301761.517649 task-clock # 24.001 CPUs utilized ( +- 1.48% )
520,658 context-switches # 0.000 M/sec ( +- 0.25% )
63,845 CPU-migrations # 0.000 M/sec ( +- 0.58% )
26,070,835 page-faults # 0.008 M/sec ( +- 0.00% )
1,812,482,599,021 cycles # 0.549 GHz ( +- 0.85% ) [64.55%]
551,783,051,492 stalled-cycles-frontend # 30.44% frontend cycles idle ( +- 0.98% ) [65.64%]
444,996,901,060 stalled-cycles-backend # 24.55% backend cycles idle ( +- 1.15% ) [67.16%]
1,488,917,931,766 instructions # 0.82 insns per cycle
# 0.37 stalled cycles per insn ( +- 0.91% ) [69.25%]
340,575,978,517 branches # 103.150 M/sec ( +- 0.99% ) [68.29%]
21,519,667,206 branch-misses # 6.32% of all branches ( +- 1.09% ) [65.11%]
137.567155255 seconds time elapsed ( +- 1.48% )
plain 3.0
=========
Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):
3504754.425527 task-clock # 24.001 CPUs utilized ( +- 1.31% )
518,139 context-switches # 0.000 M/sec ( +- 0.32% )
61,790 CPU-migrations # 0.000 M/sec ( +- 0.73% )
26,056,947 page-faults # 0.007 M/sec ( +- 0.00% )
1,826,757,751,616 cycles # 0.521 GHz ( +- 0.66% ) [63.86%]
557,800,617,954 stalled-cycles-frontend # 30.54% frontend cycles idle ( +- 0.79% ) [64.65%]
443,950,768,357 stalled-cycles-backend # 24.30% backend cycles idle ( +- 0.60% ) [67.07%]
1,469,707,613,500 instructions # 0.80 insns per cycle
# 0.38 stalled cycles per insn ( +- 0.68% ) [69.98%]
335,560,565,070 branches # 95.744 M/sec ( +- 0.67% ) [69.09%]
21,365,279,176 branch-misses # 6.37% of all branches ( +- 0.65% ) [65.36%]
146.025263276 seconds time elapsed ( +- 1.31% )
So, although a kernel build is probably not the proper workload for an
SSE memcpy routine, I'm seeing about 8.5 secs build time improvement, i.e.
something around 6%. We're executing slightly more instructions, but the
amount of data moved per instruction is higher due to the 16-byte XMM
moves.
Here's the SSE memcpy version I have so far. I haven't wired in the
proper CPU feature detection yet because we want to run more benchmarks
like netperf and such first, to see whether we get positive results there
too. The SYSTEM_RUNNING check takes care of early boot situations where
we already use memcpy but cannot yet handle FPU exceptions. There's an
aligned and an unaligned variant which should handle any buffer and size,
although I've set the SSE memcpy threshold at a buffer size of at least
512 bytes, to amortize the FPU context save/restore cost somewhat.
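For the eventual feature wiring, something along these lines could work
(an untested sketch, not part of the patch below; use_sse_memcpy is a
made-up helper name, and since SSE2 is architectural on x86-64,
cpu_has_xmm2 acts more as a policy knob than a real capability check):

#include <linux/kernel.h>
#include <linux/hardirq.h>
#include <asm/cpufeature.h>

static inline bool use_sse_memcpy(size_t len)
{
	/*
	 * Gate on the 512-byte threshold, the SSE2 feature bit and the
	 * same early-boot/IRQ conditions __sse_memcpy checks internally.
	 */
	return len >= 512 && cpu_has_xmm2 &&
	       system_state == SYSTEM_RUNNING && !in_interrupt();
}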
Comments are much appreciated! :-)
--
From 385519e844f3466f500774c2c37afe44691ef8d2 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <borislav.petkov@....com>
Date: Thu, 11 Aug 2011 18:43:08 +0200
Subject: [PATCH] SSE3 memcpy in C
Signed-off-by: Borislav Petkov <borislav.petkov@....com>
---
 arch/x86/include/asm/string_64.h |   14 ++++-
 arch/x86/lib/Makefile            |    2 +-
 arch/x86/lib/sse_memcpy_64.c     |  133 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 146 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/lib/sse_memcpy_64.c
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..7bd51bb 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
 #define __HAVE_ARCH_MEMCPY 1
 #ifndef CONFIG_KMEMCHECK
+extern void *__memcpy(void *to, const void *from, size_t len);
+extern void *__sse_memcpy(void *to, const void *from, size_t len);
 #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
-extern void *memcpy(void *to, const void *from, size_t len);
+#define memcpy(dst, src, len)					\
+({								\
+	size_t __len = (len);					\
+	void *__ret;						\
+	if (__len >= 512)					\
+		__ret = __sse_memcpy((dst), (src), __len);	\
+	else							\
+		__ret = __memcpy((dst), (src), __len);		\
+	__ret;							\
+})
 #else
-extern void *__memcpy(void *to, const void *from, size_t len);
 #define memcpy(dst, src, len)				\
 ({							\
 	size_t __len = (len);				\
 	void *__ret;					\
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index f2479f1..5f90709 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -36,7 +36,7 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y)
 endif
 	lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
 else
-	obj-y += iomap_copy_64.o
+	obj-y += iomap_copy_64.o sse_memcpy_64.o
 	lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
 	lib-y += thunk_64.o clear_page_64.o copy_page_64.o
 	lib-y += memmove_64.o memset_64.o
diff --git a/arch/x86/lib/sse_memcpy_64.c b/arch/x86/lib/sse_memcpy_64.c
new file mode 100644
index 0000000..b53fc31
--- /dev/null
+++ b/arch/x86/lib/sse_memcpy_64.c
@@ -0,0 +1,133 @@
+#include <linux/module.h>
+
+#include <asm/i387.h>
+#include <asm/string_64.h>
+
+void *__sse_memcpy(void *to, const void *from, size_t len)
+{
+	unsigned long src = (unsigned long)from;
+	unsigned long dst = (unsigned long)to;
+	void *p = to;
+	size_t i;
+
+	if (in_interrupt())
+		return __memcpy(to, from, len);
+
+	if (system_state != SYSTEM_RUNNING)
+		return __memcpy(to, from, len);
+
+	kernel_fpu_begin();
+
+	/* check alignment */
+	if ((src ^ dst) & 0xf)
+		goto unaligned;
+
+	if (src & 0xf) {
+		u8 chunk = 0x10 - (src & 0xf);
+
+		/* copy chunk up to the next 16-byte boundary */
+		__memcpy(to, from, chunk);
+		len -= chunk;
+		to += chunk;
+		from += chunk;
+	}
+
+	/*
+	 * copy in 256-byte chunks
+	 */
+	for (i = 0; i < (len & ~0xff); i += 256) {
+		asm volatile(
+			"movaps 0x0(%0), %%xmm0\n\t"
+			"movaps 0x10(%0), %%xmm1\n\t"
+			"movaps 0x20(%0), %%xmm2\n\t"
+			"movaps 0x30(%0), %%xmm3\n\t"
+			"movaps 0x40(%0), %%xmm4\n\t"
+			"movaps 0x50(%0), %%xmm5\n\t"
+			"movaps 0x60(%0), %%xmm6\n\t"
+			"movaps 0x70(%0), %%xmm7\n\t"
+			"movaps 0x80(%0), %%xmm8\n\t"
+			"movaps 0x90(%0), %%xmm9\n\t"
+			"movaps 0xa0(%0), %%xmm10\n\t"
+			"movaps 0xb0(%0), %%xmm11\n\t"
+			"movaps 0xc0(%0), %%xmm12\n\t"
+			"movaps 0xd0(%0), %%xmm13\n\t"
+			"movaps 0xe0(%0), %%xmm14\n\t"
+			"movaps 0xf0(%0), %%xmm15\n\t"
+
+			"movaps %%xmm0, 0x0(%1)\n\t"
+			"movaps %%xmm1, 0x10(%1)\n\t"
+			"movaps %%xmm2, 0x20(%1)\n\t"
+			"movaps %%xmm3, 0x30(%1)\n\t"
+			"movaps %%xmm4, 0x40(%1)\n\t"
+			"movaps %%xmm5, 0x50(%1)\n\t"
+			"movaps %%xmm6, 0x60(%1)\n\t"
+			"movaps %%xmm7, 0x70(%1)\n\t"
+			"movaps %%xmm8, 0x80(%1)\n\t"
+			"movaps %%xmm9, 0x90(%1)\n\t"
+			"movaps %%xmm10, 0xa0(%1)\n\t"
+			"movaps %%xmm11, 0xb0(%1)\n\t"
+			"movaps %%xmm12, 0xc0(%1)\n\t"
+			"movaps %%xmm13, 0xd0(%1)\n\t"
+			"movaps %%xmm14, 0xe0(%1)\n\t"
+			"movaps %%xmm15, 0xf0(%1)\n\t"
+			: : "r" (from), "r" (to) : "memory");
+
+		from += 256;
+		to += 256;
+	}
+
+	goto trailer;
+
+unaligned:
+	/*
+	 * copy in 256-byte chunks, unaligned
+	 */
+	for (i = 0; i < (len & ~0xff); i += 256) {
+		asm volatile(
+			"movups 0x0(%0), %%xmm0\n\t"
+			"movups 0x10(%0), %%xmm1\n\t"
+			"movups 0x20(%0), %%xmm2\n\t"
+			"movups 0x30(%0), %%xmm3\n\t"
+			"movups 0x40(%0), %%xmm4\n\t"
+			"movups 0x50(%0), %%xmm5\n\t"
+			"movups 0x60(%0), %%xmm6\n\t"
+			"movups 0x70(%0), %%xmm7\n\t"
+			"movups 0x80(%0), %%xmm8\n\t"
+			"movups 0x90(%0), %%xmm9\n\t"
+			"movups 0xa0(%0), %%xmm10\n\t"
+			"movups 0xb0(%0), %%xmm11\n\t"
+			"movups 0xc0(%0), %%xmm12\n\t"
+			"movups 0xd0(%0), %%xmm13\n\t"
+			"movups 0xe0(%0), %%xmm14\n\t"
+			"movups 0xf0(%0), %%xmm15\n\t"
+
+			"movups %%xmm0, 0x0(%1)\n\t"
+			"movups %%xmm1, 0x10(%1)\n\t"
+			"movups %%xmm2, 0x20(%1)\n\t"
+			"movups %%xmm3, 0x30(%1)\n\t"
+			"movups %%xmm4, 0x40(%1)\n\t"
+			"movups %%xmm5, 0x50(%1)\n\t"
+			"movups %%xmm6, 0x60(%1)\n\t"
+			"movups %%xmm7, 0x70(%1)\n\t"
+			"movups %%xmm8, 0x80(%1)\n\t"
+			"movups %%xmm9, 0x90(%1)\n\t"
+			"movups %%xmm10, 0xa0(%1)\n\t"
+			"movups %%xmm11, 0xb0(%1)\n\t"
+			"movups %%xmm12, 0xc0(%1)\n\t"
+			"movups %%xmm13, 0xd0(%1)\n\t"
+			"movups %%xmm14, 0xe0(%1)\n\t"
+			"movups %%xmm15, 0xf0(%1)\n\t"
+			: : "r" (from), "r" (to) : "memory");
+
+		from += 256;
+		to += 256;
+	}
+
+trailer:
+	__memcpy(to, from, len & 0xff);
+
+	kernel_fpu_end();
+
+	return p;
+}
+EXPORT_SYMBOL_GPL(__sse_memcpy);
--
1.7.6.134.gcf13f6
--
Regards/Gruss,
Boris.
View attachment "kernel_build.sizes" of type "text/plain" (926 bytes)