Message-ID: <20110816121604.GA29251@aftab>
Date:	Tue, 16 Aug 2011 14:16:04 +0200
From:	Borislav Petkov <bp@...64.org>
To:	"Valdis.Kletnieks@...edu" <Valdis.Kletnieks@...edu>
Cc:	Borislav Petkov <bp@...en8.de>, Ingo Molnar <mingo@...e.hu>,
	melwyn lobo <linux.melwyn@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: x86 memcpy performance
On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@...edu wrote:
> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
> 
> > Benchmarking with 10000 iterations, average results:
> > size    XM              MM              speedup
> > 119     540.58          449.491         0.8314969419
> 
> > 12273   2307.86         4042.88         1.751787902
> > 13924   2431.8          4224.48         1.737184756
> > 14335   2469.4          4218.82         1.708440514
> > 15018   2675.67         1904.07         0.711622886
> > 16374   2989.75         5296.26         1.771470902
> > 24564   4262.15         7696.86         1.805863077
> > 27852   4362.53         3347.72         0.7673805572
> > 28672   5122.8          7113.14         1.388524413
> > 30033   4874.62         8740.04         1.792967931
> 
> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
> really good about this till we understand what happened for those two cases.

Yep.

> Also, anytime I see "10000 iterations", I ask myself if the benchmark
> rigging took proper note of hot/cold cache issues. That *may* explain
> the two oddball results we see above - but not knowing more about how
> it was benched, it's hard to say.

Yeah, the more scrutiny this gets, the better. So I've cleaned up my
setup and attached it.

xm_mem.c does the benchmarking; bench_memcpy() in it calls sse_memcpy,
the SSE memcpy implementation written with inline asm.

It looks like gcc produces pretty crappy code here: if I replace the
sse_memcpy call with xm_memcpy() from xm_memcpy.S (the same function,
but in pure asm), I get much better numbers, sometimes even over 2x. It
all depends on the alignment of the buffers, though.

Also, those numbers don't include the context saving/restoring which the
kernel does for us.

size    XM              MM              speedup
7491    1509.89         2346.94         1.554378381
8170    2166.81         2857.78         1.318890326
12277   2659.03         4179.31         1.571744176
13907   2571.24         4125.7          1.604558427
14319   2638.74         5799.67         2.19789466      <----
14993   2752.42         4413.85         1.603625603
16371   3479.11         5562.65         1.59887055

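For anyone who wants to poke at the alignment dependence without
unpacking the tarball, here is a minimal user-space sketch. It is not
the code from the attachment: sketch_sse_copy() and sketch_bench() are
hypothetical names, the copy uses SSE2 intrinsics rather than inline asm,
and the timing is wall-clock nanoseconds with a warm cache, which may
differ from how the numbers above were measured.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical stand-in for an SSE copy; NOT the attached sse_memcpy. */
void *sketch_sse_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Move 16 bytes per iteration using unaligned loads and stores. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
    /* Copy the remaining tail one byte at a time. */
    while (n--)
        *d++ = *s++;
    return dst;
}

/*
 * Average nanoseconds per copy of `size` bytes over `iters` calls.
 * Buffers start at the given byte offsets (0..63) from a 64-byte-aligned
 * base, so alignment effects can be compared. Returns -1.0 on bad
 * arguments or allocation failure.
 */
double sketch_bench(copy_fn fn, size_t size, size_t src_off,
                    size_t dst_off, int iters)
{
    void *src_base = NULL, *dst_base = NULL;
    copy_fn volatile f = fn;  /* volatile: keep the calls from being optimized away */
    struct timespec t0, t1;
    unsigned char *src, *dst;
    double ns;
    size_t i;
    int it;

    if (src_off >= 64 || dst_off >= 64 || iters <= 0)
        return -1.0;
    if (posix_memalign(&src_base, 64, size + 64) != 0)
        return -1.0;
    if (posix_memalign(&dst_base, 64, size + 64) != 0) {
        free(src_base);
        return -1.0;
    }
    src = (unsigned char *)src_base + src_off;
    dst = (unsigned char *)dst_base + dst_off;
    for (i = 0; i < size; i++)
        src[i] = (unsigned char)(i * 31u);
    memset(dst, 0, size);

    f(dst, src, size);  /* warm-up call: the timed loop runs cache-hot */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (it = 0; it < iters; it++)
        f(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    assert(memcmp(dst, src, size) == 0);  /* sanity-check the copy */
    ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
       + (double)(t1.tv_nsec - t0.tv_nsec);
    free(src_base);
    free(dst_base);
    return ns / iters;
}
```

Calling sketch_bench(memcpy, size, so, do, 10000) and
sketch_bench(sketch_sse_copy, size, so, do, 10000) for a few offset pairs
and dividing the two results gives a speedup column in the spirit of the
table above. Because of the warm-up call this only covers the hot-cache
case; a cold-cache run would need the buffers evicted between iterations.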
So please take a look and let me know what you think.
Thanks.
-- 
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
Download attachment "sse_memcpy.tar.bz2" of type "application/octet-stream" (3508 bytes)