Message-ID: <20151019174744.GA2031@gmail.com>
Date: Mon, 19 Oct 2015 19:47:44 +0200
From: Ingo Molnar <mingo@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Arnaldo Carvalho de Melo <acme@...hat.com>,
Namhyung Kim <namhyung@...nel.org>,
David Ahern <dsahern@...il.com>, Jiri Olsa <jolsa@...hat.com>,
Hitoshi Mitake <mitake@....info.waseda.ac.jp>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH 02/14] perf/bench: Default to all routines in 'perf bench mem'
* Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> On Mon, Oct 19, 2015 at 1:04 AM, Ingo Molnar <mingo@...nel.org> wrote:
> >
> > triton:~> perf bench mem all
> > # Running mem/memcpy benchmark...
> > Routine default (Default memcpy() provided by glibc)
> > 4.957170 GB/Sec (with prefault)
> > Routine x86-64-unrolled (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
> > 4.379204 GB/Sec (with prefault)
> > Routine x86-64-movsq (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
> > 4.264465 GB/Sec (with prefault)
> > Routine x86-64-movsb (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
> > 6.554111 GB/Sec (with prefault)
>
> Is this skylake? And why are the numbers so low? Even on my laptop
> (Haswell), I get ~21GB/s (when setting cpufreq to performance).
No, this was on my desktop, which is a water cooled IvyBridge running at 3.6GHz:
processor : 11
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Core(TM) i7-4960X CPU @ 3.60GHz
stepping : 4
microcode : 0x416
cpu MHz : 1303.031
cache size : 15360 KB
and I didn't really think about the validity of the numbers when I made the
changelog, as I rarely benchmark on this box, due to it having various desktop
loads running all the time.
As you noticed, the results are highly variable with default settings:
triton:~/tip> taskset 1 perf stat --null --repeat 10 perf bench mem memcpy -f x86-64-movsb 2>&1 | grep GB
5.580357 GB/sec
5.580357 GB/sec
16.551907 GB/sec
16.551907 GB/sec
15.258789 GB/sec
16.837284 GB/sec
16.837284 GB/sec
16.837284 GB/sec
16.551907 GB/sec
16.837284 GB/sec
They get more reliable with '-l 10000' (10,000 loops instead of the default 1):
triton:~/tip> taskset 1 perf stat --null --repeat 10 perf bench mem memcpy -f x86-64-movsb -l 10000 2>&1 | grep GB
15.483591 GB/sec
16.975429 GB/sec
17.088396 GB/sec
20.920407 GB/sec
21.346655 GB/sec
21.322372 GB/sec
21.338306 GB/sec
21.342130 GB/sec
21.339984 GB/sec
21.373145 GB/sec
that's purely cached. Also note how after a few seconds it gets faster, due to
cpufreq as you suspected.
So once I fix the frequency of all cores to the max, I get much more reliable
results:
triton:~/tip> taskset 1 perf stat --null --repeat 10 perf bench mem memcpy -f x86-64-movsb -l 10000 2>&1 | grep -E 'GB|elaps'
21.356879 GB/sec
21.378526 GB/sec
21.351976 GB/sec
21.375203 GB/sec
21.369824 GB/sec
21.353236 GB/sec
21.283708 GB/sec
21.380679 GB/sec
21.347915 GB/sec
21.378572 GB/sec
0.459286278 seconds time elapsed ( +- 0.04% )
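To put a number on "much more reliable", the GB/sec figures from the first run (default settings) and the last run (fixed frequency, '-l 10000') can be compared directly. The following is only an illustrative sketch: the helper names `parse_gbps` and `relative_spread` are made up for this example and are not part of perf, and "(max - min) / mean" is just one simple way to express run-to-run variation.

```python
import re
import statistics

# Matches figures such as "16.551907 GB/sec" (also the older "GB/Sec" spelling).
GBPS_RE = re.compile(r"([0-9]+\.[0-9]+)\s+GB/sec", re.IGNORECASE)


def parse_gbps(output):
    """Extract every 'N GB/sec' figure from 'perf bench mem' output text."""
    return [float(value) for value in GBPS_RE.findall(output)]


def relative_spread(samples):
    """(max - min) / mean: a crude measure of run-to-run variation."""
    return (max(samples) - min(samples)) / statistics.mean(samples)


# Figures copied from the runs quoted in this mail.
default_run = """
5.580357 GB/sec
5.580357 GB/sec
16.551907 GB/sec
16.551907 GB/sec
15.258789 GB/sec
16.837284 GB/sec
16.837284 GB/sec
16.837284 GB/sec
16.551907 GB/sec
16.837284 GB/sec
"""

fixed_freq_run = """
21.356879 GB/sec
21.378526 GB/sec
21.351976 GB/sec
21.375203 GB/sec
21.369824 GB/sec
21.353236 GB/sec
21.283708 GB/sec
21.380679 GB/sec
21.347915 GB/sec
21.378572 GB/sec
"""

for name, text in (("default", default_run), ("fixed freq", fixed_freq_run)):
    samples = parse_gbps(text)
    print(f"{name}: mean {statistics.mean(samples):.2f} GB/sec, "
          f"spread {relative_spread(samples):.1%}")
```

The same helpers could be pointed at the output of any of the 'perf stat --repeat' invocations above, since they only look for the "GB/sec" lines.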
I'll add a debug check to 'perf bench' to warn about systems that have variable
cpufreq running - this is too easy a mistake to make :-/
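A rough sketch of what such a check could look like, written as a standalone script rather than as the actual 'perf bench' change: it reads the standard cpufreq sysfs file `scaling_governor` for each CPU and flags anything that is not set to 'performance'. The function names are hypothetical, and treating "governor != performance" as "variable cpufreq" is an assumption of this sketch; it will not catch every source of frequency variation (turbo, thermal throttling).

```python
import glob
import os
import sys


def cpus_with_variable_freq(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for CPUs whose governor is not 'performance'."""
    flagged = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        # Path layout: <root>/cpuN/cpufreq/scaling_governor
        cpu = path.split(os.sep)[-3]
        try:
            with open(path) as f:
                governor = f.read().strip()
        except OSError:
            continue  # CPU offline or file unreadable: skip it
        if governor != "performance":
            flagged[cpu] = governor
    return flagged


def warn_if_variable_freq(sysfs_root="/sys/devices/system/cpu"):
    """Print a warning to stderr if any CPU may change frequency mid-run."""
    flagged = cpus_with_variable_freq(sysfs_root)
    if flagged:
        details = ", ".join(f"{cpu}={gov}" for cpu, gov in flagged.items())
        print(f"warning: variable cpufreq governor in use ({details}); "
              "benchmark results may be unreliable", file=sys.stderr)
    return bool(flagged)
```

Taking the sysfs root as a parameter keeps the check easy to exercise against a fake directory tree; on a system without cpufreq support the glob simply matches nothing and no warning is printed.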
So with the benchmark stabilized, I get the following results:
triton:~/tip> taskset 1 perf bench mem memcpy -f all -l 10000
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...
18.356783 GB/sec
# function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
16.294889 GB/sec
# function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
15.760032 GB/sec
# function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
21.145818 GB/sec
which matches your observations:
> It's interesting that 'movsb' for you is so much better. It's been
> promising before, and it *should* be able to do better than manual
> copying, but it's not been that noticeable on the machines I've
> tested. But I haven't used Skylake or Broadwell yet.
>
> cpufreq might be making a difference too. Maybe it's just ramping up
> the CPU? Or is that really repeatable?
So modulo the cpufreq multiplier it seems repeatable on this IB system - will try
it on SkyLake as well.
Before relying on it I also wanted to implement the following 'perf bench'
improvements:
 - make it more representative of kernel usage by benchmarking a list of
   characteristic lengths, not just the single stupid 1MB buffer. At smaller
   buffer sizes I'd expect MOVSB to have even more of a fundamental advantage
   (due to having all the differentiation in hardware) - but we don't know the
   latencies of those cases, some of which are in microcode I suspect.

 - measure aligned/unaligned buffer address and length effects as well

 - measure cache-cold numbers as well. This is pretty hard but not impossible.
With that we could start validating our fundamental memory op routines in
user-space.
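The first two items (several lengths, aligned and unaligned addresses) amount to sweeping a small matrix of parameters. The sketch below shows only the shape of such a sweep in user space, under several assumptions: it uses Python's `ctypes.memmove` (the libc routine) rather than the kernel's memcpy variants, the sizes and offsets are arbitrary placeholders, GB is taken as 10^9 bytes, and the function names are invented for this example. Interpreter call overhead dominates at small sizes, so the absolute numbers say little about the routines themselves; cache-cold measurement is not attempted.

```python
import ctypes
import time


def bench_memmove(size, src_off=0, dst_off=0, loops=1000):
    """Copy `size` bytes `loops` times and return throughput in GB/sec."""
    # Over-allocate so the offsets can shift the start addresses.
    src = ctypes.create_string_buffer(size + src_off)
    dst = ctypes.create_string_buffer(size + dst_off)
    src_addr = ctypes.addressof(src) + src_off
    dst_addr = ctypes.addressof(dst) + dst_off

    # Touch both buffers once before timing (similar in spirit to prefault).
    ctypes.memmove(dst_addr, src_addr, size)

    start = time.perf_counter()
    for _ in range(loops):
        ctypes.memmove(dst_addr, src_addr, size)
    elapsed = time.perf_counter() - start
    return size * loops / elapsed / 1e9


def sweep(sizes=(64, 4096, 1 << 20), offsets=(0, 1), loops=1000):
    """Run every size x source offset x destination offset combination."""
    results = []
    for size in sizes:
        for src_off in offsets:
            for dst_off in offsets:
                gbps = bench_memmove(size, src_off, dst_off, loops)
                results.append((size, src_off, dst_off, gbps))
    return results


def report(results):
    """Print one line per combination in a perf-bench-like style."""
    for size, src_off, dst_off, gbps in results:
        print(f"{size:>8} bytes  src+{src_off} dst+{dst_off}  {gbps:10.6f} GB/sec")
```

Calling `report(sweep())` prints one line per combination; as with the runs above, it should be pinned with taskset and run with the CPU frequency fixed for the numbers to be comparable.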
Thanks,
Ingo