Message-Id: <cover.1301324270.git.luto@mit.edu>
Date: Mon, 28 Mar 2011 11:06:40 -0400
From: Andy Lutomirski <luto@....EDU>
To: x86@...nel.org
Cc: linux-kernel@...r.kernel.org, John Stultz <johnstul@...ibm.com>,
Thomas Gleixner <tglx@...utronix.de>,
Andy Lutomirski <luto@....edu>
Subject: [PATCH 0/6] x86-64: Micro-optimize vclock_gettime
This series speeds up vclock_gettime(CLOCK_MONOTONIC) by almost 30%
(tested on Sandy Bridge). The patches are ordered in roughly decreasing
order of improvement.
These are meant for 2.6.40, but if anyone wants to take some of them
for 2.6.39 I won't object.
The changes and timings (fastest of 20 trials of 100M iters on Sandy
Bridge) are:
Unpatched:
CLOCK_MONOTONIC: 22.09ns
CLOCK_REALTIME_COARSE: 4.23ns
CLOCK_MONOTONIC_COARSE: 5.65ns
x86-64: Optimize vread_tsc's barriers
This replaces lfence;rdtsc;lfence with a faster sequence with similar
ordering guarantees.
CLOCK_MONOTONIC: 18.28ns
CLOCK_REALTIME_COARSE: 4.23ns
CLOCK_MONOTONIC_COARSE: 5.98ns
x86-64: Don't generate cmov in vread_tsc
GCC likes to generate a cmov on a branch that's almost completely
predictable. Force it to generate a real branch instead.
CLOCK_MONOTONIC: 16.30ns
CLOCK_REALTIME_COARSE: 4.23ns
CLOCK_MONOTONIC_COARSE: 5.95ns
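For illustration only, here is one way to steer GCC toward a real branch on a
highly predictable comparison. This is a sketch, not necessarily what the patch
does: clamp_to_last is a hypothetical helper name, and the combination of
__builtin_expect with an empty asm statement on the rare path is an assumption
about how the effect might be achieved. The compiler is still free to choose
either form, so the generated code should be checked with objdump.

```c
#include <stdint.h>

/* Hypothetical helper (not the kernel's vread_tsc): return a counter
 * reading that never goes backwards relative to the last recorded value. */
static inline uint64_t clamp_to_last(uint64_t now, uint64_t last)
{
	/* Tell the compiler the common case is now >= last. */
	if (__builtin_expect(now >= last, 1))
		return now;

	/* An empty asm statement is opaque to the optimizer. Placing it on
	 * the rare path discourages GCC from merging both paths into a
	 * single conditional move. */
	__asm__ volatile ("");
	return last;
}
```

The result is always max(now, last); only the instruction selection differs.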
x86-64: Put vsyscall_gtod_data at a fixed virtual address
Because vsyscall_gtod_data's address isn't known until load time, the
code contains unnecessary address calculations. Hardcode it. This is
a nice speedup for the _COARSE variants as well.
CLOCK_MONOTONIC: 16.12ns
CLOCK_REALTIME_COARSE: 3.70ns
CLOCK_MONOTONIC_COARSE: 5.31ns
x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
vset_normalize_timespec was more general than necessary. Open-code
the appropriate normalization loops. This is a big win for
CLOCK_MONOTONIC_COARSE.
CLOCK_MONOTONIC: 16.09ns
CLOCK_REALTIME_COARSE: 3.70ns
CLOCK_MONOTONIC_COARSE: 4.49ns
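As a rough sketch of what an open-coded normalization loop can look like when
nsec is known to be non-negative: the struct ts_sketch type and the
normalize_nonneg name below are hypothetical and are not taken from the patch.
Only the overflow direction needs handling, so the underflow loop of a general
normalizer disappears.

```c
#include <assert.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-in for a timespec-like pair of fields. */
struct ts_sketch {
	long long sec;
	long long nsec;
};

/* Normalize ts so that 0 <= nsec < NSEC_PER_SEC, assuming nsec >= 0 on
 * entry. No loop is needed for the nsec < 0 case. */
static void normalize_nonneg(struct ts_sketch *ts)
{
	assert(ts->nsec >= 0);	/* precondition assumed by this sketch */

	while (ts->nsec >= NSEC_PER_SEC) {
		ts->nsec -= NSEC_PER_SEC;
		ts->sec++;
	}
}
```

In the common case the loop body runs zero or one times, so the whole
normalization is a compare and an occasional subtract.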
x86-64: Omit frame pointers on vread_tsc
This is a bit silly and needs work for gcc < 4.4 (if we even care),
but, rather surprisingly, it's 0.3ns faster. I guess that the CPU's
stack frame optimizations aren't quite as good as I thought.
CLOCK_MONOTONIC: 15.79ns
CLOCK_REALTIME_COARSE: 3.70ns
CLOCK_MONOTONIC_COARSE: 4.50ns
x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO
We're building the vDSO with some optimizations disabled by flags that
were meant for kernel code. Override that, except for
-fno-omit-frame-pointer, which might make userspace debugging harder.
CLOCK_MONOTONIC: 15.66ns
CLOCK_REALTIME_COARSE: 3.44ns
CLOCK_MONOTONIC_COARSE: 4.23ns
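The "fastest of N trials" methodology used for the timings above could be
reproduced from userspace roughly as follows. This is a minimal sketch, not the
harness that produced the numbers in this message: time_one_trial and
fastest_of are hypothetical names, and the calls go through the C library's
clock_gettime wrapper, so the results include any wrapper overhead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Average cost, in nanoseconds, of one clock_gettime(clk) call measured
 * over iters back-to-back calls. */
static double time_one_trial(clockid_t clk, long iters)
{
	struct timespec start, end, scratch;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < iters; i++)
		clock_gettime(clk, &scratch);
	clock_gettime(CLOCK_MONOTONIC, &end);

	double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
			  + (double)(end.tv_nsec - start.tv_nsec);
	return elapsed_ns / (double)iters;
}

/* Run several trials and keep the fastest one, which filters out trials
 * disturbed by interrupts or scheduling. */
static double fastest_of(clockid_t clk, int trials, long iters)
{
	assert(trials > 0 && iters > 0);

	double best = time_one_trial(clk, iters);
	for (int t = 1; t < trials; t++) {
		double ns = time_one_trial(clk, iters);
		if (ns < best)
			best = ns;
	}
	return best;
}
```

Calling fastest_of(CLOCK_MONOTONIC, 20, 100000000L) would mirror the trial and
iteration counts quoted above. The _COARSE clock IDs are Linux-specific and may
need additional feature-test macros.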
Andy Lutomirski (6):
x86-64: Optimize vread_tsc's barriers
x86-64: Don't generate cmov in vread_tsc
x86-64: Put vsyscall_gtod_data at a fixed virtual address
x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
x86-64: Omit frame pointers on vread_tsc
x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO
arch/x86/kernel/tsc.c | 48 ++++++++++++++++++++++++++++++++-------
arch/x86/kernel/vmlinux.lds.S | 13 +++++-----
arch/x86/vdso/Makefile | 15 +++++++++++-
arch/x86/vdso/vclock_gettime.c | 40 ++++++++++++++++++---------------
arch/x86/vdso/vextern.h | 9 ++++++-
5 files changed, 90 insertions(+), 35 deletions(-)
--
1.7.4