Message-ID: <20150225122701.GK5029@twins.programming.kicks-ass.net>
Date: Wed, 25 Feb 2015 13:27:01 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Andi Kleen <ak@...ux.intel.com>
Cc: Andi Kleen <andi@...stfloor.org>, x86@...nel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/3] x86: Move msr accesses out of line
On Mon, Feb 23, 2015 at 09:43:40AM -0800, Andi Kleen wrote:
> On Mon, Feb 23, 2015 at 06:04:36PM +0100, Peter Zijlstra wrote:
> > On Fri, Feb 20, 2015 at 05:38:55PM -0800, Andi Kleen wrote:
> >
> > > This patch moves the MSR functions out of line. A MSR access is typically
> > > 40-100 cycles or even slower, a call is a few cycles at best, so the
> > > additional function call is not really significant.
> >
> > If I look at the below PDF a CALL+PUSH EBP+MOV RSP,RBP+ ... +POP+RET
> > ends up being 5+1.5+0.5+ .. + 1.5+8 = 16.5 + .. cycles.
>
> You cannot just add up the latency cycles. The CPU runs all of this
> in parallel.
>
> Latency cycles would only be interesting if these instructions were
> on the critical path for computing the result, which they are not.
>
> It should be a few cycles overhead.

I thought that since CALL touches RSP, PUSH touches RSP, MOV RSP,
(obviously) touches RSP, POP touches RSP and, well, RET does too, there
were strong dependencies between the instructions and there would be
little room to parallelize things.

I'm glad you so patiently educated me on the wonders of modern
architectures and how they can indeed do all this in parallel.

Still, I wondered, so I ran a little test. Note that I used a
serializing instruction (LOCK XCHG) because WRMSR is serializing too.

I see a ~14 cycle difference between the inline and noinline versions.

If I substitute XADD for the LOCK XCHG, the difference drops to 1.5
cycles, so clearly there is some magic happening, but serializing
instructions wreck it.

Can anybody explain how such RSP deps get magicked away?

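For reference, the XADD variant mentioned above might look like the
sketch below. The message does not show it, so the function name
xadd(), the absence of a LOCK prefix, and the operand constraints are
all assumptions modelled on the xchg() helper in call.c. The non-x86
branch is only a stand-in so the sketch compiles elsewhere.

```c
/*
 * Hypothetical XADD counterpart of xchg() from call.c; it does not
 * appear in the original message. Assumes GCC-style inline asm.
 * No LOCK prefix is used (an assumption), so this is not a locked
 * operation.
 */
int xadd(int *ptr, int val)
{
#if defined(__x86_64__) || defined(__i386__)
	/* XADD stores *ptr + val in *ptr and leaves the old *ptr in val. */
	__asm__ __volatile__ ("xaddl %0, %1\n"
			      : "+r" (val), "+m" (*ptr)
			      : : "memory", "cc");
	return val;
#else
	/* Plain C stand-in for non-x86 builds (assumption, not XADD). */
	int old = *ptr;
	*ptr = old + val;
	return old;
#endif
}
```

Swapping this in for xchg() in the loop, with the same
noinline/__always_inline switch, would presumably reproduce the
comparison, though the exact code used for the 1.5 cycle figure is not
known.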
---
root@...-ep:~# cat call.c
#define __always_inline inline __attribute__((always_inline))
#define noinline __attribute__((noinline))

static int
#ifdef FOO
noinline
#else
__always_inline
#endif
xchg(int *ptr, int val)
{
	asm volatile ("LOCK xchgl %0, %1\n"
		      : "+r" (val), "+m" (*(ptr))
		      : : "memory", "cc");
	return val;
}

void main(void)
{
	int val = 0, old;

	for (int i = 0; i < 1000000000; i++)
		old = xchg(&val, i);
}
root@...-ep:~# gcc -std=gnu99 -O3 -fno-omit-frame-pointer -DFOO -o call call.c
root@...-ep:~# objdump -D call | awk '/<[^>]*>:/ {p=0} /<main>:/ {p=1} /<xchg>:/ {p=1} { if (p) print $0 }'
00000000004003e0 <main>:
4003e0: 55 push %rbp
4003e1: 48 89 e5 mov %rsp,%rbp
4003e4: 53 push %rbx
4003e5: 31 db xor %ebx,%ebx
4003e7: 48 83 ec 18 sub $0x18,%rsp
4003eb: c7 45 e0 00 00 00 00 movl $0x0,-0x20(%rbp)
4003f2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
4003f8: 48 8d 7d e0 lea -0x20(%rbp),%rdi
4003fc: 89 de mov %ebx,%esi
4003fe: 83 c3 01 add $0x1,%ebx
400401: e8 fa 00 00 00 callq 400500 <xchg>
400406: 81 fb 00 ca 9a 3b cmp $0x3b9aca00,%ebx
40040c: 75 ea jne 4003f8 <main+0x18>
40040e: 48 83 c4 18 add $0x18,%rsp
400412: 5b pop %rbx
400413: 5d pop %rbp
400414: c3 retq
0000000000400500 <xchg>:
400500: 55 push %rbp
400501: 89 f0 mov %esi,%eax
400503: 48 89 e5 mov %rsp,%rbp
400506: f0 87 07 lock xchg %eax,(%rdi)
400509: 5d pop %rbp
40050a: c3 retq
40050b: 90 nop
40050c: 90 nop
40050d: 90 nop
40050e: 90 nop
40050f: 90 nop
root@...-ep:~# gcc -std=gnu99 -O3 -fno-omit-frame-pointer -o call-inline call.c
root@...-ep:~# objdump -D call-inline | awk '/<[^>]*>:/ {p=0} /<main>:/ {p=1} /<xchg>:/ {p=1} { if (p) print $0 }'
00000000004003e0 <main>:
4003e0: 55 push %rbp
4003e1: 31 c0 xor %eax,%eax
4003e3: 48 89 e5 mov %rsp,%rbp
4003e6: c7 45 f0 00 00 00 00 movl $0x0,-0x10(%rbp)
4003ed: 0f 1f 00 nopl (%rax)
4003f0: 89 c2 mov %eax,%edx
4003f2: f0 87 55 f0 lock xchg %edx,-0x10(%rbp)
4003f6: 83 c0 01 add $0x1,%eax
4003f9: 3d 00 ca 9a 3b cmp $0x3b9aca00,%eax
4003fe: 75 f0 jne 4003f0 <main+0x10>
400400: 5d pop %rbp
400401: c3 retq
root@...-ep:~# perf stat -e "cycles:u" ./call
Performance counter stats for './call':
36,309,274,162 cycles:u
10.561819310 seconds time elapsed
root@...-ep:~# perf stat -e "cycles:u" ./call-inline
Performance counter stats for './call-inline':
22,004,045,745 cycles:u
6.498271508 seconds time elapsed
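
The ~14 cycle figure follows from the two perf counts above divided by
the 1000000000 loop iterations. A minimal sketch of that arithmetic is
below. The macro and function names are made up for illustration, and
the result also includes the (negligible) process startup cycles that
perf counts.

```c
/* Cycle counts copied from the two perf stat runs above. */
#define CYCLES_NOINLINE 36309274162ULL	/* ./call        (-DFOO) */
#define CYCLES_INLINE   22004045745ULL	/* ./call-inline         */
#define ITERATIONS      1000000000ULL	/* loop bound in call.c  */

/* Average cycles spent per loop iteration. */
double cycles_per_iteration(unsigned long long cycles,
			    unsigned long long iterations)
{
	return (double)cycles / (double)iterations;
}

/* Extra cycles per iteration attributable to the out-of-line call. */
double call_overhead_cycles(void)
{
	return cycles_per_iteration(CYCLES_NOINLINE - CYCLES_INLINE,
				    ITERATIONS);
}
```

That is roughly 36.3 cycles per iteration out of line against 22.0
inline, a difference of about 14.3.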