Message-ID: <49E8D91F.1060005@goop.org>
Date: Fri, 17 Apr 2009 12:31:43 -0700
From: Jeremy Fitzhardinge <jeremy@...p.org>
To: Ingo Molnar <mingo@...e.hu>
CC: Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>,
Steven Rostedt <rostedt@...dmis.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Jeremy Fitzhardinge <jeremy.fitzhardinge@...rix.com>
Subject: Re: [PATCH 1/4] tracing: move __DO_TRACE out of line
Ingo Molnar wrote:
> I meant to suggest to Jeremy to measure the effect of this
> out-of-lining, in terms of instruction count in the hotpath.
>
OK, here's a comparison of the inline vs. out-of-line tracing functions
for trace_sched_switch, with CONFIG_PREEMPT enabled:
The inline __DO_TRACE version of trace_sched_switch inserts 20
instructions, assembling to 114 bytes of code in the hot path:
cmpl $0, __tracepoint_sched_switch+8(%rip) #, __tracepoint_sched_switch.state
je .L1582 #,
movq %gs:per_cpu__kernel_stack,%rax # per_cpu__kernel_stack, ret__
incl -8124(%rax) # <variable>.preempt_count
movq __tracepoint_sched_switch+16(%rip), %r12 #, it_func
testq %r12, %r12 # it_func
.L1603:
je .L1583 #,
movq -136(%rbp), %rdx # next,
movq -144(%rbp), %rsi # prev,
movq %rbx, %rdi # rq,
call *(%r12) #* it_func
addq $8, %r12 #, it_func
cmpq $0, (%r12) #,* it_func
jmp .L1603 #
.L1583:
movq %gs:per_cpu__kernel_stack,%rax # per_cpu__kernel_stack, ret__
decl -8124(%rax) # <variable>.preempt_count
movq %gs:per_cpu__kernel_stack,%rax # per_cpu__kernel_stack, ret__
testb $8, -8136(%rax) #,
je .L1582 #,
call preempt_schedule #
.L1582:
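
For reference, that corresponds roughly to the pre-patch DECLARE_TRACE
expansion (a sketch from memory of the tracepoint.h of this era, not
verbatim, so details may differ slightly):

	static inline void trace_sched_switch(struct rq *rq,
					      struct task_struct *prev,
					      struct task_struct *next)
	{
		if (unlikely(__tracepoint_sched_switch.state)) {
			void **it_func;

			/* disables preemption: the incl of preempt_count above */
			rcu_read_lock_sched_notrace();
			it_func = rcu_dereference(__tracepoint_sched_switch.funcs);
			if (it_func) {
				do {
					((void (*)(struct rq *,
						   struct task_struct *,
						   struct task_struct *))
					 (*it_func))(rq, prev, next);
				} while (*++it_func);
			}
			/* re-enables preemption: the decl + preempt_schedule above */
			rcu_read_unlock_sched_notrace();
		}
	}

All of that gets expanded at every callsite.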
Taking __do_trace_sched_switch out of line inserts this into the hot
path (6 instructions, 31 bytes):
cmpl $0, __tracepoint_sched_switch+8(%rip) #, __tracepoint_sched_switch.state
je .L1748 #,
movq -136(%rbp), %rdx # next,
movq -144(%rbp), %rsi # prev,
movq %rbx, %rdi # rq,
call __do_trace_sched_switch #
.L1748:
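
In C terms, the callsite presumably reduces to something like this (a
sketch; I'm paraphrasing the patch's expansion rather than quoting it):

	static inline void trace_sched_switch(struct rq *rq,
					      struct task_struct *prev,
					      struct task_struct *next)
	{
		if (unlikely(__tracepoint_sched_switch.state))
			__do_trace_sched_switch(rq, prev, next);
	}

which is exactly the cmp/je/call sequence above.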
__do_trace_sched_switch is a fair bit larger, mostly due to the
function preamble (frame setup and register save/restore), and some
unfortunate and unnecessary register thrashing (why not keep
rdi,rsi,rdx where they are?).  But it isn't that much larger than the
inline version: 34 instructions, 118 bytes.  This code will also be
shared among all callsites of the tracepoint (not that it matters
here, since sched_switch has only one, but other tracepoints have
multiple users).
__do_trace_sched_switch:
pushq %rbp #
movq %rsp, %rbp #,
pushq %r14 #
movq %rdi, %r14 # rq, rq
pushq %r13 #
movq %rsi, %r13 # prev, prev
pushq %r12 #
movq %rdx, %r12 # next, next
pushq %rbx #
movq %gs:per_cpu__kernel_stack,%rax # per_cpu__kernel_stack, ret__
incl -8124(%rax) # <variable>.preempt_count
movq __tracepoint_sched_switch+16(%rip), %rax #, _________p1
testq %rax, %rax # _________p1
je .L2403 #,
movq %rax, %rbx # _________p1, it_func
.L2404:
movq %r12, %rdx # next,
movq %r13, %rsi # prev,
movq %r14, %rdi # rq,
call *(%rbx) #* it_func
addq $8, %rbx #, it_func
cmpq $0, (%rbx) #,* it_func
jne .L2404 #,
.L2403:
movq %gs:per_cpu__kernel_stack,%rax # per_cpu__kernel_stack, ret__
decl -8124(%rax) # <variable>.preempt_count
movq %gs:per_cpu__kernel_stack,%rax # per_cpu__kernel_stack, ret__
testb $8, -8136(%rax) #,
je .L2406 #,
call preempt_schedule #
.L2406:
popq %rbx #
popq %r12 #
popq %r13 #
popq %r14 #
leave
ret
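
That body is presumably just the old __DO_TRACE loop wrapped in a real
function, something like (a sketch, my guess at the patch's expansion):

	void __do_trace_sched_switch(struct rq *rq,
				     struct task_struct *prev,
				     struct task_struct *next)
	{
		__DO_TRACE(&__tracepoint_sched_switch,
			   TP_PROTO(struct rq *rq, struct task_struct *prev,
				    struct task_struct *next),
			   TP_ARGS(rq, prev, next));
	}

The register thrashing visible above comes from gcc spilling
rdi/rsi/rdx into callee-saved r14/r13/r12 so the arguments survive the
preempt-count manipulation, then moving them back for each callback.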
So, conclusion: putting the tracepoint code out of line significantly
reduces the hot-path code size at each tracepoint (114 bytes down to 31
in this case, i.e. 27% of the original size).  This should reduce the
overhead of having tracing configured but not enabled.  The saving
won't be as large for tracepoints with fewer arguments or without
CONFIG_PREEMPT, but I chose this example because it is realistic and
undeniably a hot path.  And with pvops tracing adding 80 new events
with hundreds of callsites around the kernel, this will really add up.
The tradeoff is that the actual tracing function is a little larger,
but not dramatically so.  I would expect some performance hit when the
tracepoint is actually enabled.  This may be mitigated by increased
icache hits when a tracepoint has multiple sites.
(BTW, I realized that we don't need to pass &__tracepoint_FOO to
__do_trace_FOO(), since it's always going to be the same; this
simplifies the calling convention at the callsite, and it also makes
void tracepoints work again.)
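
I.e. instead of something like this (signatures illustrative, not
quoted from the patch):

	void __do_trace_sched_switch(struct tracepoint *tp, struct rq *rq,
				     struct task_struct *prev,
				     struct task_struct *next);

	/* at the callsite: */
	__do_trace_sched_switch(&__tracepoint_sched_switch, rq, prev, next);

the out-of-line body can just name __tracepoint_sched_switch directly:

	void __do_trace_sched_switch(struct rq *rq,
				     struct task_struct *prev,
				     struct task_struct *next);

	/* at the callsite: */
	__do_trace_sched_switch(rq, prev, next);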
J