Date:	Fri, 17 Apr 2009 12:31:43 -0700
From:	Jeremy Fitzhardinge <jeremy@...p.org>
To:	Ingo Molnar <mingo@...e.hu>
CC:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>,
	Steven Rostedt <rostedt@...dmis.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Jeremy Fitzhardinge <jeremy.fitzhardinge@...rix.com>
Subject: Re: [PATCH 1/4] tracing: move __DO_TRACE out of line

Ingo Molnar wrote:
> I meant to suggest to Jeremy to measure the effect of this 
> out-of-lining, in terms of instruction count in the hotpath.
>   

OK, here's a comparison for trace_sched_switch, inline vs. out-of-line 
tracing functions, with CONFIG_PREEMPT enabled:

The inline __DO_TRACE version of trace_sched_switch inserts 20 
instructions, assembling to 114 bytes of code in the hot path:

        cmpl    $0, __tracepoint_sched_switch+8(%rip)   #, __tracepoint_sched_switch.state
        je      .L1582  #,
        movq %gs:per_cpu__kernel_stack,%rax     # per_cpu__kernel_stack, ret__
        incl    -8124(%rax)     # <variable>.preempt_count
        movq    __tracepoint_sched_switch+16(%rip), %r12        #, it_func
        testq   %r12, %r12      # it_func
.L1603:
        je      .L1583  #,
        movq    -136(%rbp), %rdx        # next,
        movq    -144(%rbp), %rsi        # prev,
        movq    %rbx, %rdi      # rq,
        call    *(%r12) #* it_func
        addq    $8, %r12        #, it_func
        cmpq    $0, (%r12)      #,* it_func
        jmp     .L1603  #
.L1583:
        movq %gs:per_cpu__kernel_stack,%rax     # per_cpu__kernel_stack, ret__
        decl    -8124(%rax)     # <variable>.preempt_count
        movq %gs:per_cpu__kernel_stack,%rax     # per_cpu__kernel_stack, ret__
        testb   $8, -8136(%rax) #,
        je      .L1582  #,
        call    preempt_schedule        #
.L1582:
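
In C terms, that expansion corresponds roughly to the following (a sketch 
reconstructed from the asm rather than the literal __DO_TRACE macro text; 
the exact struct field names are from memory, and the preempt_count 
manipulation presumably comes from the rcu_read_lock_sched() in __DO_TRACE):

        if (__tracepoint_sched_switch.state) {
                void **it_func;

                preempt_disable();      /* the incl of preempt_count */
                it_func = __tracepoint_sched_switch.funcs;
                if (it_func) {
                        do {
                                /* call each registered probe with (rq, prev, next) */
                                ((void (*)(struct rq *, struct task_struct *,
                                           struct task_struct *))*it_func)(rq, prev, next);
                        } while (*++it_func);
                }
                preempt_enable();       /* decl, then maybe preempt_schedule */
        }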


Taking __do_trace_sched_switch out of line inserts this into the hot 
path (6 instructions, 31 bytes):

        cmpl    $0, __tracepoint_sched_switch+8(%rip)   #, __tracepoint_sched_switch.state
        je      .L1748  #,
        movq    -136(%rbp), %rdx        # next,
        movq    -144(%rbp), %rsi        # prev,
        movq    %rbx, %rdi      # rq,
        call    __do_trace_sched_switch #
.L1748:
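
In C, the hot-path callsite is now essentially just (again a sketch):

        if (__tracepoint_sched_switch.state)
                __do_trace_sched_switch(rq, prev, next);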


__do_trace_sched_switch is a fair bit larger, mostly due to the function 
preamble (frame setup and register save/restore) and some unfortunate and 
unnecessary register thrashing (why not keep rdi,rsi,rdx where they 
are?).  But it isn't that much larger than the inline version: 34 
instructions, 118 bytes.  This code will also be shared among all 
instances of the tracepoint (not in this case, because sched_switch has 
only a single callsite, but other tracepoints have multiple users).

__do_trace_sched_switch:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,
        pushq   %r14    #
        movq    %rdi, %r14      # rq, rq
        pushq   %r13    #
        movq    %rsi, %r13      # prev, prev
        pushq   %r12    #
        movq    %rdx, %r12      # next, next
        pushq   %rbx    #
        movq %gs:per_cpu__kernel_stack,%rax     # per_cpu__kernel_stack, ret__
        incl    -8124(%rax)     # <variable>.preempt_count
        movq    __tracepoint_sched_switch+16(%rip), %rax        #, _________p1
        testq   %rax, %rax      # _________p1
        je      .L2403  #,
        movq    %rax, %rbx      # _________p1, it_func
.L2404:
        movq    %r12, %rdx      # next,
        movq    %r13, %rsi      # prev,
        movq    %r14, %rdi      # rq,
        call    *(%rbx) #* it_func
        addq    $8, %rbx        #, it_func
        cmpq    $0, (%rbx)      #,* it_func
        jne     .L2404  #,
.L2403:
        movq %gs:per_cpu__kernel_stack,%rax     # per_cpu__kernel_stack, ret__
        decl    -8124(%rax)     # <variable>.preempt_count
        movq %gs:per_cpu__kernel_stack,%rax     # per_cpu__kernel_stack, ret__
        testb   $8, -8136(%rax) #,
        je      .L2406  #,
        call    preempt_schedule        #
.L2406:
        popq    %rbx    #
        popq    %r12    #
        popq    %r13    #
        popq    %r14    #
        leave
        ret


So, conclusion: putting the tracepoint code out of line significantly 
reduces the hot-path code size at each tracepoint (114 bytes down to 31 
in this case, 27% of the original size).  This should reduce the overhead 
of having tracing configured but not enabled.  The saving won't be as 
large for tracepoints with fewer arguments or without CONFIG_PREEMPT, but 
I chose this example because it is realistic and undeniably a hot path.  
And when doing pvops tracing, with 80 new events and hundreds of 
callsites around the kernel, this is really going to add up.

The tradeoff is that the actual tracing function is a little larger, but 
not dramatically so.  I would expect some performance hit when the 
tracepoint is actually enabled.  This may be mitigated by increased 
icache hits when a tracepoint has multiple sites.

(BTW, I realized that we don't need to pass &__tracepoint_FOO to 
__do_trace_FOO(), since it's always going to be the same; this simplifies 
the calling convention at the callsite, and it also makes void 
tracepoints work again.)
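
(To illustrate with a made-up zero-argument tracepoint "foo", the callsite 
can simply be

        if (__tracepoint_foo.state)
                __do_trace_foo();

with __do_trace_foo() referencing __tracepoint_foo directly, instead of 
having to receive &__tracepoint_foo as an extra argument.)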

    J