[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191022220259.70f7306b@gandalf.local.home>
Date: Tue, 22 Oct 2019 22:02:59 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>,
Nadav Amit <nadav.amit@...il.com>,
Andy Lutomirski <luto@...nel.org>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Song Liu <songliubraving@...com>,
Masami Hiramatsu <mhiramat@...nel.org>
Subject: Re: [PATCH 3/3] x86/ftrace: Use text_poke()
On Tue, 22 Oct 2019 18:17:40 -0400
Steven Rostedt <rostedt@...dmis.org> wrote:
> > your solution is to reduce the overhead.
> > my solution is to remove it competely. See the difference?
>
> You're just trimming it down. I'm curious to what overhead you save by
> not saving all parameter registers, and doing a case by case basis?
I played with adding a ftrace_ip_caller, that only modifies the ip, and
passes a regs with bare minimum saved (allowing the callback to modify
the regs->ip).
I modified the test-events-sample to just hijack the do_raw_spin_lock()
function, which is a very hot path, as it is called by most of the
spin_lock code.
I then ran:
# perf stat -r 20 /work/c/hackbench 50
vanilla, as nothing is happening.
I then enabled the test-events-sample with the new ftrace ip
modification only.
I then had it do the hijack of do_raw_spin_lock() with the current
ftrace_regs_caller that kprobes and live kernel patching use.
The improvement was quite noticeable.
Baseline: (no tracing)
# perf stat -r 20 /work/c/hackbench 50
Time: 1.146
Time: 1.120
Time: 1.102
Time: 1.099
Time: 1.136
Time: 1.123
Time: 1.128
Time: 1.115
Time: 1.111
Time: 1.192
Time: 1.160
Time: 1.156
Time: 1.135
Time: 1.144
Time: 1.150
Time: 1.120
Time: 1.121
Time: 1.120
Time: 1.106
Time: 1.127
Performance counter stats for '/work/c/hackbench 50' (20 runs):
9,056.18 msec task-clock # 6.770 CPUs utilized ( +- 0.26% )
148,461 context-switches # 0.016 M/sec ( +- 2.58% )
18,397 cpu-migrations # 0.002 M/sec ( +- 3.24% )
54,029 page-faults # 0.006 M/sec ( +- 0.63% )
33,227,808,738 cycles # 3.669 GHz ( +- 0.25% )
23,314,273,335 stalled-cycles-frontend # 70.16% frontend cycles idle ( +- 0.25% )
23,568,794,330 instructions # 0.71 insn per cycle
# 0.99 stalled cycles per insn ( +- 0.26% )
3,756,521,438 branches # 414.802 M/sec ( +- 0.29% )
36,949,690 branch-misses # 0.98% of all branches ( +- 0.46% )
1.3377 +- 0.0213 seconds time elapsed ( +- 1.59% )
Using IPMODIFY without regs:
# perf stat -r 20 /work/c/hackbench 50
Time: 1.193
Time: 1.115
Time: 1.145
Time: 1.129
Time: 1.121
Time: 1.132
Time: 1.136
Time: 1.156
Time: 1.133
Time: 1.131
Time: 1.138
Time: 1.150
Time: 1.147
Time: 1.137
Time: 1.178
Time: 1.121
Time: 1.142
Time: 1.158
Time: 1.143
Time: 1.168
Performance counter stats for '/work/c/hackbench 50' (20 runs):
9,231.39 msec task-clock # 6.917 CPUs utilized ( +- 0.39% )
136,822 context-switches # 0.015 M/sec ( +- 3.06% )
17,308 cpu-migrations # 0.002 M/sec ( +- 2.61% )
54,521 page-faults # 0.006 M/sec ( +- 0.57% )
33,875,937,399 cycles # 3.670 GHz ( +- 0.39% )
23,575,136,580 stalled-cycles-frontend # 69.59% frontend cycles idle ( +- 0.41% )
24,246,404,007 instructions # 0.72 insn per cycle
# 0.97 stalled cycles per insn ( +- 0.33% )
3,878,453,510 branches # 420.138 M/sec ( +- 0.36% )
47,653,911 branch-misses # 1.23% of all branches ( +- 0.43% )
1.33462 +- 0.00440 seconds time elapsed ( +- 0.33% )
The current ftrace_regs_caller: (old way of doing it)
# perf stat -r 20 /work/c/hackbench 50
Time: 1.114
Time: 1.153
Time: 1.236
Time: 1.208
Time: 1.179
Time: 1.183
Time: 1.217
Time: 1.190
Time: 1.157
Time: 1.172
Time: 1.215
Time: 1.165
Time: 1.171
Time: 1.188
Time: 1.176
Time: 1.217
Time: 1.341
Time: 1.238
Time: 1.363
Time: 1.287
Performance counter stats for '/work/c/hackbench 50' (20 runs):
9,522.76 msec task-clock # 6.653 CPUs utilized ( +- 0.36% )
131,347 context-switches # 0.014 M/sec ( +- 3.37% )
17,090 cpu-migrations # 0.002 M/sec ( +- 3.56% )
54,126 page-faults # 0.006 M/sec ( +- 0.71% )
34,942,273,861 cycles # 3.669 GHz ( +- 0.35% )
24,517,757,272 stalled-cycles-frontend # 70.17% frontend cycles idle ( +- 0.35% )
24,684,124,265 instructions # 0.71 insn per cycle
# 0.99 stalled cycles per insn ( +- 0.34% )
3,859,734,617 branches # 405.317 M/sec ( +- 0.38% )
47,286,857 branch-misses # 1.23% of all branches ( +- 0.70% )
1.4313 +- 0.0244 seconds time elapsed ( +- 1.70% )
In summary we have:
Baseline: (no tracing)
33,227,808,738 cycles # 3.669 GHz ( +- 0.25% )
1.3377 +- 0.0213 seconds time elapsed ( +- 1.59% )
Just ip modification:
33,875,937,399 cycles # 3.670 GHz ( +- 0.39% )
1.33462 +- 0.00440 seconds time elapsed ( +- 0.33% )
Full regs as done today:
34,942,273,861 cycles # 3.669 GHz ( +- 0.35% )
1.4313 +- 0.0244 seconds time elapsed ( +- 1.70% )
Note, the hijack function is done very naively, the function trampoline
is called twice.
do_raw_spin_lock()
call ftrace_ip_caller
call my_highjack_func
test pip (true)
call my_do_spin_lock()
call do_raw_spin_lock()
call ftrace_ip_caller
call my_highjack_func
test pip (false)
We could work on ways to not do this double call, but I just wanted to
get some kind of test case out.
This method could at least help with live kernel patching.
Attached is the patch I used. To switch back to full regs, just
remove the comma and uncomment the SAVE_REGS flag in the
trace-events-samples.c file.
-- Steve
View attachment "test.patch" of type "text/x-patch" (15761 bytes)
Powered by blists - more mailing lists