Message-ID: <1359494770.17639.76.camel@gandalf.local.home>
Date: Tue, 29 Jan 2013 16:26:10 -0500
From: Steven Rostedt <rostedt@...dmis.org>
To: Frederic Weisbecker <fweisbec@...il.com>
Cc: Ingo Molnar <mingo@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Li Zhong <zhong@...ux.vnet.ibm.com>,
Namhyung Kim <namhyung.kim@....com>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Paul Gortmaker <paul.gortmaker@...driver.com>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Sedat Dilek <sedat.dilek@...il.com>,
Gleb Natapov <gleb@...hat.com>,
Marcelo Tosatti <mtosatti@...hat.com>,
Tony Luck <tony.luck@...el.com>,
Fenghua Yu <fenghua.yu@...el.com>
Subject: Re: [GIT PULL] cputime: Full dynticks task/cputime accounting v7
On Mon, 2013-01-28 at 21:36 -0500, Steven Rostedt wrote:
> [root@...est ~]# time /work/c/kernelspin 10
>
> real 0m10.001s
> user 0m1.114s
> sys 0m8.886s
>
>
> But after enabling the force context_tracking I get this:
>
> [root@...est ~]# time /work/c/kernelspin 10
>
> real 0m10.001s
> user 0m3.769s
> sys 0m6.203s
>
>
> Consistently. All the other runs always had a 1.X 8.X ratio. This seems
> to give 3.7x 6.2x ratio consistently. Maybe it's more accurate than the
> others. Or something else is broken.
>
> I'll look deeper into it.
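For reference, kernelspin is basically a tight loop hammering a trivial
syscall for however many seconds you ask for. The original isn't
included here; a minimal equivalent (my reconstruction, not the actual
program) looks something like this:

#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, char **argv)
{
	int secs = argc > 1 ? atoi(argv[1]) : 10;
	time_t end = time(NULL) + secs;

	while (time(NULL) < end) {
		int i;

		/* brk(0) changes nothing, so each iteration is just a
		 * quick round trip into the kernel and back. */
		for (i = 0; i < 10000; i++)
			syscall(SYS_brk, 0);
	}
	return 0;
}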
After lots of investigation, I believe I know why this is so. It is
pretty much as accurate as what we have now, except that before it
would err on the side of the kernel, whereas here we err on the side
of userspace.
I added a trace_printk() where the utime and stime are updated, and
applied my (coming to mainline soon) function_graph patch that adds a
"max_depth" option to limit the function graph tracing. Setting the
max_depth to 1 nicely shows all the places that enter the kernel.
Running this against my kernelspin.c program I get:
kernelsp-2668 7d... 2837.236400: funcgraph_entry: 1.373 us | do_page_fault();
kernelsp-2668 7d... 2837.236403: funcgraph_entry: 1.409 us | do_page_fault();
kernelsp-2668 7.... 2837.236405: funcgraph_entry: 0.441 us | syscall_trace_enter();
kernelsp-2668 7.... 2837.236406: funcgraph_entry: 0.114 us | sys_brk();
kernelsp-2668 7.... 2837.236406: funcgraph_entry: 1.155 us | syscall_trace_leave();
kernelsp-2668 7d... 2837.236408: funcgraph_entry: 2.579 us | do_page_fault();
kernelsp-2668 7d... 2837.236413: funcgraph_entry: 2.720 us | do_page_fault();
kernelsp-2668 7d... 2837.236417: funcgraph_entry: 2.314 us | do_page_fault();
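For completeness, the trace_printk() side of the instrumentation was
nothing fancy. Roughly this (a sketch, not the exact patch, and exactly
what gets printed doesn't matter for the argument):

/*
 * One trace_printk() at the top of each accounting function in
 * kernel/sched/cputime.c; the rest of the bodies are unchanged.
 */
void account_user_time(struct task_struct *p, cputime_t cputime,
		       cputime_t cputime_scaled)
{
	trace_printk("user: %llu\n", (unsigned long long)cputime);
	/* ... existing user time accounting ... */
}

void account_system_time(struct task_struct *p, int hardirq_offset,
			 cputime_t cputime, cputime_t cputime_scaled)
{
	trace_printk("system: %llu\n", (unsigned long long)cputime);
	/* ... existing system time accounting ... */
}

Those are what show up as the "bprint: account_user_time" and
"bprint: account_system_time" lines in the full trace further down.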
I then wrote a perl script to parse the timestamps (it treats
syscall_trace_enter and syscall_trace_leave as the kernel entry and
exit points) and compute a more accurate breakdown of how much time
the program spent in user space and in the kernel. It outputs
something like:
user time: 0.750846
system time: 5.695917
dropped event time: 3.525461
calc utime: 1.878059 ns
calc stime: 4.520833 ns
The first two times are what it came up with from the timestamps.
Since events were dropped, it was also able to calculate how much time
the missing events covered.
As I used trace_printk() to show the updates of utime and stime, the
last two numbers are the times those accounting functions themselves
came up with. Note that the dropped events would include both the
function tracing events and the updates of these times.
It's not really that far off.
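The script itself isn't worth posting, but the bookkeeping it does
boils down to something like the following (sketched in C here rather
than the actual perl; the struct, the hard-coded events and the
entry/exit classification are simplifications for illustration):

/*
 * The real script walks the funcgraph_entry lines (do_page_fault(),
 * sys_*(), syscall_trace_enter()/syscall_trace_leave()) and derives
 * the kernel entry and exit points from the timestamps and durations.
 */
#include <stdio.h>

enum ev_type { ENTER_KERNEL, EXIT_KERNEL };

struct event {
	double		ts;	/* trace timestamp, in seconds */
	enum ev_type	type;
};

static void account(const struct event *ev, int nr,
		    double *user, double *sys)
{
	double last = ev[0].ts;
	int in_kernel = 0;
	int i;

	*user = *sys = 0.0;

	for (i = 0; i < nr; i++) {
		/* Charge the time since the previous event to whichever
		 * side we were on before this event. */
		if (in_kernel)
			*sys += ev[i].ts - last;
		else
			*user += ev[i].ts - last;
		last = ev[i].ts;
		in_kernel = (ev[i].type == ENTER_KERNEL);
	}
}

int main(void)
{
	/* A tiny made-up slice of a trace, just to exercise the logic. */
	struct event ev[] = {
		{ 2837.236400, ENTER_KERNEL },
		{ 2837.236402, EXIT_KERNEL  },
		{ 2837.236405, ENTER_KERNEL },
		{ 2837.236407, EXIT_KERNEL  },
	};
	double user, sys;

	account(ev, (int)(sizeof(ev) / sizeof(ev[0])), &user, &sys);
	printf("user time: %f\nsystem time: %f\n", user, sys);
	return 0;
}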
Then you gave me your config where you could not reproduce the issue. I
ran that too, and got:
user time: 0.000009
system time: 9.974731
dropped event time: 0.000000
calc utime: 0.772511 ns
calc stime: 9.201983 ns
It still erred on the side of userspace, but what I found strange was
that:
1) there were no dropped events
2) the utime/stime derived from the trace timestamps was much
   different from what I got with my config
Looking at the two outputs of the trace I found:
my config:
kernelsp-2668 7.... 2837.239219: funcgraph_entry: 1.369 us | sys_brk();
kernelsp-2668 7.... 2837.239225: funcgraph_entry: 1.790 us | sys_brk();
kernelsp-2668 7.... 2837.239229: funcgraph_entry: 0.778 us | sys_brk();
kernelsp-2668 7.... 2837.239232: funcgraph_entry: 0.896 us | sys_brk();
your config:
kernelsp-969 6.... 207.618010: funcgraph_entry: + 11.123 us | sys_brk();
kernelsp-969 6.... 207.618024: funcgraph_entry: 7.519 us | sys_brk();
kernelsp-969 6.... 207.618033: funcgraph_entry: 7.616 us | sys_brk();
kernelsp-969 6.... 207.618043: funcgraph_entry: 7.420 us | sys_brk();
Big difference in the time spent in sys_brk()!
I did a full function graph trace and found:
kernelspin-1015 [006] 533.405921: funcgraph_entry: | sys_brk() {
kernelspin-1015 [006] 533.405921: funcgraph_entry: | down_write() {
kernelspin-1015 [006] 533.405921: funcgraph_entry: | __might_sleep() {
kernelspin-1015 [006] 533.405922: funcgraph_entry: 0.051 us | debug_lockdep_rcu_enabled();
<idle>-0 [003] 533.405922: bprint: account_system_time : system: 2368
kernelspin-1015 [006] 533.405922: funcgraph_entry: 0.044 us | debug_lockdep_rcu_enabled();
kernelspin-1015 [006] 533.405923: funcgraph_entry: 0.051 us | debug_lockdep_rcu_enabled();
kernelspin-1015 [006] 533.405923: funcgraph_exit: 1.440 us | }
kernelspin-1015 [006] 533.405923: funcgraph_exit: 2.015 us | }
kernelspin-1015 [006] 533.405924: funcgraph_entry: 0.113 us | find_vma();
[...]
kernelspin-1015 [006] 533.405938: funcgraph_entry: | set_track() {
kernelspin-1015 [006] 533.405939: funcgraph_entry: | save_stack_trace() {
trace-cmd-1010 [005] 533.405939: bprint: account_system_time : system: 3492
kernelspin-1015 [006] 533.405939: funcgraph_entry: | dump_trace() {
kernelspin-1015 [006] 533.405939: funcgraph_entry: | print_context_stack() {
kernelspin-1015 [006] 533.405939: funcgraph_entry: | __kernel_text_address() {
kernelspin-1015 [006] 533.405939: funcgraph_entry: | is_module_text_address() {
kernelspin-1015 [006] 533.405940: funcgraph_entry: | __module_text_address() {
trace-cmd-1010 [005] 533.405940: bprint: account_user_time : user: 1014
kernelspin-1015 [006] 533.405940: funcgraph_entry: 0.041 us | __module_address();
kernelspin-1015 [006] 533.405940: funcgraph_exit: 0.464 us | }
kernelspin-1015 [006] 533.405940: funcgraph_exit: 0.849 us | }
kernelspin-1015 [006] 533.405941: funcgraph_exit: 1.231 us | }
kernelspin-1015 [006] 533.405941: funcgraph_entry: | __kernel_text_address() {
kernelspin-1015 [006] 533.405941: funcgraph_entry: | is_module_text_address() {
kernelspin-1015 [006] 533.405941: funcgraph_entry: | __module_text_address() {
kernelspin-1015 [006] 533.405941: funcgraph_entry: 0.046 us | __module_address();
kernelspin-1015 [006] 533.405942: funcgraph_exit: 0.442 us | }
kernelspin-1015 [006] 533.405942: funcgraph_exit: 0.851 us | }
kernelspin-1015 [006] 533.405942: funcgraph_exit: 1.254 us | }
kernelspin-1015 [006] 533.405942: funcgraph_entry: 0.049 us | __kernel_text_address();
kernelspin-1015 [006] 533.405943: funcgraph_entry: | save_stack_address() {
kernelspin-1015 [006] 533.405943: funcgraph_entry: 0.042 us | __save_stack_address();
kernelspin-1015 [006] 533.405944: funcgraph_exit: 0.447 us | }
<idle>-0 [005] 533.405944: bprint: account_system_time : system: 3993
kernelspin-1015 [006] 533.405944: funcgraph_entry: | save_stack_address() {
kernelspin-1015 [006] 533.405944: funcgraph_entry: 0.055 us | __save_stack_address();
kernelspin-1015 [006] 533.405944: funcgraph_exit: 0.459 us | }
And it goes on and on!
Sure enough, a big difference between my config and yours was not only
that I had slab and you had slub; you also had lockdep, slub debugging
and a whole host of other things enabled in your config that slow the
kernel down.
I turned off all that in your config, recompiled, rebooted, and now I
get this:
[root@...est ~]# time /work/c/kernelspin 10
real 0m10.001s
user 0m4.336s
sys 0m5.636s
Thus, because sys_brk() is so fast, the program literally does millions
of them. And we suffer a small error toward the user side every time we
do the calculation:
[user mode]
<enter kernel>   <------------+
                              |--- error
calculate utime  <------------+
[system mode]
calculate stime  <------------+
                              |--- error
<enter user>     <------------+
That is, the time between actually entering the kernel and the point
where we calculate utime, and the time between calculating stime and
actually returning to user space, is time spent in the kernel that gets
charged to user time. That small error happens on every kernel entry
and exit, and there's no way around it.
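To get a feel for the magnitude, here's a back-of-envelope sketch (the
per-entry error and syscall count below are assumptions for
illustration, not measurements):

#include <stdio.h>

int main(void)
{
	/* Assumed: each kernel entry/exit pair misattributes a few
	 * hundred nanoseconds to the user side, and the loop manages a
	 * few million syscalls over the 10 second run. */
	double error_ns_per_entry = 300.0;
	double syscalls = 5e6;

	printf("time shifted to user: %.2f s\n",
	       error_ns_per_entry * syscalls / 1e9);
	return 0;
}

Even with those modest numbers, that's 1.5 seconds of kernel time
showing up as user time over the run.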
Because this was an unrealistic program that did sys_brk() millions of
times (and that's all it did), it exaggerated the error. By adding the
debug options and making each sys_brk() several times slower, we also
made the percentage error several times smaller. It even allowed us to
avoid losing events.
Thus, I think we can blow off this concern as just an expected error of
the dynamic calculation process. And it will only show up when a CPU is
in no-hz mode, which shouldn't be a big deal anyway, as that mode is
usually used by processes that don't want to go into the kernel in the
first place, which keeps the error down.
But this error should be documented somewhere. Maybe in the Kconfig.
You can have my Acked-by back :-)
Acked-by: Steven Rostedt <rostedt@...dmis.org>
-- Steve