Message-ID: <20101020153500.GA5387@nowhere>
Date: Wed, 20 Oct 2010 17:35:04 +0200
From: Frederic Weisbecker <fweisbec@...il.com>
To: "Frank Ch. Eigler" <fche@...hat.com>
Cc: LKML <linux-kernel@...r.kernel.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Arnaldo Carvalho de Melo <acme@...hat.com>,
Paul Mackerras <paulus@...ba.org>,
Stephane Eranian <eranian@...gle.com>,
Cyrill Gorcunov <gorcunov@...nvz.org>,
Tom Zanussi <tzanussi@...il.com>,
Masami Hiramatsu <mhiramat@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Robert Richter <robert.richter@....com>
Subject: Re: [RFC] perf: Dwarf cfi based user callchains

On Wed, Oct 13, 2010 at 11:13:27AM -0400, Frank Ch. Eigler wrote:
>
> Frederic Weisbecker <fweisbec@...il.com> writes:
>
> > [...]
> > This brings dwarf cfi based callchain for userspace apps that don't have
> > frame pointers.
>
> Interesting approach!
>
> > [...]
> > - it's slow. A first improvement to make it faster is to support binary
> > search from .eh_frame_hdr.
>
> In systemtap land, we did find a dramatic improvement from this too.

Yeah, walking .eh_frame linearly is only really suitable for testing
and debugging.
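
For reference, the key point is that the search table at the end of
.eh_frame_hdr is sorted by initial location, so finding the FDE that
covers a given address becomes a plain binary search. A minimal
sketch of that lookup, assuming the common DW_EH_PE_datarel |
DW_EH_PE_sdata4 table encoding (a real unwinder must honor the
encoding bytes in the header, and the names below are illustrative):

#include <stdint.h>
#include <stddef.h>

struct table_entry {
	int32_t start_ip;	/* function start, relative to the hdr base */
	int32_t fde_offset;	/* its FDE, relative to the hdr base */
};

/*
 * hdr points at the mapped .eh_frame_hdr, entries at its search
 * table, fde_count comes from the header.  Returns the entry that
 * covers ip, or NULL if ip lies below the first entry.
 */
static const struct table_entry *
find_fde(const uint8_t *hdr, const struct table_entry *entries,
	 size_t fde_count, uint64_t ip)
{
	int64_t rel_ip = (int64_t)(ip - (uint64_t)(uintptr_t)hdr);
	size_t lo = 0, hi = fde_count;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if ((int64_t)entries[mid].start_ip <= rel_ip)
			lo = mid + 1;
		else
			hi = mid;
	}
	/* lo is the first entry past ip; the previous one covers it. */
	return lo ? &entries[lo - 1] : NULL;
}
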
> Have you measured the cost of transcribing of potentially large chunks
> of the user stacks? We did not seriously evaluate this path, since we
> encounter megabyte+ stacks in larger userspace apps, and copying THAT
> out seemed absurd.

Actually that's quite affordable. And you don't need to dump the
whole stack of a process on every sample, that would indeed be
absurd. A small chunk, starting from the stack pointer, is enough.

What is interesting is that you can play with several dump sizes:
the larger the dump, the deeper you can unwind, but that also means
more profiling overhead.
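
To make "a small chunk" concrete: the capture is conceptually just a
bounded copy starting at the user stack pointer. A rough sketch of
the idea, with a made-up helper name (and note that since samples can
fire in NMI context, the real code would need the *_inatomic copy
variants rather than plain copy_from_user):

#include <linux/uaccess.h>	/* copy_from_user() */
#include <asm/ptrace.h>		/* struct pt_regs, user_stack_pointer() */

/* Hypothetical helper: snapshot 'size' bytes of user stack into 'buf'. */
static int perf_sample_dump_ustack(struct pt_regs *regs,
				   void *buf, unsigned int size)
{
	unsigned long sp = user_stack_pointer(regs);

	/*
	 * Copy at most 'size' bytes starting at the user stack pointer.
	 * The dwarf unwinder later replays the CFI rules against this
	 * snapshot instead of touching live user memory.
	 */
	if (copy_from_user(buf, (const void __user *)sp, size))
		return -EFAULT;

	return 0;
}
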
I can't measure any significant throughput impact with hackbench,
for example:

Normal run:
$ time ./hackbench 10
Time: 3.415
real 0m3.506s
user 0m0.257s
sys 0m6.519s

With perf record (the default is the cycles event at a 1000 Hz sample frequency):
$ time ./perf record ./hackbench 10
Time: 3.584
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.381 MB perf.data (~16654 samples) ]
real 0m3.748s
user 0m0.028s
sys 0m0.022s

With perf record + frame pointer based callchain capture:
$ time ./perf record -g fp ./hackbench 10
Time: 3.666
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 1.426 MB perf.data (~62281 samples) ]
real 0m3.834s
user 0m0.033s
sys 0m0.046s

With perf record + 4096 bytes of stack dump in every sample:
$ time ./perf record -g dwarf,4096 ./hackbench 10
Time: 3.697
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 5.156 MB perf.data (~225251 samples) ]
real 0m3.931s
user 0m0.026s
sys 0m0.135s

With perf record + 20000 bytes of stack dump in every sample:
$ time ./perf record -g dwarf,20000 ./hackbench 10
Time: 3.847
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 13.349 MB perf.data (~583219 samples) ]
real 0m4.559s
user 0m0.028s
sys 0m0.329s

So there is no big difference. Maybe hackbench is not a good example
to highlight the possible impact, though, and some tuning with higher
sample frequencies would make the difference more visible.
Perhaps the real impact is more on the amount of data to record: the
perf.data file obviously tends to grow quickly.
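
Back-of-the-envelope, at the default 1000 Hz with a 4096 byte dump,
each busy CPU produces roughly

	1000 samples/sec * 4096 bytes/sample ~= 4 MB/sec

of raw stack snapshots alone, before the sample headers and the
register dump, so the file grows linearly with both the dump size and
the sample frequency.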