lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 11 Jun 2012 17:44:16 -0400
From:	Benjamin Redelings <benjamin.redelings@...cent.org>
To:	linux-kernel@...r.kernel.org
Subject: Re: [RFCv5 00/19] perf: Add backtrace post dwarf unwind

Hi,

     I want to say that these patches have been quite useful to me as a 
non-kernel-developer doing ordinary CPU profiling on x86_64, and I do 
hope they enter mainline soon.  I've been able to track down a ton of 
CPU time wasted in library functions like _int_malloc and strcmp_ssse3 
using preliminary versions of the patch (v4 and v5) .  Since symbols 
like _int_malloc are in system libraries like libc-2.13.so  or 
libstdc++.so.6.0.17, I couldn't tell what functions were responsible for 
calling them.  (I am not ready to recompile the C library with frame 
pointers just to get accurate profiling information!)  The post-hoc 
DWARF unwinder makes things "just work" without recompiling all system 
libraries, and I'm already relying on it to find further speedups.


     Now for some details.

1. [FIXED in v5] The 'perf script' output now gets the DSO for for each 
address in the backtrace correct, instead of assuming that all stack 
addresses have the same DSO as the stack top.

2. The 'perf report' command is still much slower when the samples are 
taken with dwarf than when they are taken using frame pointers.  
Specifically, it took about 80 seconds with DWARF, and less than 1 
second with fp.  However, I appreciate that these are run-time options, 
so that people who want the speed can use fp instead of DWARF.  Is there 
much chances of this post-processing getting faster?

3. The main (only?) symbol that didn't get a backtrace was __times in 
libc6.  I'm not clear why this would be different.  For example:

bali-phy 25064 10751.879731: cycles:
         ffffffff810136b4 __cycles_2_ns ([kernel.kallsyms])
             7f2b8c30332a __times (/lib/x86_64-linux-gnu/libc-2.13.so)
                       1c [unknown] ([unknown])

...

bali-phy 25064 10751.935645: cycles:
         ffffffff8103eb33 jiffies_to_clock_t ([kernel.kallsyms])
             7f2b8c30332a __times (/lib/x86_64-linux-gnu/libc-2.13.so)
             7fff0baf0540 [unknown] ([unknown])

...

bali-phy 25064 10752.059581: cycles:
         ffffffff8103eb33 jiffies_to_clock_t ([kernel.kallsyms])
             7f2b8c30332a __times (/lib/x86_64-linux-gnu/libc-2.13.so)
             7fff0baefd60 [unknown] ([unknown])

I know this is being called by my program.  From gdb:

#0  __times (buf=0x7fffffffc710) at ../sysdeps/unix/sysv/linux/times.c:28
#1  0x00000000005d3f85 in boost::chrono::process_user_cpu_clock::now ()
     at 
../../../../../master/boost/include/boost/chrono/detail/inlined/posix/process_cpu_clocks.hpp:124
#2  0x000000000053e8e3 in total_cpu_time ()
     at ../../../master/src/timer_stack.C:38

4.  Hmm... running the same profiling command with "perf  record -g fp 
<command>" appears to not record backtraces at all, even though my 
program was compiled with frame pointers.  Perhaps this part broke?

-BenRI

P.S. This is using the latest git version of libunwind.  I applied v5 of 
Jiri's perf-unwind patches against Linus's currently tip.

P.P.S. I've attached a profile graph from gprof2dot.py 
(http://code.google.com/p/jrfonseca/wiki/Gprof2Dot).  Now that I can get 
backtraces from library symbols, 97% of cpu is spent in main and 
children of main, versus about 80% before.  (I'm not sure we should 
expect 100% even if all backtraces work perfectly, since functions with 
very few samples may be ignored.)

On 06/11/2012 09:19 AM, Jiri Olsa wrote:
> hi,
> besides fixing several issues, going back to the original design
> because the last one was considered too generic.. now we have:
>
>   sample_regs_user  - != 0 triggers the user level regs dump
>   sample_stack_user - != 0 triggers the user stack dump
>
> We can allway extend this in future with new mask and flags
> for IRQ/PEBS regs.
>
> patches available also as tarball in here:
> http://people.redhat.com/~jolsa/perf_post_unwind_v5.tar.bz2
>
> v5 changes:
>     patch 1/19 - having just one enum set of the perf registers
>     patch 2/19 - using for_each_set_bit for scanning the mask
>                - single regs enum for both 32 and 64 bits versions
>                - using regs mask != 0 trigger to trigger the regs dump
>     patch 5/19 - adding perf_output_skip so we can skip undumped part of the stack in RB
>     patch 6/19 - using stack size != 0 trigger to trigger the stack dump
>                - do not zero the memory for non retrieved part of the stack dump
>     patch 7/19 - adding exclude_callchain_kernel attribute
>     patch 8/19 - this could be taken without the rest of the series
>
> v4 changes:
>     - no real change from v3, just rebase
>     - v3 patch 06/17 got already merged
>
> v3 changes:
>     patch 01/17
>     - added HAVE_PERF_REGS config option
>     patch 02/17, 04/17
>     - regs and stack perf interface is more general now
>     patch 06/17
>     - unrelated online fix for i386 compilation
>     patch 16/17
>     - few namespace fixies
>
> ---
> Adding the post unwinding user stack backtrace using dwarf unwind
> via libunwind. The original work was done by Frederic. I mostly took
> his patches and make them compile in current kernel code plus I added
> some stuff here and there.
>
> The main idea is to store user registers and portion of user
> stack when the sample data during the record phase. Then during
> the report, when the data is presented, perform the actual dwarf
> dwarf unwind.
>
> attached patches:
>    01/19 perf: Unified API to record selective sets of arch registers
>    02/19 perf: Add ability to attach user level registers dump to sample
>    03/19 perf, x86: Add copy_from_user_nmi_nochk for best effort copy
>    04/19 perf: Factor __output_copy to be usable with specific copy function
>    05/19 perf: Add perf_output_skip function to skip bytes in sample
>    06/19 perf: Add ability to attach user stack dump to sample
>    07/19 perf: Add attribute to filter out callchains
>    08/19 perf, tool: Remove unsused evsel parameter from machine__resolve_callchain
>    09/19 perf, tool: Factor DSO symtab types to generic binary types
>    10/19 perf, tool: Add interface to read DSO image data
>    11/19 perf, tool: Add '.note' check into search for NOTE section
>    12/19 perf, tool: Back [vdso] DSO with real data
>    13/19 perf, tool: Add interface to arch registers sets
>    14/19 perf, tool: Add libunwind dependency for dwarf cfi unwinding
>    15/19 perf, tool: Support user regs and stack in sample parsing
>    16/19 perf, tool: Support for dwarf cfi unwinding on post processing
>    17/19 perf, tool: Support for dwarf mode callchain on perf record
>    18/19 perf, tool: Add dso data caching
>    19/19 perf, tool: Add dso data caching tests
>
>
> I tested on Fedora. There was not much gain on i386, because the
> binaries are compiled with frame pointers. Thought the dwarf
> backtrace is more accurade and unwraps calls in more details
> (functions that do not set the frame pointers).
>
> I could see some improvement on x86_64, where I got full backtrace
> where current code could got just the first address out of the
> instruction pointer.
>
> Example on x86_64:
> [dwarf]
>     perf record -g -e syscalls:sys_enter_write date
>
>     100.00%     date  libc-2.14.90.so  [.] __GI___libc_write
>                 |
>                 --- __GI___libc_write
>                     _IO_file_write@@GLIBC_2.2.5
>                     new_do_write
>                     _IO_do_write@@GLIBC_2.2.5
>                     _IO_file_overflow@@GLIBC_2.2.5
>                     0x4022cd
>                     0x401ee6
>                     __libc_start_main
>                     0x4020b9
>
>
> [frame pointer]
>     perf record -g fp -e syscalls:sys_enter_write date
>
>     100.00%     date  libc-2.14.90.so  [.] __GI___libc_write
>                 |
>                 --- __GI___libc_write
>
> Also I tested on coreutils binaries mainly, but I could see
> getting wider backtraces with dwarf unwind for more complex
> application like firefox.
>
> The unwind should go throught [vdso] object. I haven't studied
> the [vsyscall] yet, so not sure there.
>
> Attached patches should work on both x86 and x86_64. I did
> some initial testing so far.
>
> The unwind backtrace can be interrupted by following reasons:
>      - bug in unwind information of processed shared library
>      - bug in unwind processing code (most likely ;) )
>      - insufficient dump stack size
>      - wrong register value - x86_64 does not store whole
>        set of registers when in exception, but so far
>        it looks like RIP and RSP should be enough
>
> thanks for comments,
> jirka
> ---
>   arch/Kconfig                                       |    6 +
>   arch/x86/Kconfig                                   |    1 +
>   arch/x86/include/asm/perf_event.h                  |    2 +
>   arch/x86/include/asm/perf_regs.h                   |   34 ++
>   arch/x86/include/asm/uaccess.h                     |    2 +
>   arch/x86/kernel/Makefile                           |    2 +
>   arch/x86/kernel/perf_regs.c                        |   91 ++++
>   arch/x86/lib/usercopy.c                            |   15 +-
>   include/linux/perf_event.h                         |   24 +-
>   include/linux/perf_regs.h                          |   19 +
>   kernel/events/callchain.c                          |   25 +-
>   kernel/events/core.c                               |  132 +++++-
>   kernel/events/internal.h                           |   69 ++-
>   kernel/events/ring_buffer.c                        |   10 +-
>   tools/perf/Makefile                                |   45 ++-
>   tools/perf/arch/x86/Makefile                       |    3 +
>   tools/perf/arch/x86/include/perf_regs.h            |   80 +++
>   tools/perf/arch/x86/util/unwind.c                  |  111 ++++
>   tools/perf/builtin-record.c                        |   86 +++-
>   tools/perf/builtin-report.c                        |   24 +-
>   tools/perf/builtin-script.c                        |   56 ++-
>   tools/perf/builtin-test.c                          |    7 +-
>   tools/perf/builtin-top.c                           |    7 +-
>   tools/perf/config/feature-tests.mak                |   25 +
>   tools/perf/perf.h                                  |    9 +-
>   tools/perf/util/annotate.c                         |    2 +-
>   tools/perf/util/dso-test-data.c                    |  154 ++++++
>   tools/perf/util/event.h                            |   15 +-
>   tools/perf/util/evlist.c                           |   16 +
>   tools/perf/util/evlist.h                           |    2 +
>   tools/perf/util/evsel.c                            |   35 ++-
>   tools/perf/util/include/linux/compiler.h           |    1 +
>   tools/perf/util/map.c                              |   23 +-
>   tools/perf/util/map.h                              |    9 +-
>   tools/perf/util/perf_regs.h                        |   14 +
>   tools/perf/util/python.c                           |    3 +-
>   .../perf/util/scripting-engines/trace-event-perl.c |    3 +-
>   .../util/scripting-engines/trace-event-python.c    |    3 +-
>   tools/perf/util/session.c                          |  110 ++++-
>   tools/perf/util/session.h                          |   17 +-
>   tools/perf/util/symbol.c                           |  435 +++++++++++++---
>   tools/perf/util/symbol.h                           |   52 ++-
>   tools/perf/util/trace-event-scripting.c            |    3 +-
>   tools/perf/util/trace-event.h                      |    5 +-
>   tools/perf/util/unwind.c                           |  565 ++++++++++++++++++++
>   tools/perf/util/unwind.h                           |   34 ++
>   tools/perf/util/vdso.c                             |   90 +++
>   tools/perf/util/vdso.h                             |    8 +
>   48 files changed, 2278 insertions(+), 206 deletions(-)
>


Download attachment "call-graph-dwarf.pdf" of type "application/pdf" (71783 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ