Message-ID: <37D7C6CF3E00A74B8858931C1DB2F077015A8C0A@SHSMSX103.ccr.corp.intel.com>
Date: Fri, 5 Sep 2014 14:25:06 +0000
From: "Liang, Kan" <kan.liang@...el.com>
To: "a.p.zijlstra@...llo.nl" <a.p.zijlstra@...llo.nl>
CC: "mingo@...nel.org" <mingo@...nel.org>,
"acme@...radead.org" <acme@...radead.org>,
"eranian@...gle.com" <eranian@...gle.com>,
"andi@...stfloor.org" <andi@...stfloor.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v5 00/16] perf, x86: Haswell LBR call stack support
Hi Peter and all,
Did you get a chance to review these patches?
Zheng is away. Should I re-send the patches?
Thanks,
Kan
>
> For many profiling tasks we need the callgraph. For example we often need
> to see the caller of a lock or the caller of a memcpy or other library function
> to actually tune the program. Frame pointer unwinding is efficient and works
> well. But frame pointers are off by default on 64bit code (and on modern
> 32bit gccs), so there are many binaries around that do not use frame pointers.
> Profiling unchanged production code is very useful in practice. On some CPUs
> frame pointers also have a high cost. DWARF2 unwinding does not always
> work either and is extremely slow (up to 20% overhead).
>
> Haswell has a new feature that utilizes the existing Last Branch Record facility
> to record call chains. When the feature is enabled, function calls are
> collected as normal, but as return instructions are executed the last captured
> branch record is popped from the on-chip LBR registers. The LBR call stack
> facility provides an alternative way to get the callgraph. It has some
> limitations too, but should work in most cases and is significantly faster
> than DWARF. Frame
> pointer unwinding is still the best default, but LBR call stack is a good
> alternative when nothing else works.
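
To make the push/pop behaviour described above concrete, here is a toy model in Python. It is only a sketch: the class and method names are made up for illustration, the depth of 16 entries is an assumption, and real hardware records branch from/to addresses rather than symbol names.

```python
from collections import deque


class LbrCallStack:
    """Toy model of LBR call-stack mode: calls push a record, returns pop one."""

    def __init__(self, depth=16):
        # depth=16 is an assumption for illustration; the oldest record is
        # silently dropped once the buffer is full.
        self.entries = deque(maxlen=depth)

    def call(self, caller, callee):
        # A call instruction records a (from, to) pair.
        self.entries.append((caller, callee))

    def ret(self):
        # A return instruction pops the last captured record.
        if self.entries:
            self.entries.pop()

    def callchain(self):
        """Return the chain innermost-first, the way perf report prints it."""
        if not self.entries:
            return []
        chain = [callee for _, callee in reversed(self.entries)]
        chain.append(self.entries[0][0])  # outermost caller
        return chain


lbr = LbrCallStack()
lbr.call("_start", "__libc_start_main")
lbr.call("__libc_start_main", "main")
lbr.call("main", "yyparse")
lbr.call("yyparse", "run_code")
lbr.call("run_code", "execute")
lbr.call("execute", "bc_add")
lbr.ret()  # bc_add returns, so its record is popped
lbr.call("execute", "bc_divide")
print(lbr.callchain())
```

At the moment of the final sample only the live frames remain in the buffer, which is why the chain can be read straight out of the LBR without walking the stack.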
>
> When profiling bc(1) on Fedora 19:
> echo 'scale=2000; 4*a(1)' > cmd; perf record -g fp bc -l < cmd
>
> If this feature is enabled, perf report output looks like:
> 50.36% bc bc [.] bc_divide
> |
> --- bc_divide
> execute
> run_code
> yyparse
> main
> __libc_start_main
> _start
>
> 33.66% bc bc [.] _one_mult
> |
> --- _one_mult
> bc_divide
> execute
> run_code
> yyparse
> main
> __libc_start_main
> _start
>
> 7.62% bc bc [.] _bc_do_add
> |
> --- _bc_do_add
> |
> |--99.89%-- 0x2000186a8
> --0.11%-- [...]
>
> 6.83% bc bc [.] _bc_do_sub
> |
> --- _bc_do_sub
> |
> |--99.94%-- bc_add
> | execute
> | run_code
> | yyparse
> | main
> | __libc_start_main
> | _start
> --0.06%-- [...]
>
> 0.46% bc libc-2.17.so [.] __memset_sse2
> |
> --- __memset_sse2
> |
> |--54.13%-- bc_new_num
> | |
> | |--51.00%-- bc_divide
> | | execute
> | | run_code
> | | yyparse
> | | main
> | | __libc_start_main
> | | _start
> | |
> | |--30.46%-- _bc_do_sub
> | | bc_add
> | | execute
> | | run_code
> | | yyparse
> | | main
> | | __libc_start_main
> | | _start
> | |
> | --18.55%-- _bc_do_add
> | bc_add
> | execute
> | run_code
> | yyparse
> | main
> | __libc_start_main
> | _start
> |
> --45.87%-- bc_divide
> execute
> run_code
> yyparse
> main
> __libc_start_main
> _start
>
> If this feature is disabled, perf report output looks like:
> 50.49% bc bc [.] bc_divide
> |
> --- bc_divide
>
> 33.57% bc bc [.] _one_mult
> |
> --- _one_mult
>
> 7.61% bc bc [.] _bc_do_add
> |
> --- _bc_do_add
> 0x2000186a8
>
> 6.88% bc bc [.] _bc_do_sub
> |
> --- _bc_do_sub
>
> 0.42% bc libc-2.17.so [.] __memcpy_ssse3_back
> |
> --- __memcpy_ssse3_back
>
> The LBR call stack has the following known limitations:
> - Zero length calls are not filtered out by hardware
> - Exception handling such as setjmp/longjmp will cause calls/returns not to
> match
> - Pushing a different return address onto the stack will cause calls/returns
> not to match
> - If the callstack is deeper than the LBR, only the last entries are captured
>
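
Two of the limitations listed above (stack deeper than the LBR, and calls without matching returns) can be illustrated with a small simulation. This is a hedged sketch, not the hardware algorithm: the `simulate` helper, the event tuples, and the depth values are all hypothetical, and `longjmp` is modelled simply as an event that the LBR does not see.

```python
from collections import deque


def simulate(events, depth):
    """Replay call/ret/longjmp events through a toy LBR of the given depth."""
    lbr = deque(maxlen=depth)  # oldest record is dropped on overflow
    for ev in events:
        if ev[0] == "call":
            lbr.append(ev[1])
        elif ev[0] == "ret" and lbr:
            lbr.pop()
        # "longjmp" unwinds the real stack without executing return
        # instructions, so the toy LBR leaves its records untouched.
    return list(reversed(lbr))  # innermost callee first


# Call chain deeper than the LBR: only the last entries survive.
deep = [("call", f"f{i}") for i in range(6)]
truncated = simulate(deep, depth=4)
print(truncated)

# main calls a -> b -> c, c longjmps back to main, then main calls d.
# The real stack holds only d, but stale records for a, b and c remain.
events = [("call", "a"), ("call", "b"), ("call", "c"),
          ("longjmp",), ("call", "d")]
stale = simulate(events, depth=16)
print(stale)
```

In the second case the reported chain contains frames that are no longer live, which is the kind of call/return mismatch the list refers to.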
> Changes since v1
> - split change into more patches
> - introduce context switch callback and use it to flush LBR
> - use the context switch callback to save/restore LBR
> - dynamically allocate the memory area for storing the LBR stack, and always
> switch the memory area during context switch
> - disable this feature by default
> - more description in change logs
>
> Changes since v2
> - don't use xchg to switch PMU specific data
> - remove nr_branch_stack from struct perf_event_context
> - simplify the save/restore LBR stack logic
> - remove unnecessary 'has_branch_stack -> needs_branch_stack'
> conversion
> - more description in change logs
>
> Changes since v3
> - remove the sysfs attribute file that disables this feature
>
> Changes since v4
> - re-organize the code that saves/restores the LBR stack
> - allocate pmu specific data when it's needed
> - update code comments
>
> These patches are also available at:
> https://github.com/ukernel/linux.git perf-lbr-callstack
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the
> body of a message to majordomo@...r.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/