linux-kernel - Re: [PATCH v4 00/18] perf: add support for sampling taken branches

Message-ID: <CABPqkBSJcUDHLCvi=iuvmUFPb0L7p-Mr+pASzKun+a9=Z3j36A@mail.gmail.com>
Date:	Thu, 2 Feb 2012 14:23:13 +0100
From:	Stephane Eranian <eranian@...gle.com>
To:	Anshuman Khandual <khandual@...ux.vnet.ibm.com>
Cc:	linux-kernel@...r.kernel.org, peterz@...radead.org, mingo@...e.hu,
	acme@...hat.com, robert.richter@....com, ming.m.lin@...el.com,
	andi@...stfloor.org, asharma@...com, ravitillo@....gov,
	vweaver1@...s.utk.edu, dsahern@...il.com
Subject: Re: [PATCH v4 00/18] perf: add support for sampling taken branches

On Wed, Feb 1, 2012 at 9:41 AM, Anshuman Khandual
<khandual@...ux.vnet.ibm.com> wrote:
> On Saturday 28 January 2012 02:26 AM, Stephane Eranian wrote:
>> This patchset adds an important and useful new feature to
>> perf_events: branch stack sampling. In other words, the
>> ability to capture taken branches into each sample.
>>
>> Statistical sampling of taken branches should not be confused
>> with branch tracing. Not all branches are necessarily captured.
>>
>> Sampling taken branches is important for basic block profiling,
>> statistical call graph, function call counts. Many of those
>> measurements can help drive a compiler optimizer.
>>
>> The branch stack is a software abstraction which sits on top
>> of the PMU hardware. As such, it is not available on all
>> processors. For now, the patch provides the generic interface
>> and the Intel X86 implementation where it leverages the Last
>> Branch Record (LBR) feature (from Core2 to SandyBridge).
>>
>> Branch stack sampling is supported for both per-thread and
>> system-wide modes.
>>
>> It is possible to filter the type and privilege level of branches
>> to sample. The target of the branch is used to determine
>> the privilege level.
>>
>> For each branch, the source and destination are captured. On
>> some hardware platforms, it may be possible to also extract
>> the target prediction and, in that case, it is also exposed
>> to end users.
>>
>> The branch stack can record a variable number of taken
>> branches per sample. Those branches are always consecutive
>> in time. The number of branches captured depends on the
>> filtering and the underlying hardware. On Intel Nehalem
>> and later, up to 16 consecutive branches can be captured
>> per sample.
>>
>> Branch sampling is always coupled with an event. It can
>> be any PMU event but it can't be a SW or tracepoint event.
>>
>> Branch sampling is requested by setting a new sample_type
>> flag called: PERF_SAMPLE_BRANCH_STACK.
>>
>> To support branch filtering, we introduce a new field
>> to the perf_event_attr struct: branch_sample_type. We chose
>> NOT to overload the config1, config2 field because those
>> are related to the event encoding. Branch stack is a
>> separate feature which is combined with the event.
>>
>> The branch_sample_type is a bitmask of possible filters.
>> The following filters are defined (more can be added):
>> - PERF_SAMPLE_BRANCH_ANY     : any control flow change
>> - PERF_SAMPLE_BRANCH_USER    : branches when target is at user level
>> - PERF_SAMPLE_BRANCH_KERNEL  : branches when target is at kernel level
>> - PERF_SAMPLE_BRANCH_HV      : branches when target is at hypervisor level
>> - PERF_SAMPLE_BRANCH_ANY_CALL: call branches (incl. syscalls)
>> - PERF_SAMPLE_BRANCH_ANY_RET : return branches (incl. syscall returns)
>> - PERF_SAMPLE_BRANCH_IND_CALL: indirect calls
>>
>> It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.
>>
>> When the privilege level is not specified, the branch stack
>> inherits that of the associated event.
>>
>> Some processors may not offer hardware branch filtering, e.g., Intel
>> Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
>> X86 implementation in this patchset also provides a SW branch filter
>> which works on a best effort basis. It can compensate for the lack
>> of LBR filtering. But first and foremost, it helps work around LBR
>> filtering errata. The goal is to only capture the type of branches
>> requested by the user.
>>
>> It is possible to combine branch stack sampling with PEBS on Intel
>> X86 processors. Depending on the precise_sampling mode, there are
>> certain filtering restrictions. When precise_sampling=1, then
>> there are no filtering restrictions. When precise_sampling > 1,
>> then only ANY|USER|KERNEL filter can be used. This comes from
>> the fact that the kernel uses LBR to compensate for the PEBS
>> off-by-1 skid on the instruction pointer.
>>
>> To demonstrate how the perf_event branch stack sampling interface
>> works, the patchset also modifies perf record to capture taken
>> branches. Similarly perf report is enhanced to display a histogram
>> of taken branches.
>>
>> I would like to thank Roberto Vitillo @ LBL for his work on the perf
>> tool for this.
>>
>> Enough talking, let's take a simple example. Our trivial test program
>> goes like this:
>>
>> void f2(void)
>> {}
>> void f3(void)
>> {}
>> void f1(unsigned long n)
>> {
>>   if (n & 1UL)
>>     f2();
>>   else
>>     f3();
>> }
>> int main(void)
>> {
>>   unsigned long i;
>>
>>   for (i=0; i < N; i++)
>>    f1(i);
>>   return 0;
>> }
>>
>> $ perf record -b any branchy
>> $ perf report -b
>> # Events: 23K cycles
>> #
>> # Overhead  Source Symbol     Target Symbol
>> # ........  ................  ................
>>
>>     18.13%  [.] f1            [.] main
>>     18.10%  [.] main          [.] main
>>     18.01%  [.] main          [.] f1
>>     15.69%  [.] f1            [.] f1
>>      9.11%  [.] f3            [.] f1
>>      6.78%  [.] f1            [.] f3
>>      6.74%  [.] f1            [.] f2
>>      6.71%  [.] f2            [.] f1
>>
>> Of the total number of branches captured, 18.13% were from f1() -> main().
>>
>> Let's make this clearer by filtering the user call branches only:
>>
>> $ perf record -b any_call -e cycles:u branchy
>> $ perf report -b
>> # Events: 19K cycles
>> #
>> # Overhead  Source Symbol              Target Symbol
>> # ........  .........................  .........................
>> #
>>     52.50%  [.] main                   [.] f1
>>     23.99%  [.] f1                     [.] f3
>>     23.48%  [.] f1                     [.] f2
>>      0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
>>      0.01%  [k] _start                 [k] __libc_start_main
>>
>> Now it is more obvious. 52% of all the captured branches were calls from main() -> f1().
>> The rest is split 50/50 between f1() -> f2() and f1() -> f3(), which is expected given
>> that f1() dispatches based on odd vs. even values of n, which is constantly increasing.
>>
>>
>> Here is a kernel example, where we want to sample indirect calls:
>> $ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10
>> $ perf report -b
>> #
>> # Overhead  Source Symbol               Target Symbol
>> # ........  ..........................  ..........................
>> #
>>     36.36%  [k] __delay                 [k] delay_tsc
>>      9.09%  [k] ktime_get               [k] read_tsc
>>      9.09%  [k] getnstimeofday          [k] read_tsc
>>      9.09%  [k] notifier_call_chain     [k] tick_notify
>>      4.55%  [k] cpuidle_idle_call       [k] intel_idle
>>      4.55%  [k] cpuidle_idle_call       [k] menu_reflect
>>      2.27%  [k] handle_irq              [k] handle_edge_irq
>>      2.27%  [k] ack_apic_edge           [k] native_apic_mem_write
>>      2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt
>>      2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn
>>      2.27%  [k] enqueue_task            [k] enqueue_task_rt
>>      2.27%  [k] try_to_wake_up          [k] select_task_rq_rt
>>      2.27%  [k] do_timer                [k] read_tsc
>>
>
> Just wondering whether appending function call chain details to branch stack
> would add value from system performance event analysis perspective.
>

> perf record -g -b any_call,u -e branch-misses:k ls
>
Are you talking about using the content of branch_stack as a substitute
for PERF_SAMPLE_CALLCHAIN? You could, assuming you're sampling
only return branches (not call branches).
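
For reference, requesting return branches only from user space could look roughly like the sketch below. It assumes the field and flag names found in the mainline <linux/perf_event.h>, where the return filter is spelled PERF_SAMPLE_BRANCH_ANY_RETURN rather than the ANY_RET abbreviation used in the cover letter. The helper name init_return_branch_attr and the sampling period are made up for illustration, and the perf_event_open() call itself is left out.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical helper: fill in a perf_event_attr that samples CPU cycles
 * and attaches a branch stack restricted to user-level return branches.
 * The caller still has to pass the result to perf_event_open(2).
 */
void init_return_branch_attr(struct perf_event_attr *attr)
{
	assert(attr != NULL);

	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);

	/* Branch sampling must be coupled with a PMU event, not a SW event. */
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->sample_period = 100000;	/* arbitrary period, illustration only */

	/* Ask for the branch stack in each sample. */
	attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;

	/* Filter: only return branches whose target is at user level. */
	attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
				   PERF_SAMPLE_BRANCH_ANY_RETURN;
}
```

Keeping the filter in branch_sample_type rather than in config1/config2 matches the design described in the cover letter: the event encoding stays untouched and the branch stack is layered on top of it.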

> 15.38% ls libc-2.11.1.so  libc-2.11.1.so  [k] getenv              [k] strncmp
> 15.38% ls libc-2.11.1.so  libc-2.11.1.so  [k] __execvpe           [k] strlen
> 15.38% ls libc-2.11.1.so  libc-2.11.1.so  [k] __execvpe           [k] memcpy
> 15.38% ls ld-2.11.1.so    ld-2.11.1.so    [k] _dl_map_object_from_fd  [k] mmap64
>  7.69% ls libc-2.11.1.so  libc-2.11.1.so  [k] __execvpe           [k] __strchrnul
>  7.69% ls libc-2.11.1.so  libc-2.11.1.so  [k] __execvpe           [k] __execve
>  7.69% ls ld-2.11.1.so    ld-2.11.1.so    [k] _dl_map_object_from_fd  [k] _dl_setup_hash
>  7.69% ls ld-2.11.1.so    ld-2.11.1.so    [k] _dl_map_object_from_fd  [k] close
>  7.69% ls ld-2.11.1.so    ld-2.11.1.so    [k] _dl_map_object_from_fd  [k] memset
>
> From the example above, we can see
>
> (1) 15.38%  ls  libc-2.11.1.so libc-2.11.1.so [k] getenv [k] strncmp
>
>    '[k] getenv ----> [k] strncmp' happened 15% of the time for the branch-misses
>     event overflow.
>
No, that's not how you have to interpret the data. It's not 15.38% of the time.
It's 15.38% of all the captured branches.
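
To make the distinction concrete, here is a small sketch of how such a percentage can be derived. The figures in your output are consistent with 13 captured branch entries in total (2/13 is about 15.38%, 1/13 is about 7.69%), although the actual counts are not shown, so treat that as an inference. The struct and function names below are hypothetical and are not taken from the perf tool.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One captured taken branch, already resolved to symbol names. */
struct branch_edge {
	const char *from;
	const char *to;
};

/*
 * Fraction of all captured branches that match from -> to.
 * This is what the Overhead column expresses here: a share of the
 * captured branch entries, not a share of time.
 */
double edge_share(const struct branch_edge *b, size_t n,
		  const char *from, const char *to)
{
	size_t i, hits = 0;

	assert(b != NULL || n == 0);
	if (n == 0)
		return 0.0;

	for (i = 0; i < n; i++)
		if (!strcmp(b[i].from, from) && !strcmp(b[i].to, to))
			hits++;

	return (double)hits / (double)n;
}
```

With 13 entries of which 2 are getenv -> strncmp, edge_share() returns 2/13, which is the 15.38% figure. Nothing in that computation involves elapsed time or the number of event overflows.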

One of the goals of this first perf report mode is to show how branch_stack can
be used to statistically capture cross-module (or cross-function) calls. In other
words, who calls who and how often. This can be used by compilers to drive
inlining, for instance. The fact that on NHM/WSM/SNB, it is possible to capture
prediction is also interesting, especially for indirect calls.

> (2) But this lacks information from the source code point of view,
>    such as which code path eventually ended up in the branch (getenv
>    ----> strncmp) 15.38% of the time for the event. There can be N
>    function call chains which might lead to the branch (getenv ----> strncmp).
>    Having a percentage distribution of the function callchains for every entry
>    in the output (as above) would be a good idea. This would give complete
>    information (though from statistical sampling) on the source code control flow
>    which would have led to the PMU event.
>
Yes. I think what you are after is more like gprof or perf report -g, i.e., the
callgraph. You can use the branch_stack feature to collect a statistical callgraph
without the need for frame pointers or unwind info. You'd have to filter on return
branches only, then invert the edge. I think we could probably reuse the existing
perf code that handles CALLCHAIN for this. We just haven't had a chance to look at
this yet. But patches can be added later on.
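
As a rough illustration of the "invert the edge" step, assuming the branch stack was filtered to return branches only: a return goes from the callee back to the caller, so swapping source and destination yields a caller -> callee edge. The types and the function name below are hypothetical and only cover the inversion. Resolving addresses to symbols and aggregating the edges is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One branch stack entry: source and destination of a taken branch. */
struct branch_rec {
	uint64_t from;
	uint64_t to;
};

/* A call graph edge expressed as raw addresses. */
struct call_edge {
	uint64_t caller;	/* address inside the calling function */
	uint64_t callee;	/* address inside the called function */
};

/*
 * Turn return branches into call edges by inverting them: a return goes
 * from the callee (where the ret instruction is) to the caller (where
 * execution resumes), so caller = to and callee = from.
 * Returns the number of edges written to out (always n).
 */
size_t invert_returns(const struct branch_rec *ret, size_t n,
		      struct call_edge *out)
{
	size_t i;

	assert((ret != NULL && out != NULL) || n == 0);
	for (i = 0; i < n; i++) {
		out[i].caller = ret[i].to;
		out[i].callee = ret[i].from;
	}
	return n;
}
```

Each inverted edge could then be fed to whatever code already aggregates call chains, which is the reuse suggested above.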

> (3) <percentage of call_chain> <percentage of branch_chain> [EVENT]
>    There may be situations where these chains are overlapping with each other
>    to some extent.
>
> If we change to newt output format, we can display the relative percentages of call
> chains when we click on specific entry of branch chain similar to when we try to
> annotate a symbol in normal perf report newt output.
>
> Any thoughts ?
>