linux-kernel - Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 23 Jan 2012 11:14:37 +0100
From:	Stephane Eranian <eranian@...gle.com>
To:	linux-kernel@...r.kernel.org
Cc:	peterz@...radead.org, mingo@...e.hu, acme@...radead.org,
	robert.richter@....com, ming.m.lin@...el.com, andi@...stfloor.org,
	asharma@...com, ravitillo@....gov, vweaver1@...s.utk.edu
Subject: Re: [PATCH 00/13] perf_events: add support for sampling taken
 branches (v3)

Any comments on this patch set?


On Mon, Jan 9, 2012 at 5:49 PM, Stephane Eranian <eranian@...gle.com> wrote:
>
> This patchset adds an important and useful new feature to
> perf_events: branch stack sampling. In other words, the
> ability to capture taken branches into each sample.
>
> Statistical sampling of taken branch should not be confused
> for branch tracing. Not all branches are necessarily captured
>
> Sampling taken branches is important for basic block profiling,
> statistical call graph, function call counts. Many of those
> measurements can help drive a compiler optimizer.
>
> The branch stack is a software abstraction which sits on top
> of the PMU hardware. As such, it is not available on all
> processors. For now, the patch provides the generic interface
> and the Intel X86 implementation where it leverages the Last
> Branch Record (LBR) feature (from Core2 to SandyBridge).
>
> Branch stack sampling is supported for both per-thread and
> system-wide modes.
>
> It is possible to filter the type and privilege level of branches
> to sample. The target of the branch is used to determine
> the privilege level.
>
> For each branch, the source and destination are captured. On
> some hardware platforms, it may be possible to also extract
> the target prediction and, in that case, it is also exposed
> to end users.
>
> The branch stack can record a variable number of taken
> branches per sample. Those branches are always consecutive
> in time. The number of branches captured depends on the
> filtering and the underlying hardware. On Intel Nehalem
> and later, up to 16 consecutive branches can be captured
> per sample.
>
> Branch sampling is always coupled with an event. It can
> be any PMU event but it can't be a SW or tracepoint event.
>
> Branch sampling is requested by setting a new sample_type
> flag called: PERF_SAMPLE_BRANCH_STACK.
>
> To support branch filtering, we introduce a new field
> to the perf_event_attr struct: branch_sample_type. We chose
> NOT to overload the config1, config2 field because those
> are related to the event encoding. Branch stack is a
> separate feature which is combined with the event.
>
> The branch_sample_type is a bitmask of possible filters.
> The following filters are defined (more can be added):
> - PERF_SAMPLE_BRANCH_ANY     : any control flow change
> - PERF_SAMPLE_BRANCH_USER    : capture branches when target is at user level
> - PERF_SAMPLE_BRANCH_KERNEL  : capture branches when target is at user level
> - PERF_SAMPLE_BRANCH_ANY_CALL: capture call branches (incl. syscalls)
> - PERF_SAMPLE_BRANCH_ANY_RET : capture return branches (incl. syscall returns)
> - PERF_SAMPLE_BRANCH_IND_CALL: capture indirect calls
>
> It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.
>
> When the privilege level is not specified, the branch stack
> inherits that of the associated event.
>
> Some processors may not offer hardware branch filtering, e.g., Intel
> Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
> X86 implementation in this patchset also provides a SW branch filter
> which works on a best effort basis. It can compensate for the lack
> of LBR filtering. But first and foremost, it helps work around LBR
> filtering errata. The goal is to only capture the type of branches
> requested by the user.
>
> It is possible to combine branch stack sampling with PEBS on Intel
> X86 processors. Depending on the precise_sampling mode, there are
> certain filterting restrictions. When precise_sampling=1, then
> there are no filtering restrictions. When precise_sampling > 1,
> then only ANY|USER|KERNEL filter can be used. This comes from
> the fact that the kernel uses LBR to compensate for the PEBS
> off-by-1 skid on the instruction pointer.
>
> To demonstrate how the perf_event branch stack sampling interface
> works, the patchset also modifies perf record to capture taken
> branches. Similarly perf report is enhanced to display a histogram
> of taken branches.
>
> I would like to thank Roberto Vitillo @ LBL for his work on the perf
> tool for this.
>
> Enough talking, let's take a simple example. Our trivial test program
> goes like this:
>
> void f2(void)
> {}
> void f3(void)
> {}
> void f1(unsigned long n)
> {
>  if (n & 1UL)
>    f2();
>  else
>    f3();
> }
> int main(void)
> {
>  unsigned long i;
>
>  for (i=0; i < N; i++)
>   f1(i);
>  return 0;
> }
>
> $ perf record -b any branchy
> $ perf report -b
> # Events: 23K cycles
> #
> # Overhead  Source Symbol     Target Symbol
> # ........  ................  ................
>
>    18.13%  [.] f1            [.] main
>    18.10%  [.] main          [.] main
>    18.01%  [.] main          [.] f1
>    15.69%  [.] f1            [.] f1
>     9.11%  [.] f3            [.] f1
>     6.78%  [.] f1            [.] f3
>     6.74%  [.] f1            [.] f2
>     6.71%  [.] f2            [.] f1
>
> Of the total number of branches captured, 18.13% were from f1() -> main().
>
> Let's make this clearer by filtering the user call branches only:
>
> $ perf record -b any_call -e cycles:u branchy
> $ perf report -b
> # Events: 19K cycles
> #
> # Overhead  Source Symbol              Target Symbol
> # ........  .........................  .........................
> #
>    52.50%  [.] main                   [.] f1
>    23.99%  [.] f1                     [.] f3
>    23.48%  [.] f1                     [.] f2
>     0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
>     0.01%  [k] _start                 [k] __libc_start_main
>
> Now it is more obvious. %52 of all the captured branches where calls from main() -> f1().
> The rest is split 50/50 between f1() -> f2() and f1() -> f3() which is expected given
> that f1() dispatches based on odd vs. even values of n which is constantly increasing.
>
>
> Here is a kernel example, where we want to sample indirect calls:
> $ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10
> $ perf report -b
> #
> # Overhead  Source Symbol               Target Symbol
> # ........  ..........................  ..........................
> #
>    36.36%  [k] __delay                 [k] delay_tsc
>     9.09%  [k] ktime_get               [k] read_tsc
>     9.09%  [k] getnstimeofday          [k] read_tsc
>     9.09%  [k] notifier_call_chain     [k] tick_notify
>     4.55%  [k] cpuidle_idle_call       [k] intel_idle
>     4.55%  [k] cpuidle_idle_call       [k] menu_reflect
>     2.27%  [k] handle_irq              [k] handle_edge_irq
>     2.27%  [k] ack_apic_edge           [k] native_apic_mem_write
>     2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt
>     2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn
>     2.27%  [k] enqueue_task            [k] enqueue_task_rt
>     2.27%  [k] try_to_wake_up          [k] select_task_rq_rt
>     2.27%  [k] do_timer                [k] read_tsc
>
> Due to HW limitations, branch filtering may be approximate on
> Core, Atom processors. It is more accurate on Nehalem, Westmere
> and best on Sandy Bridge.
>
> In version 2, we've updated the patch to tip/master (commit 5734857) and
> we've incoporated the feedback from v1 concerning anynous bitfield
> struct for branch_stack_entry and the hanlding of i386 ABI binaries
> on 64-bit host in the instr decoder for the LBR SW filter.
>
> In version 3, we've updated to 3.2.0-tip. The Atom revision
> check has been put into its own patch. We fixed a browser
> issue with report report. We fixed all the style issues as well.
>
> Signed-off-by: Stephane Eranian <eranian@...gle.com>
> ---
>
> Roberto Agostino Vitillo (3):
>  perf: add code to support PERF_SAMPLE_BRANCH_STACK
>  perf: add support for sampling taken branch to perf record
>  perf: add support for taken branch sampling to perf report
>
> Stephane Eranian (10):
>  perf_events: add generic taken branch sampling support (v3)
>  perf_events: add Intel LBR MSR definitions
>  perf_events: add Intel X86 LBR sharing logic
>  perf_events: sync branch stack sampling with X86 precise_sampling
>  perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters
>  perf_events: disable LBR support for older Intel Atom processors
>  perf_events: implement PERF_SAMPLE_BRANCH for Intel X86
>  perf_events: add LBR software filter support for Intel X86
>  perf_events: disable PERF_SAMPLE_BRANCH_* when not supported
>  perf_events: add hook to flush branch_stack on context switch
>
>  arch/alpha/kernel/perf_event.c             |    4 +
>  arch/arm/kernel/perf_event.c               |    4 +
>  arch/mips/kernel/perf_event_mipsxx.c       |    4 +
>  arch/powerpc/kernel/perf_event.c           |    4 +
>  arch/sh/kernel/perf_event.c                |    4 +
>  arch/sparc/kernel/perf_event.c             |    4 +
>  arch/x86/include/asm/msr-index.h           |    7 +
>  arch/x86/kernel/cpu/perf_event.c           |   47 +++-
>  arch/x86/kernel/cpu/perf_event.h           |   19 +
>  arch/x86/kernel/cpu/perf_event_amd.c       |    3 +
>  arch/x86/kernel/cpu/perf_event_intel.c     |  120 +++++--
>  arch/x86/kernel/cpu/perf_event_intel_ds.c  |   22 +-
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |  525 ++++++++++++++++++++++++++--
>  include/linux/perf_event.h                 |   78 ++++-
>  kernel/events/core.c                       |  167 +++++++++
>  kernel/events/hw_breakpoint.c              |    6 +
>  tools/perf/Documentation/perf-record.txt   |   18 +
>  tools/perf/Documentation/perf-report.txt   |    7 +
>  tools/perf/builtin-record.c                |   69 ++++
>  tools/perf/builtin-report.c                |   95 +++++-
>  tools/perf/perf.h                          |   18 +
>  tools/perf/util/annotate.c                 |    2 +-
>  tools/perf/util/event.h                    |    1 +
>  tools/perf/util/evsel.c                    |   14 +
>  tools/perf/util/hist.c                     |   93 ++++-
>  tools/perf/util/hist.h                     |    7 +
>  tools/perf/util/session.c                  |   72 ++++
>  tools/perf/util/session.h                  |    4 +
>  tools/perf/util/sort.c                     |  361 ++++++++++++++-----
>  tools/perf/util/sort.h                     |    5 +
>  tools/perf/util/symbol.h                   |   13 +
>  31 files changed, 1601 insertions(+), 196 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/