Date:	Tue, 26 Apr 2016 15:20:32 -0300
From:	Arnaldo Carvalho de Melo <acme@...nel.org>
To:	Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Stephane Eranian <eranian@...gle.com>,
	David Ahern <dsahern@...il.com>,
	Milian Wolff <milian.wolff@...b.com>,
	Frédéric Weisbecker <fweisbec@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Namhyung Kim <namhyung@...nel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: LBR callchains from tracepoints

On Tue, Apr 26, 2016 at 10:26:32AM -0700, Alexei Starovoitov wrote:
> On Tue, Apr 26, 2016 at 06:38:28PM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 25, 2016 at 10:24:31PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > On Mon, Apr 25, 2016 at 10:03:58PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > I now need to continue investigation why this doesn't seem to work from
> > > > tracepoints...

> > > Bummer, the changeset (at the end of this message) hasn't any
> > > explanation, is this really impossible? I.e. LBR callstacks from
> > > tracepoints? Even if we set perf_event_attr.exclude_callchain_kernel?

> > Could maybe be done, but it's tricky to implement as the LBR is managed
> > by the hardware PMU and tracepoints are a software PMU, so we need to
> > then somehow frob with cross-pmu resources, in a very arch specific way.
> > And programmability of the hardware PMU will then depend on events
> > outside of it.
 
> btw we're thinking to add support for lbr to bpf, so that from the program
> we can get accurate and fast stacks. That's especially important for user
> space stacks. No clear idea how to do it yet, but it would be really useful.

Yeah, and that already works in perf, it's just that it doesn't work from
some kinds of events (PERF_TYPE_SOFTWARE, PERF_TYPE_TRACEPOINT, etc.), as
described in the changeset I mentioned.

'perf trace --call-graph lbr' doesn't work right now, even when we are
interested only in the user space bits, i.e. when setting
perf_event_attr.exclude_callchain_kernel.
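
To make the intent concrete, here is a minimal, hypothetical
perf_event_open() sketch of what "user space bits only" means: a tracepoint
event asking for LBR call stacks with exclude_callchain_kernel set. The
tracepoint id (config) is a placeholder you would read from tracefs, and the
error handling is illustrative; as discussed above, the LBR is driven by the
hardware PMU, so on current kernels this open is expected to fail.

  /* hypothetical sketch: LBR call stacks from a tracepoint, user space only */
  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <string.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          struct perf_event_attr attr;
          int fd;

          memset(&attr, 0, sizeof(attr));
          attr.size          = sizeof(attr);
          attr.type          = PERF_TYPE_TRACEPOINT;
          attr.config        = 0; /* tracepoint id from tracefs, placeholder */
          attr.sample_period = 1;
          attr.sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                               PERF_SAMPLE_CALLCHAIN | PERF_SAMPLE_BRANCH_STACK;
          attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                                    PERF_SAMPLE_BRANCH_CALL_STACK;
          attr.exclude_callchain_kernel = 1; /* user space bits only */

          /* monitor the current task on any CPU */
          fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
          if (fd < 0)
                  perror("perf_event_open"); /* expected to fail on current kernels */
          else
                  close(fd);
          return 0;
  }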

  # perf trace --call-graph dwarf

works, but that, as you mention, really isn't "fast", and it is sometimes
not accurate, or at least wasn't with broken toolchains.

Here is an example mixing strace-like output with userspace-only DWARF
callchains (it would be lovely if this was with LBR, huh?), FP callchains for
the sched:sched_switch tracepoint, and LBR callchains for a hardware event,
cycles; see further below for the reason for the broken timestamps on
PERF_TYPE_HARDWARE events:

  # perf trace -e nanosleep --event sched:sched_switch/call-graph=fp/ --ev cycles/call-graph=lbr,period=100/ usleep 1
18446744073709.551 (         ): cycles/call-graph=lbr,period=100/:)
                                       __intel_pmu_enable_all+0xfe200080 ([kernel.kallsyms])
                                       intel_pmu_enable_all+0xfe200010 ([kernel.kallsyms])
                                       x86_pmu_enable+0xfe200271 ([kernel.kallsyms])
                                       perf_pmu_enable.part.81+0xfe200007 ([kernel.kallsyms])
                                       ctx_resched+0xfe20007a ([kernel.kallsyms])
                                       perf_event_exec+0xfe20011d ([kernel.kallsyms])
                                       setup_new_exec+0xfe20006f ([kernel.kallsyms])
                                       load_elf_binary+0xfe2003e3 ([kernel.kallsyms])
                                       search_binary_handler+0xfe20009e ([kernel.kallsyms])
                                       do_execveat_common.isra.38+0xfe20052c ([kernel.kallsyms])
                                       sys_execve+0xfe20003a ([kernel.kallsyms])
                                       do_syscall_64+0xfe200062 ([kernel.kallsyms])
                                       return_from_SYSCALL_64+0xfe200000 ([kernel.kallsyms])
                                       [0] ([unknown])
     0.310 ( 0.006 ms): usleep/20951 nanosleep(rqtp: 0x7ffda8904500       ) ...
     0.310 (         ): sched:sched_switch:usleep:20951 [120] S ==> swapper/3:0 [120])
                                       __schedule+0xfe200402 ([kernel.kallsyms])
                                       schedule+0xfe200035 ([kernel.kallsyms])
                                       do_nanosleep+0xfe20006f ([kernel.kallsyms])
                                       hrtimer_nanosleep+0xfe2000dc ([kernel.kallsyms])
                                       sys_nanosleep+0xfe20007a ([kernel.kallsyms])
                                       do_syscall_64+0xfe200062 ([kernel.kallsyms])
                                       return_from_SYSCALL_64+0xfe200000 ([kernel.kallsyms])
                                       __nanosleep+0xffff00bfad62c010 (/usr/lib64/libc-2.22.so)
18446679523046.461 (         ): cycles/call-graph=lbr,period=100/:)
                                       perf_pmu_enable.part.81+0xfe200007 ([kernel.kallsyms])
                                       __perf_event_task_sched_in+0xfe2001ad ([kernel.kallsyms])
                                       finish_task_switch+0xfe200156 ([kernel.kallsyms])
                                       __schedule+0xfe200397 ([kernel.kallsyms])
                                       schedule+0xfe200035 ([kernel.kallsyms])
                                       do_nanosleep+0xfe20006f ([kernel.kallsyms])
                                       hrtimer_nanosleep+0xfe2000dc ([kernel.kallsyms])
                                       sys_nanosleep+0xfe20007a ([kernel.kallsyms])
                                       do_syscall_64+0xfe200062 ([kernel.kallsyms])
                                       return_from_SYSCALL_64+0xfe200000 ([kernel.kallsyms])
                                       [0] ([unknown])
     0.377 ( 0.073 ms): usleep/20951  ... [continued]: nanosleep()) = 0
  [root@...et ~]# 


perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0 (PERF_COUNT_HW_CPU_CYCLES)
  size                             112
  { sample_period, sample_freq }   100
  sample_type                      IP|TID|CALLCHAIN|BRANCH_STACK|IDENTIFIER
  read_format                      ID
  disabled                         1
  inherit                          1
  enable_on_exec                   1
  sample_id_all                    1
  exclude_guest                    1
  { wakeup_events, wakeup_watermark } 1
  branch_sample_type               USER|CALL_STACK|NO_FLAGS|NO_CYCLES

PERF_SAMPLE_TIME is missing from sample_type, which is why the timestamps
above are broken; will fix.
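
For reference, a hedged C sketch of the same attr with the missing
PERF_SAMPLE_TIME bit added to sample_type, which is what should make the
timestamps above sane; the field values mirror the dump, while the helper
name is just illustrative.

  /* sketch: the attr above, plus the missing PERF_SAMPLE_TIME bit */
  #include <linux/perf_event.h>
  #include <string.h>

  void setup_cycles_lbr_attr(struct perf_event_attr *attr)
  {
          memset(attr, 0, sizeof(*attr));
          attr->size = sizeof(*attr);              /* 112 in the dump above */
          attr->type = PERF_TYPE_HARDWARE;
          attr->config = PERF_COUNT_HW_CPU_CYCLES;
          attr->sample_period = 100;
          attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                              PERF_SAMPLE_TIME |   /* the missing bit */
                              PERF_SAMPLE_CALLCHAIN |
                              PERF_SAMPLE_BRANCH_STACK |
                              PERF_SAMPLE_IDENTIFIER;
          attr->read_format = PERF_FORMAT_ID;
          attr->disabled = 1;
          attr->inherit = 1;
          attr->enable_on_exec = 1;
          attr->sample_id_all = 1;
          attr->exclude_guest = 1;
          attr->wakeup_events = 1;
          attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                                     PERF_SAMPLE_BRANCH_CALL_STACK |
                                     PERF_SAMPLE_BRANCH_NO_FLAGS |
                                     PERF_SAMPLE_BRANCH_NO_CYCLES;
  }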

- Arnaldo
