linux-kernel - Re: [PATCH] perf/core: Add a tracepoint for perf sampling

Date:	Tue, 2 Aug 2016 19:44:17 -0700
From:	Brendan Gregg <bgregg@...flix.com>
To:	"Wangnan (F)" <wangnan0@...wei.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Arnaldo Carvalho de Melo <acme@...nel.org>,
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
	linux-kernel@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>
Subject: Re: [PATCH] perf/core: Add a tracepoint for perf sampling

On Fri, Jul 29, 2016 at 8:34 PM, Wangnan (F) <wangnan0@...wei.com> wrote:
>
>
> On 2016/7/30 2:05, Brendan Gregg wrote:
>>
>> On Tue, Jul 19, 2016 at 4:20 PM, Brendan Gregg <bgregg@...flix.com> wrote:
>>>
>>> When perf is performing hrtimer-based sampling, this tracepoint can be used
>>> by BPF to run additional logic on each sample. For example, BPF can fetch
>>> stack traces and frequency count them in kernel context, for an efficient
>>> profiler.
>>
>> Any comments on this patch? Thanks,
>>
>> Brendan
>
>
> Sorry for the late reply.
>
> I think it is a useful feature. Could you please provide an example
> to show how to use it in perf?

Yes. The following example samples at 999 Hertz and emits the
instruction pointer only when it falls within a custom address range,
as checked by BPF:

# ./perf record -e bpf-output/no-inherit,name=evt/ \
    -e ./sampleip_range.c/map:channel.event=evt/ \
    -a ./perf record -F 999 -e cpu-clock -N -a -o /dev/null sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB /dev/null ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.134 MB perf.data (222 samples) ]

# ./perf script -F comm,pid,time,bpf-output
'bpf-output' not valid for hardware events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
              dd  6501  3058.117379:
      BPF output: 0000: 3c 4c 21 81 ff ff ff ff  <L!.....
                  0008: 00 00 00 00              ....

              dd  6501  3058.130392:
      BPF output: 0000: 55 4c 21 81 ff ff ff ff  UL!.....
                  0008: 00 00 00 00              ....

              dd  6501  3058.131393:
      BPF output: 0000: 55 4c 21 81 ff ff ff ff  UL!.....
                  0008: 00 00 00 00              ....

              dd  6501  3058.149411:
      BPF output: 0000: e1 4b 21 81 ff ff ff ff  .K!.....
                  0008: 00 00 00 00              ....

              dd  6501  3058.155417:
      BPF output: 0000: 76 4c 21 81 ff ff ff ff  vL!.....
                  0008: 00 00 00 00              ....

In that example, the outer perf loads the BPF program that emits the
filtered details, and runs a second perf as its workload to configure
the sampling. We can certainly improve how this works. It will become
much more interesting once perf can emit maps and a perf BPF program
can populate a map.

Here's sampleip_range.c:

/************************ BEGIN **************************/
#include <uapi/linux/bpf.h>
#include <uapi/linux/ptrace.h>

#define SEC(NAME) __attribute__((section(NAME), used))

/*
 * Edit the following to match the instruction address range you want to
 * sample, e.g. by looking in /proc/kallsyms. The addresses change with each
 * kernel version and build.
 */
#define RANGE_START  0xffffffff81214b90
#define RANGE_END    0xffffffff81214cd0

struct bpf_map_def {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
};

static int (*probe_read)(void *dst, int size, void *src) =
    (void *)BPF_FUNC_probe_read;
static int (*get_smp_processor_id)(void) =
    (void *)BPF_FUNC_get_smp_processor_id;
static int (*perf_event_output)(void *, struct bpf_map_def *, int, void *,
    unsigned long) = (void *)BPF_FUNC_perf_event_output;

struct bpf_map_def SEC("maps") channel = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(u32),
	.max_entries = __NR_CPUS__,
};

/* from /sys/kernel/debug/tracing/events/perf/perf_hrtimer/format */
struct perf_hrtimer_args {
	unsigned long long pad;
	struct pt_regs *regs;
	struct perf_event *event;
};

SEC("perf:perf_hrtimer")
int func(struct perf_hrtimer_args *ctx)
{
	struct pt_regs regs = {};

	probe_read(&regs, sizeof(regs), ctx->regs);
	if (regs.ip >= RANGE_START && regs.ip < RANGE_END) {
		perf_event_output(ctx, &channel, get_smp_processor_id(),
				  &regs.ip, sizeof(regs.ip));
	}
	return 0;
}

char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
/************************* END ***************************/
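
Since RANGE_START and RANGE_END have to be edited by hand, a small
user-space helper could look them up. The following is only a sketch: it
scans text in /proc/kallsyms format ("address type name" per line) and
assumes a function extends up to the next higher symbol address, which
may not hold for every symbol. find_symbol_range() is a hypothetical
name, not an existing perf or libc function.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Report the address of `sym` and the lowest symbol address above it.
 * `in` must be seekable, since the input is read twice.
 * Returns 0 on success, -1 if the symbol or a higher address is missing.
 */
static int find_symbol_range(FILE *in, const char *sym,
			     uint64_t *start, uint64_t *end)
{
	char line[512], name[256], type;
	unsigned long long addr;
	uint64_t lo = 0, hi = UINT64_MAX;
	int found = 0;

	/* Pass 1: find the address of the requested symbol. */
	while (fgets(line, sizeof(line), in)) {
		if (sscanf(line, "%llx %c %255s", &addr, &type, name) != 3)
			continue;
		if (strcmp(name, sym) == 0) {
			lo = addr;
			found = 1;
			break;
		}
	}
	if (!found)
		return -1;

	/* Pass 2: find the lowest address strictly above it. */
	rewind(in);
	while (fgets(line, sizeof(line), in)) {
		if (sscanf(line, "%llx %c %255s", &addr, &type, name) != 3)
			continue;
		if (addr > lo && addr < hi)
			hi = addr;
	}
	if (hi == UINT64_MAX)
		return -1;

	*start = lo;
	*end = hi;
	return 0;
}
```

To use it, open /proc/kallsyms with fopen() and pass the function name
you want to sample. Note that unprivileged reads of /proc/kallsyms may
show zeroed addresses, depending on kptr_restrict, so this would
normally be run as root.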

>
> If I understand correctly, I can have a BPF script run 99 times per
> second using
>
>   # perf -e cpu-clock/freq=99/ -e mybpf.c ...
>
> And in mybpf.c, attach a BPF script on the new tracepoint. Right?
>
> Also, since we already have timer:hrtimer_expire_entry, please provide
> some further information about why we need a new tracepoint.

timer:hrtimer_expire_entry fires for much more than just the perf
timer. The perf:perf_hrtimer tracepoint also has registers and perf
context as arguments, which can be used for profiling programs.

Thanks for the comments,

Brendan