linux-kernel - Re: [PATCH v6 3/4] perf-stat: enable counting events for BPF programs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <B5D51BCD-C9FA-458D-B8C9-345BC2AD16E6@fb.com>
Date:   Tue, 29 Dec 2020 17:46:18 +0000
From:   Song Liu <songliubraving@...com>
To:     Namhyung Kim <namhyung@...nel.org>
CC:     linux-kernel <linux-kernel@...r.kernel.org>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Mark Rutland <mark.rutland@....com>,
        Jiri Olsa <jolsa@...hat.com>,
        "Kernel Team" <Kernel-team@...com>
Subject: Re: [PATCH v6 3/4] perf-stat: enable counting events for BPF programs



> On Dec 28, 2020, at 11:22 PM, Namhyung Kim <namhyung@...nel.org> wrote:
> 
> On Tue, Dec 29, 2020 at 2:41 AM Song Liu <songliubraving@...com> wrote:
>> 
>> Introduce perf-stat -b option, which counts events for BPF programs, like:
>> 
>> [root@...alhost ~]# ~/perf stat -e ref-cycles,cycles -b 254 -I 1000
>>     1.487903822            115,200      ref-cycles
>>     1.487903822             86,012      cycles
>>     2.489147029             80,560      ref-cycles
>>     2.489147029             73,784      cycles
>>     3.490341825             60,720      ref-cycles
>>     3.490341825             37,797      cycles
>>     4.491540887             37,120      ref-cycles
>>     4.491540887             31,963      cycles
>> 
>> The example above counts cycles and ref-cycles of BPF program of id 254.
>> This is similar to bpftool-prog-profile command, but more flexible.
>> 
>> perf-stat -b creates per-cpu perf_event and loads fentry/fexit BPF
>> programs (monitor-progs) to the target BPF program (target-prog). The
>> monitor-progs read perf_event before and after the target-prog, and
>> aggregate the difference in a BPF map. Then the user space reads data
>> from these maps.
>> 
>> A new struct bpf_counter is introduced to provide common interface that
>> uses BPF programs/maps to count perf events.
>> 
>> Signed-off-by: Song Liu <songliubraving@...com>
>> ---
>> tools/perf/Makefile.perf                      |   2 +-
>> tools/perf/builtin-stat.c                     |  77 ++++-
>> tools/perf/util/Build                         |   1 +
>> tools/perf/util/bpf_counter.c                 | 296 ++++++++++++++++++
>> tools/perf/util/bpf_counter.h                 |  72 +++++
>> .../util/bpf_skel/bpf_prog_profiler.bpf.c     |  93 ++++++
>> tools/perf/util/evsel.c                       |   9 +
>> tools/perf/util/evsel.h                       |   6 +
>> tools/perf/util/stat-display.c                |   4 +-
>> tools/perf/util/stat.c                        |   2 +-
>> tools/perf/util/target.c                      |  34 +-
>> tools/perf/util/target.h                      |  10 +
>> 12 files changed, 588 insertions(+), 18 deletions(-)
>> create mode 100644 tools/perf/util/bpf_counter.c
>> create mode 100644 tools/perf/util/bpf_counter.h
>> create mode 100644 tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c
>> 
>> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
>> index d182a2dbb9bbd..8c4e039c3b813 100644
>> --- a/tools/perf/Makefile.perf
>> +++ b/tools/perf/Makefile.perf
>> @@ -1015,7 +1015,7 @@ python-clean:
>> 
>> SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
>> SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
>> -SKELETONS :=
>> +SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
>> 
>> ifdef BUILD_BPF_SKEL
>> BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 8cc24967bc273..09bffb3fbcdd4 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -67,6 +67,7 @@
>> #include "util/top.h"
>> #include "util/affinity.h"
>> #include "util/pfm.h"
>> +#include "util/bpf_counter.h"
>> #include "asm/bug.h"
>> 
>> #include <linux/time64.h>
>> @@ -409,12 +410,31 @@ static int read_affinity_counters(struct timespec *rs)
>>        return 0;
>> }
>> 
>> +static int read_bpf_map_counters(void)
>> +{
>> +       struct evsel *counter;
>> +       int err;
>> +
>> +       evlist__for_each_entry(evsel_list, counter) {
>> +               err = bpf_counter__read(counter);
>> +               if (err)
>> +                       return err;
>> +       }
>> +       return 0;
>> +}
>> +
>> static void read_counters(struct timespec *rs)
>> {
>>        struct evsel *counter;
>> +       int err;
>> 
>> -       if (!stat_config.stop_read_counter && (read_affinity_counters(rs) < 0))
>> -               return;
>> +       if (!stat_config.stop_read_counter) {
>> +               err = read_bpf_map_counters();
>> +               if (err == -EAGAIN)
>> +                       err = read_affinity_counters(rs);
> 
> Instead of checking the error code, can we do something like
> 
>  if (target__has_bpf(target))
>      read_bpf_map_counters();
> 
> ?

Yeah, we can do that. 

> 
>> +               if (err < 0)
>> +                       return;
>> +       }
>> 
>>        evlist__for_each_entry(evsel_list, counter) {
>>                if (counter->err)
>> @@ -496,11 +516,20 @@ static bool handle_interval(unsigned int interval, int *times)
>>        return false;
>> }
>> 
>> -static void enable_counters(void)
>> +static int enable_counters(void)
>> {
>> +       struct evsel *evsel;
>> +       int err;
>> +
>> +       evlist__for_each_entry(evsel_list, evsel) {
>> +               err = bpf_counter__enable(evsel);
>> +               if (err)
>> +                       return err;
> 
> Ditto.

For this one, we still need to check the return value, as bpf_counter__enable()
may fail. We can add a global check to skip the loop. 

> 
>> +       }
>> +

[...]

>> +
>> +#include "bpf_skel/bpf_prog_profiler.skel.h"
>> +
>> +static inline void *u64_to_ptr(__u64 ptr)
>> +{
>> +       return (void *)(unsigned long)ptr;
>> +}
>> +
>> +static void set_max_rlimit(void)
>> +{
>> +       struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>> +
>> +       setrlimit(RLIMIT_MEMLOCK, &rinf);
>> +}
> 
> This looks scary..

I guess this is OK as we requires root rights for -b?

> 

[...]
>> +       if (!counter) {
>> +               close(prog_fd);
>> +               return -1;
>> +       }
>> +
>> +       skel = bpf_prog_profiler_bpf__open();
>> +       if (!skel) {
>> +               pr_err("Failed to open bpf skeleton\n");
>> +               goto err_out;
>> +       }
>> +       skel->rodata->num_cpu = evsel__nr_cpus(evsel);
>> +
>> +       bpf_map__resize(skel->maps.events, evsel__nr_cpus(evsel));
>> +       bpf_map__resize(skel->maps.fentry_readings, 1);
>> +       bpf_map__resize(skel->maps.accum_readings, 1);
>> +
>> +       prog_name = bpf_target_prog_name(prog_fd);
>> +       if (!prog_name) {
>> +               pr_err("Failed to get program name for bpf prog %u. Does it have BTF?\n", prog_id);
>> +               goto err_out;
>> +       }
>> +
>> +       bpf_object__for_each_program(prog, skel->obj) {
>> +               err = bpf_program__set_attach_target(prog, prog_fd, prog_name);
>> +               if (err) {
>> +                       pr_err("bpf_program__set_attach_target failed.\n"
>> +                              "Does bpf prog %u have BTF?\n", prog_id);
>> +                       goto err_out;
>> +               }
>> +       }
>> +       set_max_rlimit();
>> +       err = bpf_prog_profiler_bpf__load(skel);
>> +       if (err) {
>> +               pr_err("bpf_prog_profiler_bpf__load failed\n");
>> +               goto err_out;
>> +       }
>> +
>> +       counter->skel = skel;
>> +       list_add(&counter->list, &evsel->bpf_counter_list);
>> +       close(prog_fd);
>> +       return 0;
>> +err_out:
>> +       free(counter);
>> +       close(prog_fd);
> 
> I don't know how the 'skel' part is managed, is it safe to leave?

Good catch! We do have bpf_program_profiler__destroy() in bpf_program_profiler__load().
But I should have counter->skel = skel in err path. Will fix. 

> 
> 
>> +       return -1;
>> +}
>> +
>> +static int bpf_program_profiler__load(struct evsel *evsel, struct target *target)
>> +{
>> +       char *bpf_str, *bpf_str_, *tok, *saveptr = NULL, *p;
>> +       u32 prog_id;
>> +       int ret;
>> +
>> +       bpf_str_ = bpf_str = strdup(target->bpf_str);
>> +       if (!bpf_str)
>> +               return -1;
>> +
>> +       while ((tok = strtok_r(bpf_str, ",", &saveptr)) != NULL) {
>> +               prog_id = strtoul(tok, &p, 10);
>> +               if (prog_id == 0 || prog_id == UINT_MAX ||
>> +                   (*p != '\0' && *p != ',')) {
>> +                       pr_err("Failed to parse bpf prog ids %s\n",
>> +                              target->bpf_str);
>> +                       return -1;
>> +               }
>> +
>> +               ret = bpf_program_profiler_load_one(evsel, prog_id);
>> +               if (ret) {
>> +                       bpf_program_profiler__destroy(evsel);
>> +                       free(bpf_str_);
>> +                       return -1;
>> +               }
>> +               bpf_str = NULL;
>> +       }
>> +       free(bpf_str_);
>> +       return 0;
>> +}
>> +
>> +static int bpf_program_profiler__enable(struct evsel *evsel)
>> +{
>> +       struct bpf_counter *counter;
>> +       int ret;
>> +
>> +       list_for_each_entry(counter, &evsel->bpf_counter_list, list) {
>> +               ret = bpf_prog_profiler_bpf__attach(counter->skel);
>> +               if (ret) {
>> +                       bpf_program_profiler__destroy(evsel);
>> +                       return ret;
>> +               }
>> +       }
>> +       return 0;
>> +}
>> +
>> +static int bpf_program_profiler__read(struct evsel *evsel)
>> +{
>> +       int num_cpu = evsel__nr_cpus(evsel);
>> +       struct bpf_perf_event_value values[num_cpu];
>> +       struct bpf_counter *counter;
>> +       int reading_map_fd;
>> +       __u32 key = 0;
>> +       int err, cpu;
>> +
>> +       if (list_empty(&evsel->bpf_counter_list))
>> +               return -EAGAIN;
>> +
>> +       for (cpu = 0; cpu < num_cpu; cpu++) {
>> +               perf_counts(evsel->counts, cpu, 0)->val = 0;
>> +               perf_counts(evsel->counts, cpu, 0)->ena = 0;
>> +               perf_counts(evsel->counts, cpu, 0)->run = 0;
>> +       }
> 
> Hmm.. not sure it's correct to reset counters here.

Yeah, we need to reset the user space values here. Otherwise, the later aggregation
would give wrong number. 

> 
> 
>> +       list_for_each_entry(counter, &evsel->bpf_counter_list, list) {
>> +               struct bpf_prog_profiler_bpf *skel = counter->skel;
>> +
>> +               reading_map_fd = bpf_map__fd(skel->maps.accum_readings);
>> +
>> +               err = bpf_map_lookup_elem(reading_map_fd, &key, values);
>> +               if (err) {
>> +                       fprintf(stderr, "failed to read value\n");
>> +                       return err;
>> +               }
>> +
>> +               for (cpu = 0; cpu < num_cpu; cpu++) {
>> +                       perf_counts(evsel->counts, cpu, 0)->val += values[cpu].counter;
>> +                       perf_counts(evsel->counts, cpu, 0)->ena += values[cpu].enabled;
>> +                       perf_counts(evsel->counts, cpu, 0)->run += values[cpu].running;
>> +               }
>> +       }
> 
> So this just aggregates all the counters in BPF programs, right?

Yes. 

> 
> 
>> +       return 0;
>> +}
>> +
>> +static int bpf_program_profiler__install_pe(struct evsel *evsel, int cpu,
>> +                                           int fd)
>> +{
>> +       struct bpf_prog_profiler_bpf *skel;
>> +       struct bpf_counter *counter;
>> +       int ret;
>> +
>> +       list_for_each_entry(counter, &evsel->bpf_counter_list, list) {
>> +               skel = counter->skel;
>> +               ret = bpf_map_update_elem(bpf_map__fd(skel->maps.events),
>> +                                         &cpu, &fd, BPF_ANY);
>> +               if (ret)
>> +                       return ret;
>> +       }
>> +       return 0;
>> +}
>> +
>> +struct bpf_counter_ops bpf_program_profiler_ops = {
>> +       .load       = bpf_program_profiler__load,
>> +       .enable     = bpf_program_profiler__enable,
>> +       .read       = bpf_program_profiler__read,
>> +       .destroy    = bpf_program_profiler__destroy,
>> +       .install_pe = bpf_program_profiler__install_pe,
> 
> What is 'pe'?

pe here means perf_event. 

> 
> Btw, do you think other kinds of bpf programs are added later?
> It seems 'perf stat -b' is somewhat coupled with this profiler ops.
> Will it be possible to run other ops in a same evsel?

It will be possible to add other ops. I have some idea of using BPF programs in
other perf scenarios. 

To clarify, I think each instance of evsel should only have one ops attached. 
And each session of perf-stat should only use one kind of ops. 

> 
>> 

[...]

>> +static inline bool target__has_bpf(struct target *target)
>> +{
>> +       return target->bpf_str;
>> +}
>> +
>> static inline bool target__none(struct target *target)
>> {
>>        return !target__has_task(target) && !target__has_cpu(target);
> 
> Shouldn't it have && !target__has_bpf() too?

Will fix. 

Thanks,
Song