Message-ID: <CAM9d7ch9QAoernuHFY3pC0zr+ROHS_m1WquNUrcC7E15otBDnw@mail.gmail.com>
Date: Tue, 7 Mar 2023 13:05:56 -0800
From: Namhyung Kim <namhyung@...nel.org>
To: Adrian Hunter <adrian.hunter@...el.com>
Cc: Arnaldo Carvalho de Melo <acme@...nel.org>,
Jiri Olsa <jolsa@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Ian Rogers <irogers@...gle.com>,
Andi Kleen <ak@...ux.intel.com>,
Kan Liang <kan.liang@...ux.intel.com>,
Song Liu <song@...nel.org>,
Stephane Eranian <eranian@...gle.com>,
Ravi Bangoria <ravi.bangoria@....com>,
Leo Yan <leo.yan@...aro.org>,
James Clark <james.clark@....com>, Hao Luo <haoluo@...gle.com>,
LKML <linux-kernel@...r.kernel.org>,
linux-perf-users@...r.kernel.org, bpf@...r.kernel.org
Subject: Re: [PATCH 3/8] perf record: Add BPF event filter support
On Tue, Mar 7, 2023 at 5:04 AM Adrian Hunter <adrian.hunter@...el.com> wrote:
>
> On 23/02/23 01:01, Namhyung Kim wrote:
> > Use --filter option to set BPF filter for generic events other than the
> > tracepoints or Intel PT. The BPF program will check the sample data and
> > filter according to the expression.
> >
> > For example, the below is the typical perf record for frequency mode.
> > The sample period started from 1 and increased gradually.
> >
> > $ sudo ./perf record -e cycles true
> > $ sudo ./perf script
> > perf-exec 2272336 546683.916875: 1 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
> > perf-exec 2272336 546683.916892: 1 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
> > perf-exec 2272336 546683.916899: 3 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
> > perf-exec 2272336 546683.916905: 17 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
> > perf-exec 2272336 546683.916911: 100 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
> > perf-exec 2272336 546683.916917: 589 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
> > perf-exec 2272336 546683.916924: 3470 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
> > perf-exec 2272336 546683.916930: 20465 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
> > true 2272336 546683.916940: 119873 cycles: ffffffff8283afdd perf_iterate_ctx+0x2d ([kernel.kallsyms])
> > true 2272336 546683.917003: 461349 cycles: ffffffff82892517 vma_interval_tree_insert+0x37 ([kernel.kallsyms])
> > true 2272336 546683.917237: 635778 cycles: ffffffff82a11400 security_mmap_file+0x20 ([kernel.kallsyms])
> >
> > When you add a BPF filter to get samples having periods greater than 1000,
> > the output would look like below:
> >
> > $ sudo ./perf record -e cycles --filter 'period > 1000' true
> > $ sudo ./perf script
> > perf-exec 2273949 546850.708501: 5029 cycles: ffffffff826f9e25 finish_wait+0x5 ([kernel.kallsyms])
> > perf-exec 2273949 546850.708508: 32409 cycles: ffffffff826f9e25 finish_wait+0x5 ([kernel.kallsyms])
> > perf-exec 2273949 546850.708526: 143369 cycles: ffffffff82b4cdbf xas_start+0x5f ([kernel.kallsyms])
> > perf-exec 2273949 546850.708600: 372650 cycles: ffffffff8286b8f7 __pagevec_lru_add+0x117 ([kernel.kallsyms])
> > perf-exec 2273949 546850.708791: 482953 cycles: ffffffff829190de __mod_memcg_lruvec_state+0x4e ([kernel.kallsyms])
> > true 2273949 546850.709036: 501985 cycles: ffffffff828add7c tlb_gather_mmu+0x4c ([kernel.kallsyms])
> > true 2273949 546850.709292: 503065 cycles: 7f2446d97c03 _dl_map_object_deps+0x973 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
> >
> > Signed-off-by: Namhyung Kim <namhyung@...nel.org>
> > ---
> > tools/perf/Documentation/perf-record.txt | 15 +++++++++++---
> > tools/perf/util/bpf_counter.c | 3 +--
> > tools/perf/util/evlist.c | 25 +++++++++++++++++-------
> > tools/perf/util/evsel.c | 2 ++
> > tools/perf/util/parse-events.c | 8 +++-----
> > 5 files changed, 36 insertions(+), 17 deletions(-)
> >
> > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > index ff815c2f67e8..9f4e1337e6dc 100644
> > --- a/tools/perf/Documentation/perf-record.txt
> > +++ b/tools/perf/Documentation/perf-record.txt
> > @@ -119,9 +119,12 @@ OPTIONS
> > "perf report" to view group events together.
> >
> > --filter=<filter>::
> > - Event filter. This option should follow an event selector (-e) which
> > - selects either tracepoint event(s) or a hardware trace PMU
> > - (e.g. Intel PT or CoreSight).
> > + Event filter. This option should follow an event selector (-e).
> > + If the event is a tracepoint, the filter string will be parsed by
> > + the kernel. If the event is a hardware trace PMU (e.g. Intel PT
> > + or CoreSight), it'll be processed as an address filter. Otherwise
> > + it means a general filter using BPF which can be applied for any
> > + kind of events.
>
> events -> event
>
> >
> > - tracepoint filters
> >
> > @@ -174,6 +177,12 @@ OPTIONS
> > within a single mapping. MMAP events (or /proc/<pid>/maps) can be
> > examined to determine if that is a possibility.
> >
> > + - bpf filters
> > +
> > + BPF filter can access the sample data and make a decision based on the
>
> 'BPF filter' -> 'A BPF filter'
>
> > + data. Users need to set the appropriate sample type to use the BPF
>
> 'the appropriate' -> 'an appropriate'
>
> Perhaps could expand on what "appropriate sample type" means here,
> also, since the user does not actually specify sample_type directly.
>
> What happens if the sample_type is not appropriate?
Good point.  If it doesn't have an appropriate sample type, a value of 0
will be used, which I expect will make the filter fail.  But we should not
rely on such behavior and should instead ask the user to provide the
sample type.  I think the perf tool should show a warning that names the
missing options.
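Roughly something like this (just a sketch, the helper name and message
are made up and not part of this patch; it assumes util/evsel.h,
util/debug.h and <inttypes.h>, and 'needed' would come from parsing the
filter expression, e.g. 'period > 1000' -> PERF_SAMPLE_PERIOD):

	/* Hypothetical helper: warn if the event's sample_type lacks the
	 * fields that the BPF filter wants to look at, instead of letting
	 * the filter silently compare against zero.
	 */
	static int check_bpf_filter_sample_type(struct evsel *evsel, u64 needed)
	{
		u64 missing = needed & ~evsel->core.attr.sample_type;

		if (!missing)
			return 0;

		pr_warning("BPF filter needs sample type %#" PRIx64 " which is "
			   "not set for event '%s'; please add the matching "
			   "record option\n",
			   missing, evsel__name(evsel));
		return -EINVAL;
	}

Then evlist__apply_filters() could call something like this before
perf_bpf_filter__prepare() and bail out early with a clear message.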
>
> > + filter.
> > +
> > Multiple filters can be separated with space or comma.
> >
> > --exclude-perf::
> > diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
> > index eeee899fcf34..0414385794ee 100644
> > --- a/tools/perf/util/bpf_counter.c
> > +++ b/tools/perf/util/bpf_counter.c
> > @@ -781,8 +781,7 @@ extern struct bpf_counter_ops bperf_cgrp_ops;
> >
> > static inline bool bpf_counter_skip(struct evsel *evsel)
> > {
> > - return list_empty(&evsel->bpf_counter_list) &&
> > - evsel->follower_skel == NULL;
> > + return evsel->bpf_counter_ops == NULL;
> > }
> >
> > int bpf_counter__install_pe(struct evsel *evsel, int cpu_map_idx, int fd)
> > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > index 817df2504a1e..1ae047b24c89 100644
> > --- a/tools/perf/util/evlist.c
> > +++ b/tools/perf/util/evlist.c
> > @@ -31,6 +31,7 @@
> > #include "util/evlist-hybrid.h"
> > #include "util/pmu.h"
> > #include "util/sample.h"
> > +#include "util/bpf-filter.h"
> > #include <signal.h>
> > #include <unistd.h>
> > #include <sched.h>
> > @@ -1086,17 +1087,27 @@ int evlist__apply_filters(struct evlist *evlist, struct evsel **err_evsel)
> > int err = 0;
> >
> > evlist__for_each_entry(evlist, evsel) {
> > - if (evsel->filter == NULL)
> > - continue;
> > -
> > /*
> > * filters only work for tracepoint event, which doesn't have cpu limit.
> > * So evlist and evsel should always be same.
> > */
> > - err = perf_evsel__apply_filter(&evsel->core, evsel->filter);
> > - if (err) {
> > - *err_evsel = evsel;
> > - break;
> > + if (evsel->filter) {
>
> Extra space before 'if'
Oops, will fix.
Thanks,
Namhyung
>
> > + err = perf_evsel__apply_filter(&evsel->core, evsel->filter);
> > + if (err) {
> > + *err_evsel = evsel;
> > + break;
> > + }
> > + }
> > +
> > + /*
> > + * non-tracepoint events can have BPF filters.
> > + */
> > + if (!list_empty(&evsel->bpf_filters)) {
> > + err = perf_bpf_filter__prepare(evsel);
> > + if (err) {
> > + *err_evsel = evsel;
> > + break;
> > + }
> > }
> > }
> >
> > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > index 51e8ce6edddc..cae624fde026 100644
> > --- a/tools/perf/util/evsel.c
> > +++ b/tools/perf/util/evsel.c
> > @@ -50,6 +50,7 @@
> > #include "off_cpu.h"
> > #include "../perf-sys.h"
> > #include "util/parse-branch-options.h"
> > +#include "util/bpf-filter.h"
> > #include <internal/xyarray.h>
> > #include <internal/lib.h>
> > #include <internal/threadmap.h>
> > @@ -1494,6 +1495,7 @@ void evsel__exit(struct evsel *evsel)
> > assert(list_empty(&evsel->core.node));
> > assert(evsel->evlist == NULL);
> > bpf_counter__destroy(evsel);
> > + perf_bpf_filter__destroy(evsel);
> > evsel__free_counts(evsel);
> > perf_evsel__free_fd(&evsel->core);
> > perf_evsel__free_id(&evsel->core);
> > diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> > index 0336ff27c15f..4371a2bb2564 100644
> > --- a/tools/perf/util/parse-events.c
> > +++ b/tools/perf/util/parse-events.c
> > @@ -27,6 +27,7 @@
> > #include "perf.h"
> > #include "util/parse-events-hybrid.h"
> > #include "util/pmu-hybrid.h"
> > +#include "util/bpf-filter.h"
> > #include "tracepoint.h"
> > #include "thread_map.h"
> >
> > @@ -2537,11 +2538,8 @@ static int set_filter(struct evsel *evsel, const void *arg)
> > perf_pmu__scan_file(pmu, "nr_addr_filters",
> > "%d", &nr_addr_filters);
> >
> > - if (!nr_addr_filters) {
> > - fprintf(stderr,
> > - "This CPU does not support address filtering\n");
> > - return -1;
> > - }
> > + if (!nr_addr_filters)
> > + return perf_bpf_filter__parse(&evsel->bpf_filters, str);
> >
> > if (evsel__append_addr_filter(evsel, str) < 0) {
> > fprintf(stderr,
>