Message-ID: <ZrFAIc0G8n0-zgxt@google.com>
Date: Mon, 5 Aug 2024 14:12:01 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Arnaldo Carvalho de Melo <acme@...nel.org>
Cc: Ian Rogers <irogers@...gle.com>, Kan Liang <kan.liang@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>,
Adrian Hunter <adrian.hunter@...el.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
linux-perf-users@...r.kernel.org, KP Singh <kpsingh@...nel.org>,
Song Liu <song@...nel.org>, bpf@...r.kernel.org,
Stephane Eranian <eranian@...gle.com>
Subject: Re: [PATCH] perf bpf-filter: Support multiple events properly
On Mon, Aug 05, 2024 at 04:31:34PM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, Aug 05, 2024 at 11:56:56AM -0700, Namhyung Kim wrote:
> > On Mon, Aug 05, 2024 at 12:03:14PM -0300, Arnaldo Carvalho de Melo wrote:
> > > On Fri, Aug 02, 2024 at 10:37:52AM -0700, Namhyung Kim wrote:
> > > > So far it used tgid as a key to get the filter expressions in the
> > > > pinned filters map for regular users, but that won't work well if the
> > > > user has more than one filter at the same time. Let's add the event id
> > > > to the key of the filter hash map so that it can identify the right
> > > > filter expression in the BPF program.
> > > >
> > > > As the event can be inherited by child tasks, it should use the primary
> > > > id which belongs to the parent (original) event. Since evsel opens the
> > > > event for multiple CPUs and tasks, it needs to maintain a separate hash
> > > > map for the event id.
> > >
> > > I'm trying to test it now, it would be nice to have the series of events
> > > needed to test that the feature is working.
> >
> > Sure, I used the following command.
> >
> > ./perf record -e cycles --filter 'ip < 0xffffffff00000000' -e instructions --filter 'period < 100000' -o- ./perf test -w noploop | ./perf script -i-
>
> Thanks
>
> > >
> > > Some comments below.
> > >
> > > > In the user space, it keeps a list of the multiple evsels and releases
> > > > the entries in both hash maps when it closes the event.
> > > >
> > > > Signed-off-by: Namhyung Kim <namhyung@...nel.org>
> > > > ---
> > > > tools/perf/util/bpf-filter.c | 288 ++++++++++++++++---
> > > > tools/perf/util/bpf_skel/sample-filter.h | 11 +-
> > > > tools/perf/util/bpf_skel/sample_filter.bpf.c | 42 ++-
> > > > tools/perf/util/bpf_skel/vmlinux/vmlinux.h | 5 +
> > > > 4 files changed, 304 insertions(+), 42 deletions(-)
> > > >
> > > > diff --git a/tools/perf/util/bpf-filter.c b/tools/perf/util/bpf-filter.c
> > > > index c5eb0b7eec19..69b147cba969 100644
> > > > --- a/tools/perf/util/bpf-filter.c
> > > > +++ b/tools/perf/util/bpf-filter.c
> > > > @@ -1,4 +1,45 @@
> > > > /* SPDX-License-Identifier: GPL-2.0 */
> > > > +/**
> > > > + * Generic event filter for sampling events in BPF.
> > > > + *
> > > > + * The BPF program is fixed; it just reads filter expressions from the 'filters'
> > > > + * map and compares the sample data in order to reject samples that don't match.
> > > > + * Each filter expression contains a sample flag (term) to compare, an operation
> > > > + * (==, >=, and so on) and a value.
> > > > + *
> > > > + * Note that each entry has an array of filter repxressions and it only succeeds
> > >
> > > expressions
> >
> > Oops, thanks.
> >
> > >
> > > > + * when all of the expressions are satisfied. But it supports logical OR using
> > > > + * a GROUP operation, which is satisfied when any of its member expressions
> > > > + * evaluates to true. Nested GROUP operations are not allowed for now.
> > > > + *
> > > > + * To support non-root users, the filters map can be loaded and pinned in the BPF
> > > > + * filesystem by root (perf record --setup-filter pin). Then each user will get
> > > > + * a new entry in the shared filters map to fill the filter expressions. And the
> > > > + * BPF program will find the filter using (task-id, event-id) as a key.
> > > > + *
> > > > + * The pinned BPF object (shared for regular users) has:
> > > > + *
> > > > + * event_hash |
> > > > + * | | |
> > > > + * event->id ---> | id | ---+ idx_hash | filters
> > > > + * | | | | | | | |
> > > > + * | .... | +-> | idx | --+--> | exprs | ---> perf_bpf_filter_entry[]
> > > > + * | | | | | | .op
> > > > + * task id (tgid) --------------+ | .... | | | ... | .term (+ part)
> > > > + * | .value
> > > > + * |
> > > > + * ======= (root would skip this part) ======== (compares it in a loop)
> > > > + *
> > > > + * This is used for per-task use cases, while system-wide profiling (normally
> > > > + * from the root user) uses its own separate copy of the program and the maps
> > > > + * so that it can proceed even if many non-root users are using the filters at
> > > > + * the same time. In that case the filters map has a single entry and there is
> > > > + * no need to use the hash maps to get the index (key) into the filters map
> > > > + * (IOW it's always 0).
> > > > + *
> > > > + * The BPF program returns 1 to accept the sample or 0 to drop it.
> > > > + * The 'dropped' map keeps track of how many samples were dropped by the
> > > > + * filter, and that count will be reported as lost samples.
> > >
> > > I think there is value in reporting how many were filtered out, I'm just
> > > unsure about reporting it as "lost" samples, as lost has another
> > > semantic associated, i.e. ring buffer was full or couldn't process it
> > > for some other resource starvation issue, no?
> >
> > Then we need a way to save the information. It could be a new record
> > type (PERF_RECORD_DROPPED_SAMPLES), a new misc flag in the lost samples
>
> I guess "PERF_RECORD_FILTERED_SAMPLES" would be better, more precise,
> wdyt?
>
> > record or a header field. I prefer the misc flag.
>
> I think we can have both filtered and lost samples, so I would prefer
> the new record type.
I think we can have two LOST_SAMPLES records then - one with the new
misc flag for BPF and the other (without the flag) for the usual lost
samples. This would require minimal changes IMHO.
Thanks,
Namhyung
>
> > Also there should be a separate PERF_RECORD_LOST record in the middle
> > when there's an actual issue. Without that, we can say it's from the BPF.
>
> Right, disambiguating filtered from lost samples is indeed useful.
>
> - Arnaldo
>
> > Thanks,
> > Namhyung