Message-ID: <ZrFAIc0G8n0-zgxt@google.com>
Date: Mon, 5 Aug 2024 14:12:01 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Arnaldo Carvalho de Melo <acme@...nel.org>
Cc: Ian Rogers <irogers@...gle.com>, Kan Liang <kan.liang@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>,
Adrian Hunter <adrian.hunter@...el.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
linux-perf-users@...r.kernel.org, KP Singh <kpsingh@...nel.org>,
Song Liu <song@...nel.org>, bpf@...r.kernel.org,
Stephane Eranian <eranian@...gle.com>
Subject: Re: [PATCH] perf bpf-filter: Support multiple events properly
On Mon, Aug 05, 2024 at 04:31:34PM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, Aug 05, 2024 at 11:56:56AM -0700, Namhyung Kim wrote:
> > On Mon, Aug 05, 2024 at 12:03:14PM -0300, Arnaldo Carvalho de Melo wrote:
> > > On Fri, Aug 02, 2024 at 10:37:52AM -0700, Namhyung Kim wrote:
> > > > So far it used tgid as a key to get the filter expressions in the
> > > > pinned filters map for regular users, but that won't work well if the
> > > > user has more than one filter at the same time. Let's add the event id
> > > > to the key of the filter hash map so that it can identify the right
> > > > filter expression in the BPF program.
> > > >
> > > > As the event can be inherited by child tasks, it should use the primary
> > > > id which belongs to the parent (original) event. Since evsel opens the
> > > > event for multiple CPUs and tasks, it needs to maintain a separate hash
> > > > map for the event id.
> > >
> > > I'm trying to test it now, it would be nice to have the series of events
> > > needed to test that the feature is working.
> >
> > Sure, I used the following command.
> >
> > ./perf record -e cycles --filter 'ip < 0xffffffff00000000' -e instructions --filter 'period < 100000' -o- ./perf test -w noploop | ./perf script -i-
>
> Thanks
>
> > >
> > > Some comments below.
> > >
> > > > In the user space, it keeps a list of the multiple evsels and releases
> > > > the entries in both hash maps when it closes the event.
> > > >
> > > > Signed-off-by: Namhyung Kim <namhyung@...nel.org>
> > > > ---
> > > > tools/perf/util/bpf-filter.c | 288 ++++++++++++++++---
> > > > tools/perf/util/bpf_skel/sample-filter.h | 11 +-
> > > > tools/perf/util/bpf_skel/sample_filter.bpf.c | 42 ++-
> > > > tools/perf/util/bpf_skel/vmlinux/vmlinux.h | 5 +
> > > > 4 files changed, 304 insertions(+), 42 deletions(-)
> > > >
> > > > diff --git a/tools/perf/util/bpf-filter.c b/tools/perf/util/bpf-filter.c
> > > > index c5eb0b7eec19..69b147cba969 100644
> > > > --- a/tools/perf/util/bpf-filter.c
> > > > +++ b/tools/perf/util/bpf-filter.c
> > > > @@ -1,4 +1,45 @@
> > > > /* SPDX-License-Identifier: GPL-2.0 */
> > > > +/**
> > > > + * Generic event filter for sampling events in BPF.
> > > > + *
> > > > + * The BPF program is fixed; it just reads filter expressions from the 'filters'
> > > > + * map and compares the sample data in order to reject samples that don't match.
> > > > + * Each filter expression contains a sample flag (term) to compare, an operation
> > > > + * (==, >=, and so on) and a value.
> > > > + *
> > > > + * Note that each entry has an array of filter repxressions and it only succeeds
> > >
> > > expressions
> >
> > Oops, thanks.
> >
> > >
> > > > + * when all of the expressions are satisfied. But it supports logical OR using
> > > > + * a GROUP operation, which is satisfied when any of its member expressions
> > > > + * evaluates to true. Nested GROUP operations are not allowed for now.
> > > > + *
> > > > + * To support non-root users, the filters map can be loaded and pinned in the BPF
> > > > + * filesystem by root (perf record --setup-filter pin). Then each user will get
> > > > + * a new entry in the shared filters map to fill the filter expressions. And the
> > > > + * BPF program will find the filter using (task-id, event-id) as a key.
> > > > + *
> > > > + * The pinned BPF object (shared for regular users) has:
> > > > + *
> > > > + * event_hash |
> > > > + * | | |
> > > > + * event->id ---> | id | ---+ idx_hash | filters
> > > > + * | | | | | | | |
> > > > + * | .... | +-> | idx | --+--> | exprs | ---> perf_bpf_filter_entry[]
> > > > + * | | | | | | .op
> > > > + * task id (tgid) --------------+ | .... | | | ... | .term (+ part)
> > > > + * | .value
> > > > + * |
> > > > + * ======= (root would skip this part) ======== (compares it in a loop)
> > > > + *
> > > > + * This is used for per-task use cases, while system-wide profiling (normally
> > > > + * from the root user) uses its own separate copy of the program and the maps
> > > > + * so that it can proceed even if many non-root users are using the filters at
> > > > + * the same time. In that case the filters map has a single entry and there is
> > > > + * no need to use the hash maps to get the index (key) into the filters map
> > > > + * (IOW it's always 0).
> > > > + *
> > > > + * The BPF program returns 1 to accept the sample or 0 to drop it.
> > > > + * The 'dropped' map keeps track of how many samples were dropped by the
> > > > + * filter, and that count will be reported as lost samples.
> > >
> > > I think there is value in reporting how many were filtered out, I'm just
> > > unsure about reporting it as "lost" samples, as lost has another
> > > semantic associated, i.e. ring buffer was full or couldn't process it
> > > for some other resource starvation issue, no?
> >
> > Then we need a way to save the information. It could be a new record
> > type (PERF_RECORD_DROPPED_SAMPLES), a new misc flag in the lost samples
>
> I guess "PERF_RECORD_FILTERED_SAMPLES" would be better, more precise,
> wdyt?
>
> > record or a header field. I prefer the misc flag.
>
> I think we can have both filtered and lost samples, so I would prefer
> the new record type.
I think we can have two LOST_SAMPLES records then - one with the new
misc flag for BPF and the other (without the flag) for the usual lost
samples. This would require minimal changes IMHO.
Thanks,
Namhyung
>
> > Also there should be a separate PERF_RECORD_LOST record in the middle
> > when there's an actual issue. Without that, we can say it's from the BPF.
>
> Right, disambiguating filtered from lost samples is indeed useful.
>
> - Arnaldo
>
> > Thanks,
> > Namhyung