Message-ID: <CAH0uvojJPirC6OYTbfjj_iS3mErksySjAE7z7Nu6E3CSeOc_6Q@mail.gmail.com>
Date: Fri, 30 May 2025 16:20:18 -0700
From: Howard Chu <howardchu95@...il.com>
To: Namhyung Kim <namhyung@...nel.org>
Cc: acme@...nel.org, mingo@...hat.com, mark.rutland@....com,
alexander.shishkin@...ux.intel.com, jolsa@...nel.org, irogers@...gle.com,
adrian.hunter@...el.com, peterz@...radead.org, kan.liang@...ux.intel.com,
linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 1/2] perf trace: Enhance task filtering
Hello Namhyung,
On Fri, May 30, 2025 at 2:31 PM Namhyung Kim <namhyung@...nel.org> wrote:
>
> Hi Howard,
>
> On Thu, May 29, 2025 at 11:24:07PM -0700, Howard Chu wrote:
> > This patch does two things:
> > 1. Add a pids_targeted map and fill it with the pids that perf trace
> >    is interested in.
> > 2. Make the bpf-output event system-wide.
> >
> > Effect 1:
> > perf trace doesn't augment threads properly. With the program below:
> >
> > Program test_trace_loop.c
> > ~~~
> > #include <pthread.h>
> > #include <stdio.h>
> > #include <unistd.h>
> > #include <stdlib.h>
> >
> > #define THREAD_NR 2
> >
> > struct thread_arg {
> > int index;
> > };
> >
> > void *func(void *arg) {
> > struct thread_arg *t_arg = arg;
> > while (1) {
> > printf("thread %d running\n", t_arg->index);
> > sleep(1);
> > }
> > return NULL;
> > }
> >
> > int main()
> > {
> > pthread_t thread_ids[THREAD_NR];
> > struct thread_arg thread_args[THREAD_NR];
> >
> > for (int i = 0; i < THREAD_NR; i++) {
> > thread_args[i].index = i;
> > if (pthread_create(&thread_ids[i], NULL, &func, &thread_args[i])) {
> > perror("failed to create thread, exiting\n");
> > exit(1);
> > }
> > }
> >
> > while (1) {
> > printf("parent sleeping\n");
> > sleep(1);
> > }
> >
> > for (int i = 0; i < THREAD_NR; i++)
> > pthread_join(thread_ids[i], NULL);
> >
> > return 0;
> > }
> > ~~~
> >
> > Commands
> > ~~~
> > $ gcc test_trace_loop.c -o test_trace_loop
> >
> > $ ./test_trace_loop &
> > [1] 1404183
> >
> > $ pstree 1404183 -p
> > test_trace_loop(1404183)─┬─{test_trace_loop}(1404185)
> > └─{test_trace_loop}(1404186)
> >
> > $ sudo perf trace -p 1404183 -e *sleep
> > ~~~
> >
> > Output
> > before:
> > $ sudo /tmp/perf/perf trace -p 1404183 -e *sleep
> > ? ( ): test_trace_loo/1404186 ... [continued]: clock_nanosleep()) = 0
> > ? ( ): test_trace_loo/1404183 ... [continued]: clock_nanosleep()) = 0
> > 0.119 ( ): test_trace_loo/1404186 clock_nanosleep(rqtp: 0x7a86061fde60, rmtp: 0x7a86061fde60) ...
> > ? ( ): test_trace_loo/1404185 ... [continued]: clock_nanosleep()) = 0
> > 0.047 ( ): test_trace_loo/1404183 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd89091450) ...
> > 0.047 (1000.127 ms): test_trace_loo/1404183 ... [continued]: clock_nanosleep()) = 0
> >
> > explanation: only the parent thread 1404183 got augmented
> >
> > after:
> > $ sudo /tmp/perf/perf trace -p 1404183 -e *sleep
> > ? ( ): test_trace_loo/1404183 ... [continued]: clock_nanosleep()) = 0
> > ? ( ): test_trace_loo/1404186 ... [continued]: clock_nanosleep()) = 0
> > 0.147 ( ): test_trace_loo/1404186 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7a86061fde60) ...
> > ? ( ): test_trace_loo/1404185 ... [continued]: clock_nanosleep()) = 0
> > 0.076 ( ): test_trace_loo/1404183 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd89091450) ...
> > 0.076 (1000.160 ms): test_trace_loo/1404183 ... [continued]: clock_nanosleep()) = 0
> > 0.147 (1000.090 ms): test_trace_loo/1404186 ... [continued]: clock_nanosleep()) = 0
> > 2.557 ( ): test_trace_loo/1404185 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7a86069fee60) ...
> > 1000.323 ( ): test_trace_loo/1404186 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7a86061fde60) ...
> > 2.557 (1000.129 ms): test_trace_loo/1404185 ... [continued]: clock_nanosleep()) = 0
> > 1000.384 ( ): test_trace_loo/1404183 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd89091450) ...
> >
> > explanation: all threads augmented
> >
> > Effect 2: perf trace no longer collects syscall argument data for
> > *ALL* pids only to throw it away later. Uninteresting pids now get
> > filtered right away in the BPF program, so there should be a
> > performance advantage.
>
> Thanks for doing this!
Thank you for reviewing this patch.
>
> >
> > Signed-off-by: Howard Chu <howardchu95@...il.com>
> > ---
> > tools/perf/builtin-trace.c | 52 ++++++++++++++++---
> > .../bpf_skel/augmented_raw_syscalls.bpf.c | 35 ++++++++++---
> > tools/perf/util/evlist.c | 2 +-
> > 3 files changed, 73 insertions(+), 16 deletions(-)
> >
> > diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> > index 67b557ec3b0d..11620cb40198 100644
> > --- a/tools/perf/builtin-trace.c
> > +++ b/tools/perf/builtin-trace.c
> > @@ -4377,6 +4377,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
> > unsigned long before;
> > const bool forks = argc > 0;
> > bool draining = false;
> > + bool enable_evlist = false;
> >
> > trace->live = true;
> >
> > @@ -4447,6 +4448,9 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
> > evlist__set_default_cgroup(trace->evlist, trace->cgroup);
> >
> > create_maps:
> > + if (trace->syscalls.events.bpf_output)
> > + trace->syscalls.events.bpf_output->core.system_wide = true;
> > +
> > err = evlist__create_maps(evlist, &trace->opts.target);
> > if (err < 0) {
> > fprintf(trace->output, "Problems parsing the target to trace, check your options!\n");
> > @@ -4481,20 +4485,54 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
> > goto out_error_open;
> > #ifdef HAVE_BPF_SKEL
> > if (trace->syscalls.events.bpf_output) {
> > + struct perf_evsel *perf_evsel = &trace->syscalls.events.bpf_output->core;
> > struct perf_cpu cpu;
> > + bool t = true;
> > +
> > + enable_evlist = true;
> > + if (trace->opts.target.system_wide)
> > + trace->skel->bss->system_wide = true;
> > + else
> > + trace->skel->bss->system_wide = false;
> >
> > /*
> > * Set up the __augmented_syscalls__ BPF map to hold for each
> > * CPU the bpf-output event's file descriptor.
> > */
> > - perf_cpu_map__for_each_cpu(cpu, i, trace->syscalls.events.bpf_output->core.cpus) {
> > + perf_cpu_map__for_each_cpu(cpu, i, perf_evsel->cpus) {
> > int mycpu = cpu.cpu;
> >
> > - bpf_map__update_elem(trace->skel->maps.__augmented_syscalls__,
> > - &mycpu, sizeof(mycpu),
> > - xyarray__entry(trace->syscalls.events.bpf_output->core.fd,
> > - mycpu, 0),
> > - sizeof(__u32), BPF_ANY);
> > + err = bpf_map__update_elem(trace->skel->maps.__augmented_syscalls__,
> > + &mycpu, sizeof(mycpu),
> > + xyarray__entry(perf_evsel->fd, mycpu, 0),
> > + sizeof(__u32), BPF_ANY);
> > + if (err) {
> > + pr_err("Couldn't set system-wide bpf output perf event fd"
> > + ", err: %d\n", err);
> > + goto out_disable;
> > + }
> > + }
> > +
> > + if (target__has_task(&trace->opts.target)) {
> > + struct perf_thread_map *threads = trace->evlist->core.threads;
> > +
> > + for (int thread = 0; thread < perf_thread_map__nr(threads); thread++) {
> > + pid_t pid = perf_thread_map__pid(threads, thread);
> > +
> > + err = bpf_map__update_elem(trace->skel->maps.pids_targeted, &pid,
> > + sizeof(pid), &t, sizeof(t), BPF_ANY);
> > + if (err) {
> > + pr_err("Couldn't set pids_targeted map, err: %d\n", err);
> > + goto out_disable;
> > + }
> > + }
> > + } else if (workload_pid != -1) {
> > + err = bpf_map__update_elem(trace->skel->maps.pids_targeted, &workload_pid,
> > + sizeof(workload_pid), &t, sizeof(t), BPF_ANY);
> > + if (err) {
> > + pr_err("Couldn't set pids_targeted map for workload, err: %d\n", err);
> > + goto out_disable;
> > + }
> > }
> > }
> >
> > @@ -4553,7 +4591,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
> > goto out_error_mmap;
> > }
> >
> > - if (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay)
> > + if (enable_evlist || (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay))
>
> I guess target__none() should not call evlist__enable() here.
Tracing a workload makes target__none() return true, but I need the
evlist enabled for the workload. This condition can be written more
clearly: if the bpf-output event is used, enable the evlist.
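Something like this, perhaps (a sketch only; bpf_output_in_use is a
hypothetical local standing in for the enable_evlist flag in the patch):
~~~
	/* The bpf-output event being in use is reason enough to enable
	 * the evlist, even when target__none() returns true because we
	 * are tracing a workload. */
	bool bpf_output_in_use = trace->syscalls.events.bpf_output != NULL;

	if (bpf_output_in_use ||
	    (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay))
		evlist__enable(evlist);
~~~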
>
>
> > evlist__enable(evlist);
> >
> > if (forks)
> > diff --git a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> > index e4352881e3fa..e517eec7290b 100644
> > --- a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> > +++ b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> > @@ -26,6 +26,7 @@
> > #define is_power_of_2(n) (n != 0 && ((n & (n - 1)) == 0))
> >
> > #define MAX_CPUS 4096
> > +#define MAX_PIDS 4096
> >
> > /* bpf-output associated map */
> > struct __augmented_syscalls__ {
> > @@ -113,6 +114,15 @@ struct pids_filtered {
> > __uint(max_entries, 64);
> > } pids_filtered SEC(".maps");
> >
> > +volatile bool system_wide;
> > +
> > +struct pids_targeted {
> > + __uint(type, BPF_MAP_TYPE_HASH);
> > + __type(key, pid_t);
> > + __type(value, bool);
> > + __uint(max_entries, MAX_PIDS);
> > +} pids_targeted SEC(".maps");
> > +
> > struct augmented_args_payload {
> > struct syscall_enter_args args;
> > struct augmented_arg arg, arg2; // We have to reserve space for two arguments (rename, etc)
> > @@ -145,6 +155,11 @@ struct beauty_payload_enter_map {
> > __uint(max_entries, 1);
> > } beauty_payload_enter_map SEC(".maps");
> >
> > +static pid_t getpid(void)
> > +{
> > + return bpf_get_current_pid_tgid();
> > +}
> > +
> > static inline struct augmented_args_payload *augmented_args_payload(void)
> > {
> > int key = 0;
> > @@ -418,14 +433,18 @@ int sys_enter_nanosleep(struct syscall_enter_args *args)
> > return 1; /* Failure: don't filter */
> > }
> >
> > -static pid_t getpid(void)
> > +static bool filter_pid(void)
> > {
> > - return bpf_get_current_pid_tgid();
> > -}
> > + if (system_wide)
> > + return false;
>
> Doesn't it need to check CPU list when -C option is used?
I agree, thanks.
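Something along these lines, maybe (a sketch only; cpus_targeted is a
hypothetical map that userspace would fill from the -C CPU list):
~~~
struct cpus_targeted {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, __u32);
	__type(value, bool);
	__uint(max_entries, MAX_CPUS);
} cpus_targeted SEC(".maps");

static bool filter_cpu(void)
{
	__u32 cpu = bpf_get_smp_processor_id();

	/* Filter out events on CPUs the user did not select with -C. */
	return bpf_map_lookup_elem(&cpus_targeted, &cpu) == NULL;
}
~~~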
>
> >
> > -static bool pid_filter__has(struct pids_filtered *pids, pid_t pid)
> > -{
> > - return bpf_map_lookup_elem(pids, &pid) != NULL;
> > + pid_t pid = getpid();
> > +
> > + if (bpf_map_lookup_elem(&pids_targeted, &pid) &&
> > + !bpf_map_lookup_elem(&pids_filtered, &pid))
>
> Can we just use a single map for this purpose?
pids_targeted allows certain pids, while pids_filtered excludes some
pids. I actually made a mistake here: when doing system-wide tracing,
pids_filtered should be checked too (in my code I just return if
system_wide is true). In per-task tracing, however, I think we can
squash them together: if a pid in pids_filtered is also in
pids_targeted, we keep it anyway since the user explicitly specified
it, and if not, it gets filtered regardless because it's not in
pids_targeted. So you are right about using a single map for per-task
tracing, but since pids_filtered is still needed in system-wide mode,
I don't think we can delete the map.
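For the per-task case the squashed version could look like this (a
sketch, assuming userspace only inserts pids that are targeted and not
also excluded):
~~~
static bool filter_pid(void)
{
	pid_t pid = getpid();

	/* Single map: a pid is traced iff it is present; excluded pids
	 * are simply never inserted by userspace. */
	return bpf_map_lookup_elem(&pids_targeted, &pid) == NULL;
}
~~~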
Thanks,
Howard