[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAH0uvojWQz_K5S7nLavGW6K6mU6Y7TKoZAWpcut=AgbXtr6Wng@mail.gmail.com>
Date: Wed, 4 Sep 2024 14:11:35 -0700
From: Howard Chu <howardchu95@...il.com>
To: Arnaldo Carvalho de Melo <acme@...nel.org>
Cc: adrian.hunter@...el.com, irogers@...gle.com, jolsa@...nel.org,
kan.liang@...ux.intel.com, namhyung@...nel.org,
linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org,
Arnaldo Carvalho de Melo <acme@...hat.com>
Subject: Re: [PATCH v3 6/8] perf trace: Collect augmented data using BPF
Hello Arnaldo,
On Wed, Sep 4, 2024 at 12:53 PM Arnaldo Carvalho de Melo
<acme@...nel.org> wrote:
>
> On Sun, Aug 25, 2024 at 12:33:20AM +0800, Howard Chu wrote:
> > Include trace_augment.h for TRACE_AUG_MAX_BUF, so that BPF reads
> > TRACE_AUG_MAX_BUF bytes of buffer maximum.
> >
> > Determine what type of argument and how many bytes to read from user space, us ing the
> > value in the beauty_map. This is the relation of parameter type and its corres ponding
> > value in the beauty map, and how many bytes we read eventually:
> >
> > string: 1 -> size of string (till null)
> > struct: size of struct -> size of struct
> > buffer: -1 * (index of paired len) -> value of paired len (maximum: TRACE_AUG_ MAX_BUF)
> >
> > After reading from user space, we output the augmented data using
> > bpf_perf_event_output().
> >
> > If the struct augmenter, augment_sys_enter() failed, we fall back to
> > using bpf_tail_call().
> >
> > I have to make the payload 6 times the size of augmented_arg, to pass the
> > BPF verifier.
> >
> > Committer notes:
> >
> > It works, but we need to wire it up to the userspace specialized pretty
> > printer, otherwise we get things like:
> >
> > root@...ber:~# perf trace -e connect ssh localhost
> > 0.000 ( 0.010 ms): :784442/784442 connect(fd: 3, uservaddr: {2,}, addrlen: 16) = 0
> > 0.016 ( 0.006 ms): :784442/784442 connect(fd: 3, uservaddr: {10,}, addrlen: 28) = 0
> > 0.033 ( 0.096 ms): :784442/784442 connect(fd: 3, uservaddr: {10,}, addrlen: 28) = 0
> > root@...alhost's password: 71.292 ( 0.037 ms): ssh/784442 connect(fd: 4, uservaddr: {1,{['/','v','a','r','/','r','u','n','/','.','h','e','i','m',],},}, addrlen: 110) = 0
> > 72.087 ( 0.013 ms): ssh/784442 connect(fd: 4, uservaddr: {1,{['/','v','a','r','/','r','u','n','/','.','h','e','i','m',],},}, addrlen: 110) = 0
> >
> > root@...ber:~#
> >
> > When we used to have:
> >
> > root@...ber:~# perf trace -e connect ssh localhost
> > 0.000 ( 0.011 ms): ssh/786564 connect(fd: 3, uservaddr: { .family: INET, port: 22, addr: 127.0.0.1 }, addrlen: 16) = 0
> > 0.017 ( 0.006 ms): ssh/786564 connect(fd: 3, uservaddr: { .family: INET6, port: 22, addr: ::1 }, addrlen: 28) = 0
> > 0.263 ( 0.043 ms): ssh/786564 connect(fd: 3, uservaddr: { .family: INET6, port: 22, addr: ::1 }, addrlen: 28) = 0
> > 63.770 ( 0.044 ms): ssh/786564 connect(fd: 4, uservaddr: { .family: LOCAL, path: /var/run/.heim_org.h5l.kcm-socket }, addrlen: 110) = 0
> > 65.467 ( 0.042 ms): ssh/786564 connect(fd: 4, uservaddr: { .family: LOCAL, path: /var/run/.heim_org.h5l.kcm-socket }, addrlen: 110) = 0
> > root@...alhost's password:
> >
> > That is closer to what strace produces:
> >
> > root@...ber:~# strace -e connect ssh localhost
> > connect(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> > connect(3, {sa_family=AF_INET6, sin6_port=htons(22), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
> > connect(3, {sa_family=AF_INET6, sin6_port=htons(22), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
> > connect(4, {sa_family=AF_UNIX, sun_path="/var/run/.heim_org.h5l.kcm-socket"}, 110) = 0
> > connect(4, {sa_family=AF_UNIX, sun_path="/var/run/.heim_org.h5l.kcm-socket"}, 110) = 0
> > root@...alhost's password:
> >
> > Signed-off-by: Howard Chu <howardchu95@...il.com>
> > Tested-by: Arnaldo Carvalho de Melo <acme@...hat.com>
> > Cc: Adrian Hunter <adrian.hunter@...el.com>
> > Cc: Ian Rogers <irogers@...gle.com>
> > Cc: Jiri Olsa <jolsa@...nel.org>
> > Cc: Kan Liang <kan.liang@...ux.intel.com>
> > Cc: Namhyung Kim <namhyung@...nel.org>
> > Link: https://lore.kernel.org/r/20240815013626.935097-10-howardchu95@gmail.com
> > Signed-off-by: Arnaldo Carvalho de Melo <acme@...hat.com>
> > ---
> > .../bpf_skel/augmented_raw_syscalls.bpf.c | 114 +++++++++++++++++-
> > 1 file changed, 113 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> > index 0acbd74e8c76..f29a8dfca044 100644
> > --- a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> > +++ b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> > @@ -7,6 +7,8 @@
> > */
> >
> > #include "vmlinux.h"
> > +#include "../trace_augment.h"
> > +
> > #include <bpf/bpf_helpers.h>
> > #include <linux/limits.h>
> >
> > @@ -124,6 +126,25 @@ struct augmented_args_tmp {
> > __uint(max_entries, 1);
> > } augmented_args_tmp SEC(".maps");
> >
> > +struct beauty_payload_enter {
> > + struct syscall_enter_args args;
> > + struct augmented_arg aug_args[6];
> > +};
>
> ⬢[acme@...lbox perf-tools-next]$ pahole -C augmented_arg /tmp/build/perf-tools-next/util/bpf_skel/.tmp/augmented_raw_syscalls.bpf.o
> struct augmented_arg {
> unsigned int size; /* 0 4 */
> int err; /* 4 4 */
> char value[4096]; /* 8 4096 */
>
> /* size: 4104, cachelines: 65, members: 3 */
> /* last cacheline: 8 bytes */
> };
>
> ⬢[acme@...lbox perf-tools-next]$
>
> ⬢[acme@...lbox perf-tools-next]$ pahole -C syscall_enter_args /tmp/build/perf-tools-next/util/bpf_skel/.tmp/augmented_raw_syscalls.bpf.o
> struct syscall_enter_args {
> unsigned long long common_tp_fields; /* 0 8 */
> long syscall_nr; /* 8 8 */
> unsigned long args[6]; /* 16 48 */
>
> /* size: 64, cachelines: 1, members: 3 */
> };
>
> So the entry has the theoretical max limit for the augmented payload
> which would be 6 * sizeof(struct augmented_arg) + the common header for
> tracepoints (unaugmented), a lot of space, but...
Yes, if I don't use this much space, and try to do a flexible payload
length, I won't be able to pass the BPF verifier. I will try to fix
this wasting problem.
>
> > +static int augment_sys_enter(void *ctx, struct syscall_enter_args *args)
> > +{
> > + bool augmented, do_output = false;
> > + int zero = 0, size, aug_size, index, output = 0,
> > + value_size = sizeof(struct augmented_arg) - offsetof(struct augmented_arg, value);
> > + unsigned int nr, *beauty_map;
> > + struct beauty_payload_enter *payload;
> > + void *arg, *payload_offset;
> > +
> > + /* fall back to do predefined tail call */
> > + if (args == NULL)
> > + return 1;
> > +
> > + /* use syscall number to get beauty_map entry */
> > + nr = (__u32)args->syscall_nr;
> > + beauty_map = bpf_map_lookup_elem(&beauty_map_enter, &nr);
>
> If we don't set up the entry for this syscall, then there are no
> pointers to collect, we return early and the tracepoint will not filter
> it and there it goes, up unmodified, if not already filtered by a
> tracepoint filter, ok, I'll add these comments here.
Thank you.
>
> > + /* set up payload for output */
> > + payload = bpf_map_lookup_elem(&beauty_payload_enter_map, &zero);
> > + payload_offset = (void *)&payload->aug_args;
> > > + if (beauty_map == NULL || payload == NULL)
> > + return 1;
>
> We can save the lookup for payload if the one for beauty_map returned
> NULL, I'll do this fixup.
Thanks that makes sense.
>
> > + /* copy the sys_enter header, which has the syscall_nr */
> > + __builtin_memcpy(&payload->args, args, sizeof(struct syscall_enter_args));
>
>
> > + /*
> > + * Determine what type of argument and how many bytes to read from user space, using the
> > + * value in the beauty_map. This is the relation of parameter type and its corresponding
> > + * value in the beauty map, and how many bytes we read eventually:
> > + *
> > + * string: 1 -> size of string
> > + * struct: size of struct -> size of struct
> > + * buffer: -1 * (index of paired len) -> value of paired len (maximum: TRACE_AUG_MAX_BUF)
>
> 'paired_len'? Ok, 1, size of string or struct
I mean the buffer always comes with a size indicator. For instance
read syscall's 'buf' and 'count', in this case 'count' will pair with
'buf'. Sorry for the confusion. See below:
read: (unsigned int fd, char *buf, size_t count)
>
>
> Will continue the detailed analysis later, as I need to change the
> existing BPF collector, before applying this, to match the protocol here
> (that will always have the size of each argument encoded in the ring
> buffer, easier, but uses more space).
Thank you for adding the protocol data size fix.
>
> - Arnaldo
>
> > + */
> > + for (int i = 0; i < 6; i++) {
> > + arg = (void *)args->args[i];
> > + augmented = false;
> > + size = beauty_map[i];
> > + aug_size = size; /* size of the augmented data read from user space */
> > +
> > + if (size == 0 || arg == NULL)
> > + continue;
> > +
> > + if (size == 1) { /* string */
> > + aug_size = bpf_probe_read_user_str(((struct augmented_arg *)payload_offset)->value, value_size, arg);
> > + /* minimum of 0 to pass the verifier */
> > + if (aug_size < 0)
> > + aug_size = 0;
> > +
> > + augmented = true;
> > + } else if (size > 0 && size <= value_size) { /* struct */
> > + if (!bpf_probe_read_user(((struct augmented_arg *)payload_offset)->value, size, arg))
> > + augmented = true;
> > + } else if (size < 0 && size >= -6) { /* buffer */
> > + index = -(size + 1);
> > + aug_size = args->args[index];
> > +
> > + if (aug_size > TRACE_AUG_MAX_BUF)
> > + aug_size = TRACE_AUG_MAX_BUF;
> > +
> > + if (aug_size > 0) {
> > + if (!bpf_probe_read_user(((struct augmented_arg *)payload_offset)->value, aug_size, arg))
> > + augmented = true;
> > + }
> > + }
> > +
> > + /* write data to payload */
> > + if (augmented) {
> > + int written = offsetof(struct augmented_arg, value) + aug_size;
> > +
> > + ((struct augmented_arg *)payload_offset)->size = aug_size;
> > + output += written;
> > + payload_offset += written;
> > + do_output = true;
> > + }
> > + }
> > +
> > + if (!do_output)
> > + return 1;
> > +
> > + return augmented__beauty_output(ctx, payload, sizeof(struct syscall_enter_args) + output);
>
> We end up pushing up to the ring buffer all the way to
Thanks,
Howard
Powered by blists - more mailing lists