Message-ID: <CAM9d7cgB-iw0phVTb9EH5ps=1+PQM5nU1fTLMfQx=Kj31W-8Jw@mail.gmail.com>
Date: Thu, 19 May 2022 14:01:25 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Ian Rogers <irogers@...gle.com>
Cc: Arnaldo Carvalho de Melo <acme@...nel.org>,
Jiri Olsa <jolsa@...nel.org>, Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
LKML <linux-kernel@...r.kernel.org>,
Andi Kleen <ak@...ux.intel.com>,
Song Liu <songliubraving@...com>, Hao Luo <haoluo@...gle.com>,
Milian Wolff <milian.wolff@...b.com>,
bpf <bpf@...r.kernel.org>,
linux-perf-users <linux-perf-users@...r.kernel.org>,
Blake Jones <blakejones@...gle.com>
Subject: Re: [PATCH 2/6] perf record: Enable off-cpu analysis with BPF
Hi Ian,
On Wed, May 18, 2022 at 8:58 PM Ian Rogers <irogers@...gle.com> wrote:
>
> On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@...nel.org> wrote:
> >
> > Add a --off-cpu option to enable off-cpu profiling with BPF. It
> > uses a bpf_output event renamed to "offcpu-time". Samples will
> > be synthesized at the end of the record session using data from a BPF
> > map which contains the aggregated off-cpu time at context switches.
> > So it needs root privilege to do off-cpu profiling.
> >
> > Each sample will have a separate user stacktrace so it will skip
> > kernel threads. The sample ip will be set from the stacktrace and
> > other sample data will be updated accordingly. Currently it only
> > handles some basic sample types.
> >
> > The sample timestamp is set to a dummy value so as not to interfere
> > with other events during sorting. It starts from a very large initial
> > value and is incremented for each processed sample.
> >
> > A good thing is that it can be used together with regular profiling
> > like cpu cycles. If you don't want that, you can use a dummy event
> > to enable off-cpu profiling only.
> >
> > Example output:
> > $ sudo perf record --off-cpu perf bench sched messaging -l 1000
> >
> > $ sudo perf report --stdio --call-graph=no
> > # Total Lost Samples: 0
> > #
> > # Samples: 41K of event 'cycles'
> > # Event count (approx.): 42137343851
> > ...
> >
> > # Samples: 1K of event 'offcpu-time'
> > # Event count (approx.): 587990831640
> > #
> > # Children Self Command Shared Object Symbol
> > # ........ ........ ............... .................. .........................
> > #
> > 81.66% 0.00% sched-messaging libc-2.33.so [.] __libc_start_main
> > 81.66% 0.00% sched-messaging perf [.] cmd_bench
> > 81.66% 0.00% sched-messaging perf [.] main
> > 81.66% 0.00% sched-messaging perf [.] run_builtin
> > 81.43% 0.00% sched-messaging perf [.] bench_sched_messaging
> > 40.86% 40.86% sched-messaging libpthread-2.33.so [.] __read
> > 37.66% 37.66% sched-messaging libpthread-2.33.so [.] __write
> > 2.91% 2.91% sched-messaging libc-2.33.so [.] __poll
> > ...
> >
> > As you can see, it spent most of its off-cpu time in read and write
> > in bench_sched_messaging(). The --call-graph=no option was added just
> > to keep the output concise here.
> >
> > It uses perf hooks facility to control BPF program during the record
> > session rather than adding new BPF/off-cpu specific calls.
> >
> > Signed-off-by: Namhyung Kim <namhyung@...nel.org>
>
> Acked-by: Ian Rogers <irogers@...gle.com>
>
> > ---
> > tools/perf/Documentation/perf-record.txt | 10 ++
> > tools/perf/Makefile.perf | 1 +
> > tools/perf/builtin-record.c | 25 +++
> > tools/perf/util/Build | 1 +
> > tools/perf/util/bpf_off_cpu.c | 204 +++++++++++++++++++++++
> > tools/perf/util/bpf_skel/off_cpu.bpf.c | 139 +++++++++++++++
> > tools/perf/util/off_cpu.h | 24 +++
> > 7 files changed, 404 insertions(+)
> > create mode 100644 tools/perf/util/bpf_off_cpu.c
> > create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
> > create mode 100644 tools/perf/util/off_cpu.h
> >
> > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > index 465be4e62a17..b4e9ef7edfef 100644
> > --- a/tools/perf/Documentation/perf-record.txt
> > +++ b/tools/perf/Documentation/perf-record.txt
> > @@ -758,6 +758,16 @@ include::intel-hybrid.txt[]
> > If the URLs is not specified, the value of DEBUGINFOD_URLS
> > system environment variable is used.
> >
> > +--off-cpu::
> > + Enable off-cpu profiling with BPF. The BPF program will collect
> > + task scheduling information with (user) stacktrace and save them
> > + as sample data of a software event named "offcpu-time". The
> > + sample period will have the time the task slept in nanoseconds.
> > +
> > + Note that BPF can collect stack traces using frame pointer ("fp")
> > + only, as of now. So the applications built without the frame
> > + pointer might see bogus addresses.
> > +
> > SEE ALSO
> > --------
> > linkperf:perf-stat[1], linkperf:perf-list[1], linkperf:perf-intel-pt[1]
> > diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> > index 6e5aded855cc..8f738e11356d 100644
> > --- a/tools/perf/Makefile.perf
> > +++ b/tools/perf/Makefile.perf
> > @@ -1038,6 +1038,7 @@ SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
> > SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
> > SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
> > SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h
> > +SKELETONS += $(SKEL_OUT)/off_cpu.skel.h
> >
> > $(SKEL_TMP_OUT) $(LIBBPF_OUTPUT):
> > $(Q)$(MKDIR) -p $@
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index a5cf6a99d67f..91f88501412e 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -49,6 +49,7 @@
> > #include "util/clockid.h"
> > #include "util/pmu-hybrid.h"
> > #include "util/evlist-hybrid.h"
> > +#include "util/off_cpu.h"
> > #include "asm/bug.h"
> > #include "perf.h"
> > #include "cputopo.h"
> > @@ -162,6 +163,7 @@ struct record {
> > bool buildid_mmap;
> > bool timestamp_filename;
> > bool timestamp_boundary;
> > + bool off_cpu;
> > struct switch_output switch_output;
> > unsigned long long samples;
> > unsigned long output_max_size; /* = 0: unlimited */
> > @@ -903,6 +905,11 @@ static int record__config_text_poke(struct evlist *evlist)
> > return 0;
> > }
> >
> > +static int record__config_off_cpu(struct record *rec)
> > +{
> > + return off_cpu_prepare(rec->evlist);
> > +}
> > +
> > static bool record__kcore_readable(struct machine *machine)
> > {
> > char kcore[PATH_MAX];
> > @@ -2600,6 +2607,9 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> > } else
> > status = err;
> >
> > + if (rec->off_cpu)
> > + rec->bytes_written += off_cpu_write(rec->session);
> > +
> > record__synthesize(rec, true);
> > /* this will be recalculated during process_buildids() */
> > rec->samples = 0;
> > @@ -3324,6 +3334,7 @@ static struct option __record_options[] = {
> > OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
> > "write collected trace data into several data files using parallel threads",
> > record__parse_threads),
> > + OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
> > OPT_END()
> > };
> >
> > @@ -3743,6 +3754,12 @@ int cmd_record(int argc, const char **argv)
> > set_nobuild('\0', "vmlinux", true);
> > # undef set_nobuild
> > # undef REASON
> > +#endif
> > +
> > +#ifndef HAVE_BPF_SKEL
> > +# define set_nobuild(s, l, m, c) set_option_nobuild(record_options, s, l, m, c)
> > + set_nobuild('\0', "off-cpu", "no BUILD_BPF_SKEL=1", true);
> > +# undef set_nobuild
> > #endif
> >
> > rec->opts.affinity = PERF_AFFINITY_SYS;
> > @@ -3981,6 +3998,14 @@ int cmd_record(int argc, const char **argv)
> > }
> > }
> >
> > + if (rec->off_cpu) {
> > + err = record__config_off_cpu(rec);
> > + if (err) {
> > + pr_err("record__config_off_cpu failed, error %d\n", err);
> > + goto out;
> > + }
> > + }
> > +
> > if (record_opts__config(&rec->opts)) {
> > err = -EINVAL;
> > goto out;
> > diff --git a/tools/perf/util/Build b/tools/perf/util/Build
> > index 9a7209a99e16..a51267d88ca9 100644
> > --- a/tools/perf/util/Build
> > +++ b/tools/perf/util/Build
> > @@ -147,6 +147,7 @@ perf-$(CONFIG_LIBBPF) += bpf_map.o
> > perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
> > perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
> > perf-$(CONFIG_PERF_BPF_SKEL) += bpf_ftrace.o
> > +perf-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o
> > perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
> > perf-$(CONFIG_LIBELF) += symbol-elf.o
> > perf-$(CONFIG_LIBELF) += probe-file.o
> > diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> > new file mode 100644
> > index 000000000000..9ed7aca3f4ac
> > --- /dev/null
> > +++ b/tools/perf/util/bpf_off_cpu.c
> > @@ -0,0 +1,204 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "util/bpf_counter.h"
> > +#include "util/debug.h"
> > +#include "util/evsel.h"
> > +#include "util/evlist.h"
> > +#include "util/off_cpu.h"
> > +#include "util/perf-hooks.h"
> > +#include "util/session.h"
> > +#include <bpf/bpf.h>
> > +
> > +#include "bpf_skel/off_cpu.skel.h"
> > +
> > +#define MAX_STACKS 32
> > +/* we don't need actual timestamp, just want to put the samples at last */
> > +#define OFF_CPU_TIMESTAMP (~0ull << 32)
> > +
> > +static struct off_cpu_bpf *skel;
> > +
> > +struct off_cpu_key {
> > + u32 pid;
> > + u32 tgid;
> > + u32 stack_id;
> > + u32 state;
> > +};
> > +
> > +union off_cpu_data {
> > + struct perf_event_header hdr;
> > + u64 array[1024 / sizeof(u64)];
> > +};
> > +
> > +static int off_cpu_config(struct evlist *evlist)
> > +{
> > + struct evsel *evsel;
> > + struct perf_event_attr attr = {
> > + .type = PERF_TYPE_SOFTWARE,
> > + .config = PERF_COUNT_SW_BPF_OUTPUT,
> > + .size = sizeof(attr), /* to capture ABI version */
> > + };
> > + char *evname = strdup(OFFCPU_EVENT);
> > +
> > + if (evname == NULL)
> > + return -ENOMEM;
> > +
> > + evsel = evsel__new(&attr);
> > + if (!evsel) {
> > + free(evname);
> > + return -ENOMEM;
> > + }
> > +
> > + evsel->core.attr.freq = 1;
> > + evsel->core.attr.sample_period = 1;
> > + /* off-cpu analysis depends on stack trace */
> > + evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
> > +
> > + evlist__add(evlist, evsel);
> > +
> > + free(evsel->name);
> > + evsel->name = evname;
> > +
> > + return 0;
> > +}
> > +
> > +static void off_cpu_start(void *arg __maybe_unused)
> > +{
> > + skel->bss->enabled = 1;
> > +}
> > +
> > +static void off_cpu_finish(void *arg __maybe_unused)
> > +{
> > + skel->bss->enabled = 0;
> > + off_cpu_bpf__destroy(skel);
> > +}
> > +
> > +int off_cpu_prepare(struct evlist *evlist)
> > +{
> > + int err;
> > +
> > + if (off_cpu_config(evlist) < 0) {
> > + pr_err("Failed to config off-cpu BPF event\n");
> > + return -1;
> > + }
> > +
> > + set_max_rlimit();
> > +
> > + skel = off_cpu_bpf__open_and_load();
> > + if (!skel) {
> > + pr_err("Failed to open off-cpu BPF skeleton\n");
> > + return -1;
> > + }
> > +
> > + err = off_cpu_bpf__attach(skel);
> > + if (err) {
> > + pr_err("Failed to attach off-cpu BPF skeleton\n");
> > + goto out;
> > + }
> > +
> > + if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
> > + perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
>
> An off-topic thought here. I was looking at tool events and thinking
> that rather than have global state and the counter reading getting
> those global values, it would be nice if the tool events had private
> state that was created on open, freed on close and then did the
> appropriate thing for enable/read/disable. If we were object-oriented,
> I was thinking tool events could be a subclass of evsel that would
> override the appropriate functions. Something that strikes me as silly
> is that tool events have a dummy event file descriptor from
> perf_event_open, in part because the evsel code paths are shared.
> Anyway, if we made evsels have something like a virtual method table,
> we could do similar tricks with off-cpu and BPF created events.
> Possibly this could lead to better structured code and more reuse.
Thanks for your review.
Yeah, I think it makes sense to change the offcpu-time event into a
tool event and to skip the actual open/read/close operations. It would
be good follow-up work.
Using a dummy event still has syscall overhead, and tool events
don't need the syscalls. Maybe we can add some logic in the evlist
code to skip the tool events.
Thanks,
Namhyung