[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CO6PR11MB5635C7D8C91FEDA17EC4BCDEEE112@CO6PR11MB5635.namprd11.prod.outlook.com>
Date: Tue, 23 Apr 2024 22:16:50 +0000
From: "Wang, Weilin" <weilin.wang@...el.com>
To: Namhyung Kim <namhyung@...nel.org>
CC: Ian Rogers <irogers@...gle.com>, Arnaldo Carvalho de Melo
<acme@...nel.org>, Peter Zijlstra <peterz@...radead.org>, Ingo Molnar
<mingo@...hat.com>, Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, "Hunter, Adrian" <adrian.hunter@...el.com>, Kan
Liang <kan.liang@...ux.intel.com>, "linux-perf-users@...r.kernel.org"
<linux-perf-users@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "Taylor, Perry" <perry.taylor@...el.com>,
"Alt, Samantha" <samantha.alt@...el.com>, "Biggers, Caleb"
<caleb.biggers@...el.com>
Subject: RE: [RFC PATCH v6 2/5] perf stat: Fork and launch perf record when
perf stat needs to get retire latency value for a metric.
> -----Original Message-----
> From: Namhyung Kim <namhyung@...nel.org>
> Sent: Tuesday, April 23, 2024 1:59 PM
> To: Wang, Weilin <weilin.wang@...el.com>
> Cc: Ian Rogers <irogers@...gle.com>; Arnaldo Carvalho de Melo
> <acme@...nel.org>; Peter Zijlstra <peterz@...radead.org>; Ingo Molnar
> <mingo@...hat.com>; Alexander Shishkin
> <alexander.shishkin@...ux.intel.com>; Jiri Olsa <jolsa@...nel.org>; Hunter,
> Adrian <adrian.hunter@...el.com>; Kan Liang <kan.liang@...ux.intel.com>;
> linux-perf-users@...r.kernel.org; linux-kernel@...r.kernel.org; Taylor, Perry
> <perry.taylor@...el.com>; Alt, Samantha <samantha.alt@...el.com>; Biggers,
> Caleb <caleb.biggers@...el.com>
> Subject: Re: [RFC PATCH v6 2/5] perf stat: Fork and launch perf record when
> perf stat needs to get retire latency value for a metric.
>
> On Mon, Apr 1, 2024 at 2:23 PM Wang, Weilin <weilin.wang@...el.com>
> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Namhyung Kim <namhyung@...nel.org>
> > > Sent: Monday, April 1, 2024 1:58 PM
> > > To: Wang, Weilin <weilin.wang@...el.com>
> > > Cc: Ian Rogers <irogers@...gle.com>; Arnaldo Carvalho de Melo
> > > <acme@...nel.org>; Peter Zijlstra <peterz@...radead.org>; Ingo Molnar
> > > <mingo@...hat.com>; Alexander Shishkin
> > > <alexander.shishkin@...ux.intel.com>; Jiri Olsa <jolsa@...nel.org>; Hunter,
> > > Adrian <adrian.hunter@...el.com>; Kan Liang <kan.liang@...ux.intel.com>;
> > > linux-perf-users@...r.kernel.org; linux-kernel@...r.kernel.org; Taylor,
> Perry
> > > <perry.taylor@...el.com>; Alt, Samantha <samantha.alt@...el.com>;
> Biggers,
> > > Caleb <caleb.biggers@...el.com>
> > > Subject: Re: [RFC PATCH v6 2/5] perf stat: Fork and launch perf record
> when
> > > perf stat needs to get retire latency value for a metric.
> > >
> > > On Fri, Mar 29, 2024 at 12:12 PM <weilin.wang@...el.com> wrote:
> > > >
> > > > From: Weilin Wang <weilin.wang@...el.com>
> > > >
> > > > When retire_latency value is used in a metric formula, perf stat would
> fork a
> > > > perf record process with "-e" and "-W" options. Perf record will collect
> > > > required retire_latency values in parallel while perf stat is collecting
> > > > counting values.
> > > >
> > > > At the point of time that perf stat stops counting, it would send sigterm
> > > signal
> > > > to perf record process and receiving sampling data back from perf record
> > > from a
> > > > pipe. Perf stat will then process the received data to get retire latency
> data
> > > > and calculate metric result.
> > > >
> > > > Another thread is required to synchronize between perf stat and perf
> record
> > > > when we pass data through pipe.
> > > >
> > > > Signed-off-by: Weilin Wang <weilin.wang@...el.com>
> > > > Reviewed-by: Ian Rogers <irogers@...gle.com>
> > > > ---
> > > > tools/perf/builtin-stat.c | 190
> > > +++++++++++++++++++++++++++++++++-
> > > > tools/perf/util/data.c | 6 +-
> > > > tools/perf/util/metricgroup.h | 8 ++
> > > > tools/perf/util/stat.h | 2 +
> > > > 4 files changed, 203 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> > > > index 6291e1e24535..7fbe47b0c44c 100644
> > > > --- a/tools/perf/builtin-stat.c
> > > > +++ b/tools/perf/builtin-stat.c
> > > > @@ -94,8 +94,13 @@
> > > > #include <perf/evlist.h>
> > > > #include <internal/threadmap.h>
> > > >
> > > > +#include "util/sample.h"
> > > > +#include <sys/param.h>
> > > > +#include <subcmd/run-command.h>
> > > > +
> > > > #define DEFAULT_SEPARATOR " "
> > > > #define FREEZE_ON_SMI_PATH "devices/cpu/freeze_on_smi"
> > > > +#define PERF_DATA "-"
> > > >
> > > > static void print_counters(struct timespec *ts, int argc, const char
> **argv);
> > > >
> > > > @@ -163,6 +168,8 @@ static struct perf_stat_config stat_config = {
> > > > .ctl_fd_ack = -1,
> > > > .iostat_run = false,
> > > > .tpebs_events = LIST_HEAD_INIT(stat_config.tpebs_events),
> > > > + .tpebs_results = LIST_HEAD_INIT(stat_config.tpebs_results),
> > > > + .tpebs_pid = -1,
> > > > };
> > > >
> > > > static bool cpus_map_matched(struct evsel *a, struct evsel *b)
> > > > @@ -684,15 +691,155 @@ static enum counter_recovery
> > > stat_handle_error(struct evsel *counter)
> > > >
> > > > if (child_pid != -1)
> > > > kill(child_pid, SIGTERM);
> > > > + if (stat_config.tpebs_pid != -1)
> > > > + kill(stat_config.tpebs_pid, SIGTERM);
> > > > return COUNTER_FATAL;
> > > > }
> > > >
> > > > -static int __run_perf_record(void)
> > > > +static int __run_perf_record(const char **record_argv)
> > > > {
> > > > + int i = 0;
> > > > + struct tpebs_event *e;
> > > > +
> > > > pr_debug("Prepare perf record for retire_latency\n");
> > > > +
> > > > + record_argv[i++] = "perf";
> > > > + record_argv[i++] = "record";
> > > > + record_argv[i++] = "-W";
> > > > + record_argv[i++] = "--synth=no";
> > > > +
> > > > + if (stat_config.user_requested_cpu_list) {
> > > > + record_argv[i++] = "-C";
> > > > + record_argv[i++] = stat_config.user_requested_cpu_list;
> > > > + }
> > > > +
> > > > + if (stat_config.system_wide)
> > > > + record_argv[i++] = "-a";
> > > > +
> > > > + if (!stat_config.system_wide
> && !stat_config.user_requested_cpu_list)
> > > {
> > > > + pr_err("Require -a or -C option to run sampling.\n");
> > > > + return -ECANCELED;
> > > > + }
> > > > +
> > > > + list_for_each_entry(e, &stat_config.tpebs_events, nd) {
> > > > + record_argv[i++] = "-e";
> > > > + record_argv[i++] = e->name;
> > > > + }
> > > > +
> > > > + record_argv[i++] = "-o";
> > > > + record_argv[i++] = PERF_DATA;
> > > > +
> > > > return 0;
> > > > }
> > >
> > > Still I think it's weird it has 'perf record' in perf stat (despite the
> > > 'perf stat record'). If it's only Intel thing and we don't have a plan
> > > to do the same on other arches, we can move it to the arch
> > > directory and keep the perf stat code simple.
> >
> > I'm not sure what is the proper way to solve this. And Ian mentioned
> > that put code in arch directory could potentially cause other bugs.
> > So I'm wondering if we could keep this code here for now. I could work
> > on it later if we found it's better to be in arch directory.
>
> Maybe somewhere in the util/ and keep the main code minimal.
> IIUC it's only for very recent (or upcoming?) Intel CPUs and we
> don't have tests (hopefully can run on other arch/CPUs).
>
> So I don't think having it here would help fixing potential bugs.
Do you mean by creating a new file in util/ to hold this code?
Yes, this feature is for very recent Intel CPUs. It should only be triggered if
a metric uses event(s) that has the R modifier in the formula.
Thanks,
Weilin
>
> > > >
> > > > +static void prepare_run_command(struct child_process *cmd,
> > > > + const char **argv)
> > > > +{
> > > > + memset(cmd, 0, sizeof(*cmd));
> > > > + cmd->argv = argv;
> > > > + cmd->out = -1;
> > > > +}
> > > > +
> > > > +static int prepare_perf_record(struct child_process *cmd)
> > > > +{
> > > > + const char **record_argv;
> > > > + int ret;
> > > > +
> > > > + record_argv = calloc(10 + 2 * stat_config.tpebs_event_size,
> sizeof(char
> > > *));
> > > > + if (!record_argv)
> > > > + return -1;
> > > > +
> > > > + ret = __run_perf_record(record_argv);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + prepare_run_command(cmd, record_argv);
> > > > + return start_command(cmd);
> > > > +}
> > > > +
> > > > +struct perf_script {
> > > > + struct perf_tool tool;
> > > > + struct perf_session *session;
> > > > +};
> > > > +
> > > > +static void tpebs_data__delete(void)
> > > > +{
> > > > + struct tpebs_retire_lat *r, *rtmp;
> > > > + struct tpebs_event *e, *etmp;
> > > > +
> > > > + list_for_each_entry_safe(r, rtmp, &stat_config.tpebs_results,
> event.nd)
> > > {
> > > > + list_del_init(&r->event.nd);
> > > > + free(r);
> > > > + }
> > > > + list_for_each_entry_safe(e, etmp, &stat_config.tpebs_events, nd) {
> > > > + list_del_init(&e->nd);
> > > > + free(e);
> > >
> > > Shouldn't it free the names?
> > >
> > >
> > > > + }
> > > > +}
> > > > +
> > > > +static int process_sample_event(struct perf_tool *tool
> __maybe_unused,
> > > > + union perf_event *event __maybe_unused,
> > > > + struct perf_sample *sample,
> > > > + struct evsel *evsel,
> > > > + struct machine *machine __maybe_unused)
> > > > +{
> > > > + int ret = 0;
> > > > + const char *evname;
> > > > + struct tpebs_retire_lat *t;
> > > > +
> > > > + evname = evsel__name(evsel);
> > > > +
> > > > + /*
> > > > + * Need to handle per core results? We are assuming average retire
> > > > + * latency value will be used. Save the number of samples and the
> sum
> > > of
> > > > + * retire latency value for each event.
> > > > + */
> > > > + list_for_each_entry(t, &stat_config.tpebs_results, event.nd) {
> > > > + if (!strcmp(evname, t->event.name)) {
> > > > + t->count += 1;
> > > > + t->sum += sample->retire_lat;
> > > > + break;
> > > > + }
> > > > + }
> > > > +
> > > > + return ret;
> > > > +}
> > > > +
> > > > +static int process_feature_event(struct perf_session *session,
> > > > + union perf_event *event)
> > > > +{
> > > > + if (event->feat.feat_id < HEADER_LAST_FEATURE)
> > > > + return perf_event__process_feature(session, event);
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static void *__cmd_script(void *arg __maybe_unused)
> > >
> > > The arg is used.
> > >
> > > Also I don't like the name 'script' as it has nothing to do with
> > > scripting. Maybe 'sample_reader', 'tpebs_reader' or
> > > 'reader_thread'?
> > >
> > >
> > > > +{
> > > > + struct child_process *cmd = arg;
> > > > + struct perf_session *session;
> > > > + struct perf_data data = {
> > > > + .mode = PERF_DATA_MODE_READ,
> > > > + .path = PERF_DATA,
> > > > + .file.fd = cmd->out,
> > > > + };
> > > > + struct perf_script script = {
> > > > + .tool = {
> > > > + .sample = process_sample_event,
> > > > + .feature = process_feature_event,
> > > > + .attr = perf_event__process_attr,
> > >
> > > Broken indentation. And if you just use the tool, you can
> > > pass it directly.
> > >
> > >
> > > > + },
> > > > + };
> > > > +
> > > > + session = perf_session__new(&data, &script.tool);
> > > > + if (IS_ERR(session))
> > > > + return NULL;
> > > > + script.session = session;
> > > > + perf_session__process_events(session);
> > > > + perf_session__delete(session);
> > > > +
> > > > + return NULL;
> > > > +}
> > > > +
> > > > static int __run_perf_stat(int argc, const char **argv, int run_idx)
> > > > {
> > > > int interval = stat_config.interval;
> > > > @@ -709,15 +856,38 @@ static int __run_perf_stat(int argc, const char
> > > **argv, int run_idx)
> > > > struct affinity saved_affinity, *affinity = NULL;
> > > > int err;
> > > > bool second_pass = false;
> > > > + struct child_process cmd;
> > > > + pthread_t thread_script;
> > > >
> > > > /* Prepare perf record for sampling event retire_latency before fork
> and
> > > > * prepare workload */
> > > > if (stat_config.tpebs_event_size > 0) {
> > > > int ret;
> > > > + struct tpebs_event *e;
> > > >
> > > > - ret = __run_perf_record();
> > > > + pr_debug("perf stat pid = %d\n", getpid());
> > > > + list_for_each_entry(e, &stat_config.tpebs_events, nd) {
> > > > + struct tpebs_retire_lat *new = malloc(sizeof(struct
> > > tpebs_retire_lat));
> > > > +
> > > > + if (!new)
> > > > + return -1;
> > > > + new->event.name = strdup(e->name);
> > > > + new->event.tpebs_name = strdup(e->tpebs_name);
> > >
> > > These can fail too.
> > >
> > >
> > > > + new->count = 0;
> > > > + new->sum = 0;
> > > > + list_add_tail(&new->event.nd, &stat_config.tpebs_results);
> > > > + }
> > > > + ret = prepare_perf_record(&cmd);
> > > > if (ret)
> > > > return ret;
> > > > + if (pthread_create(&thread_script, NULL, __cmd_script, &cmd))
> {
> > > > + kill(cmd.pid, SIGTERM);
> > > > + close(cmd.out);
> > > > + pr_err("Could not create thread to process sample data.\n");
> > > > + return -1;
> > > > + }
> > > > + /* Wait for perf record initialization a little bit.*/
> > > > + sleep(2);
> > >
> > > This won't guarantee anything. If you want to make sure the
> > > 'thread_script' to run before the 'perf record' process, you can
> > > use a pipe to signal that like in evlist__prepare_workload() and
> > > evlist__start_workload().
> >
> > This sleep is added to make perf stat wait for record initialization because in
> the
> > case that the workload runs very small a mount of time, we'd like to ensure
> perf
> > record has enough time to launch and start collecting sample data.
>
> But waiting for 2 seconds won't solve the problem.
>
> >
> > Because the code uses the common API in run-command.h to do the fork, I
> think
> > it cannot use PIPE like in evlist__prepare_workload to sync and start perf
> record
> > and perf stat data collection together. Please correct me if I'm wrong here.
>
> Ok, it'd be hard to sync both perf record and perf stat with a single workload.
> I think you can try --control option in perf record to enable/disable
> with timing
> you want. Also --synth=no should reduce a lot of overhead during
> initialization.
> Maybe you can also add --synth=no-kernel to completely skip the synthesis.
>
> Thanks,
> Namhyung
>
>
> > > > }
> > > >
> > > > if (forks) {
> > > > @@ -925,6 +1095,17 @@ static int __run_perf_stat(int argc, const char
> > > **argv, int run_idx)
> > > >
> > > > t1 = rdclock();
> > > >
> > > > + if (stat_config.tpebs_event_size > 0) {
> > > > + int ret;
> > > > +
> > > > + kill(cmd.pid, SIGTERM);
> > > > + pthread_join(thread_script, NULL);
> > > > + close(cmd.out);
> > > > + ret = finish_command(&cmd);
> > > > + if (ret != -ERR_RUN_COMMAND_WAITPID_SIGNAL)
> > > > + return ret;
> > > > + }
> > > > +
> > > > if (stat_config.walltime_run_table)
> > > > stat_config.walltime_run[run_idx] = t1 - t0;
> > > >
> > > > @@ -1032,6 +1213,9 @@ static void sig_atexit(void)
> > > > if (child_pid != -1)
> > > > kill(child_pid, SIGTERM);
> > > >
> > > > + if (stat_config.tpebs_pid != -1)
> > > > + kill(stat_config.tpebs_pid, SIGTERM);
> > > > +
> > > > sigprocmask(SIG_SETMASK, &oset, NULL);
> > > >
> > > > if (signr == -1)
> > > > @@ -2972,5 +3156,7 @@ int cmd_stat(int argc, const char **argv)
> > > > metricgroup__rblist_exit(&stat_config.metric_events);
> > > > evlist__close_control(stat_config.ctl_fd, stat_config.ctl_fd_ack,
> > > &stat_config.ctl_fd_close);
> > > >
> > > > + tpebs_data__delete();
> > > > +
> > > > return status;
> > > > }
> > > > diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
> > > > index 08c4bfbd817f..98e3014c0aef 100644
> > > > --- a/tools/perf/util/data.c
> > > > +++ b/tools/perf/util/data.c
> > > > @@ -204,7 +204,11 @@ static bool check_pipe(struct perf_data *data)
> > > > data->file.fd = fd;
> > > > data->use_stdio = false;
> > > > }
> > > > - } else {
> > > > + /*
> > > > + * When is_pipe and data->file.fd is given, use given fd
> > > > + * instead of STDIN_FILENO or STDOUT_FILENO
> > > > + */
> > > > + } else if (data->file.fd <= 0) {
> > > > data->file.fd = fd;
> > > > }
> > > > }
> > >
> > > I think this can be in a separate commit.
> > >
> > >
> > > > diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
> > > > index 7c24ed768ff3..ae788edef30f 100644
> > > > --- a/tools/perf/util/metricgroup.h
> > > > +++ b/tools/perf/util/metricgroup.h
> > > > @@ -68,10 +68,18 @@ struct metric_expr {
> > > >
> > > > struct tpebs_event {
> > > > struct list_head nd;
> > > > + /* Event name */
> > > > const char *name;
> > > > + /* Event name with the TPEBS modifier R */
> > > > const char *tpebs_name;
> > > > };
> > > >
> > > > +struct tpebs_retire_lat {
> > > > + struct tpebs_event event;
> > > > + size_t count;
> > > > + int sum;
> > > > +};
> > >
> > > Actually I don't know why you need this separate structure.
> > > Can we just use tpebs_event?
> >
> > Currently, we use average value as the retire latency value in metrics. But we
> > might update it to use other value, for example the minimum or maximum.
> So, I thought
> > it would be better to have a dedicated data structure to handle this data. I
> could update
> > the code to use tpebs_event if you still feel that's better.
> >
> > Thanks,
> > Weilin
> >
> > >
> > > Thanks,
> > > Namhyung
> > >
> > >
> > > > +
> > > > struct metric_event *metricgroup__lookup(struct rblist *metric_events,
> > > > struct evsel *evsel,
> > > > bool create);
> > > > diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
> > > > index b987960df3c5..0726bdc06681 100644
> > > > --- a/tools/perf/util/stat.h
> > > > +++ b/tools/perf/util/stat.h
> > > > @@ -111,6 +111,8 @@ struct perf_stat_config {
> > > > struct rblist metric_events;
> > > > struct list_head tpebs_events;
> > > > size_t tpebs_event_size;
> > > > + struct list_head tpebs_results;
> > > > + pid_t tpebs_pid;
> > > > int ctl_fd;
> > > > int ctl_fd_ack;
> > > > bool ctl_fd_close;
> > > > --
> > > > 2.43.0
> > > >
Powered by blists - more mailing lists