linux-kernel - Re: [PATCH v2] perf stat: Introduce skippable evsels

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fW2aAijt8tqydszQHQFmsfeQO2S0hb7Z27RtXxG4Zmm-w@mail.gmail.com>
Date:   Mon, 17 Apr 2023 08:59:23 -0700
From:   Ian Rogers <irogers@...gle.com>
To:     "Liang, Kan" <kan.liang@...ux.intel.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Mark Rutland <mark.rutland@....com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...nel.org>,
        Namhyung Kim <namhyung@...nel.org>,
        Adrian Hunter <adrian.hunter@...el.com>,
        Florian Fischer <florian.fischer@...q.space>,
        linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] perf stat: Introduce skippable evsels

On Mon, Apr 17, 2023 at 6:58 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>
>
>
> On 2023-04-14 7:03 p.m., Ian Rogers wrote:
> > On Fri, Apr 14, 2023 at 11:07 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
> >>
> >>
> >>
> >> On 2023-04-14 1:19 a.m., Ian Rogers wrote:
> >>> Perf stat with no arguments will use default events and metrics. These
> >>> events may fail to open even with kernel and hypervisor disabled. When
> >>> these fail then the permissions error appears even though they were
> >>> implicitly selected. This is particularly a problem with the automatic
> >>> selection of the TopdownL1 metric group on certain architectures like
> >>> Skylake:
> >>>
> >>> ```
> >>> $ perf stat true
> >>> Error:
> >>> Access to performance monitoring and observability operations is limited.
> >>> Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
> >>> access to performance monitoring and observability operations for processes
> >>> without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
> >>> More information can be found at 'Perf events and tool security' document:
> >>> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
> >>> perf_event_paranoid setting is 2:
> >>>   -1: Allow use of (almost) all events by all users
> >>>       Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
> >>>> = 0: Disallow raw and ftrace function tracepoint access
> >>>> = 1: Disallow CPU event access
> >>>> = 2: Disallow kernel profiling
> >>> To make the adjusted perf_event_paranoid setting permanent preserve it
> >>> in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
> >>> ```
> >>>
> >>> This patch adds skippable evsels that when they fail to open won't
> >>> fail and won't appear in output. The TopdownL1 events, from the metric
> >>> group, are marked as skippable. This turns the failure above to:
> >>>
> >>> ```
> >>> $ perf stat true
> >>>
> >>>  Performance counter stats for 'true':
> >>>
> >>>               1.26 msec task-clock:u                     #    0.328 CPUs utilized
> >>>                  0      context-switches:u               #    0.000 /sec
> >>>                  0      cpu-migrations:u                 #    0.000 /sec
> >>>                 49      page-faults:u                    #   38.930 K/sec
> >>>            176,449      cycles:u                         #    0.140 GHz                         (48.99%)
> >>
> >> Multiplexing?
> >>
> >> Thanks,
> >> Kan
> >
> > I may have been running a test in the background otherwise I can't
> > explain it. Repeating the test yields no multiplexing:
>
>
> The above multiplexing should be on a Skylake (since there is no
> topdownL1 printed), but the test which you repeat seems on a Tigerlake
> (has topdownL1). Could you please double check on a Skylake?

In the best circumstances (ie no EBS_Mode, no other events, nmi
watchdog disabled) Skylake has multiplexing for TopdownL1:

```
$ sudo perf stat --metric-no-group -M TopdownL1 --metric-no-group -a sleep 1

Performance counter stats for 'system wide':

      500,145,019      INST_RETIRED.ANY                 #     14.2 %
tma_retiring             (71.07%)
    2,402,640,337      CPU_CLK_UNHALTED.THREAD_ANY      #     41.1 %
tma_frontend_bound
                                                 #     36.2 %
tma_backend_bound
                                                 #      8.4 %
tma_bad_speculation      (85.63%)
    1,976,769,761      IDQ_UOPS_NOT_DELIVERED.CORE
                        (85.81%)
      114,069,133      INT_MISC.RECOVERY_CYCLES_ANY
                        (85.83%)
      684,605,487      UOPS_RETIRED.RETIRE_SLOTS
                        (85.83%)
       49,695,823      UOPS_RETIRED.MACRO_FUSED
                        (85.83%)
      860,603,122      UOPS_ISSUED.ANY
                        (56.70%)

      1.014011174 seconds time elapsed
```

but this isn't a regression:
https://lore.kernel.org/lkml/20200520072814.128267-1-irogers@google.com/

I think this is off-topic for this change.

Thanks,
Ian

> Thanks,
> Kan
> >
> > ```
> > $ perf stat true
> >
> > Performance counter stats for 'true':
> >
> >              0.78 msec task-clock:u                     #    0.383
> > CPUs utilized
> >                 0      context-switches:u               #    0.000
> > /sec
> >                 0      cpu-migrations:u                 #    0.000
> > /sec
> >                47      page-faults:u                    #   60.174
> > K/sec
> >           233,420      cycles:u                         #    0.299 GHz
> >           133,318      instructions:u                   #    0.57
> > insn per cycle
> >            31,396      branches:u                       #   40.196
> > M/sec
> >             2,334      branch-misses:u                  #    7.43% of
> > all branches
> >         1,167,100      TOPDOWN.SLOTS:u                  #     12.2 %
> > tma_retiring
> >                                                  #     28.9 %
> > tma_backend_bound
> >                                                  #     41.0 %
> > tma_frontend_bound
> >                                                  #     18.0 %
> > tma_bad_speculation
> >           141,882      topdown-retiring:u
> >           480,570      topdown-fe-bound:u
> >           320,380      topdown-be-bound:u
> >           224,266      topdown-bad-spec:u
> >             2,173      INT_MISC.UOP_DROPPING:u          #    2.782
> > M/sec
> >             3,323      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #
> >  4.254 M/sec
> >
> >
> >       0.002036744 seconds time elapsed
> >
> >       0.002252000 seconds user
> >       0.000000000 seconds sys
> > ```
> >
> > Thanks,
> > Ian
> >
> >>>            122,905      instructions:u                   #    0.70  insn per cycle
> >>>             28,264      branches:u                       #   22.456 M/sec
> >>>              2,405      branch-misses:u                  #    8.51% of all branches
> >>>
> >>>        0.003834565 seconds time elapsed
> >>>
> >>>        0.000000000 seconds user
> >>>        0.004130000 seconds sys
> >>> ```
> >>>
> >>> When the events can have kernel/hypervisor disabled, like on
> >>> Tigerlake, then it continues to succeed as:
> >>>
> >>> ```
> >>> $ perf stat true
> >>>
> >>>  Performance counter stats for 'true':
> >>>
> >>>               0.57 msec task-clock:u                     #    0.385 CPUs utilized
> >>>                  0      context-switches:u               #    0.000 /sec
> >>>                  0      cpu-migrations:u                 #    0.000 /sec
> >>>                 47      page-faults:u                    #   82.329 K/sec
> >>>            287,017      cycles:u                         #    0.503 GHz
> >>>            133,318      instructions:u                   #    0.46  insn per cycle
> >>>             31,396      branches:u                       #   54.996 M/sec
> >>>              2,442      branch-misses:u                  #    7.78% of all branches
> >>>            998,790      TOPDOWN.SLOTS:u                  #     14.5 %  tma_retiring
> >>>                                                   #     27.6 %  tma_backend_bound
> >>>                                                   #     40.9 %  tma_frontend_bound
> >>>                                                   #     17.0 %  tma_bad_speculation
> >>>            144,922      topdown-retiring:u
> >>>            411,266      topdown-fe-bound:u
> >>>            258,510      topdown-be-bound:u
> >>>            184,090      topdown-bad-spec:u
> >>>              2,585      INT_MISC.UOP_DROPPING:u          #    4.528 M/sec
> >>>              3,434      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #    6.015 M/sec
> >>>
> >>>        0.001480954 seconds time elapsed
> >>>
> >>>        0.000000000 seconds user
> >>>        0.001686000 seconds sys
> >>> ```
> >>>
> >>> And this likewise works if paranoia allows or running as root.
> >>>
> >>> v2. Don't display the skipped events as <not counted> or <not supported>.
> >>>
> >>> Signed-off-by: Ian Rogers <irogers@...gle.com>
> >>> ---
> >>>  tools/perf/builtin-stat.c      | 39 ++++++++++++++++++++++++++--------
> >>>  tools/perf/util/evsel.c        | 15 +++++++++++--
> >>>  tools/perf/util/evsel.h        |  1 +
> >>>  tools/perf/util/stat-display.c |  4 ++++
> >>>  4 files changed, 48 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> >>> index d3cbee7460fc..7a641a67486d 100644
> >>> --- a/tools/perf/builtin-stat.c
> >>> +++ b/tools/perf/builtin-stat.c
> >>> @@ -667,6 +667,13 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
> >>>                       evsel_list->core.threads->err_thread = -1;
> >>>                       return COUNTER_RETRY;
> >>>               }
> >>> +     } else if (counter->skippable) {
> >>> +             if (verbose > 0)
> >>> +                     ui__warning("skipping event %s that kernel failed to open .\n",
> >>> +                                 evsel__name(counter));
> >>> +             counter->supported = false;
> >>> +             counter->errored = true;
> >>> +             return COUNTER_SKIP;
> >>>       }
> >>>
> >>>       evsel__open_strerror(counter, &target, errno, msg, sizeof(msg));
> >>> @@ -1885,15 +1892,29 @@ static int add_default_attributes(void)
> >>>                * Add TopdownL1 metrics if they exist. To minimize
> >>>                * multiplexing, don't request threshold computation.
> >>>                */
> >>> -             if (metricgroup__has_metric("TopdownL1") &&
> >>> -                 metricgroup__parse_groups(evsel_list, "TopdownL1",
> >>> -                                         /*metric_no_group=*/false,
> >>> -                                         /*metric_no_merge=*/false,
> >>> -                                         /*metric_no_threshold=*/true,
> >>> -                                         stat_config.user_requested_cpu_list,
> >>> -                                         stat_config.system_wide,
> >>> -                                         &stat_config.metric_events) < 0)
> >>> -                     return -1;
> >>> +             if (metricgroup__has_metric("TopdownL1")) {
> >>> +                     struct evlist *metric_evlist = evlist__new();
> >>> +                     struct evsel *metric_evsel;
> >>> +
> >>> +                     if (!metric_evlist)
> >>> +                             return -1;
> >>> +
> >>> +                     if (metricgroup__parse_groups(metric_evlist, "TopdownL1",
> >>> +                                                     /*metric_no_group=*/false,
> >>> +                                                     /*metric_no_merge=*/false,
> >>> +                                                     /*metric_no_threshold=*/true,
> >>> +                                                     stat_config.user_requested_cpu_list,
> >>> +                                                     stat_config.system_wide,
> >>> +                                                     &stat_config.metric_events) < 0)
> >>> +                             return -1;
> >>> +
> >>> +                     evlist__for_each_entry(metric_evlist, metric_evsel) {
> >>> +                             metric_evsel->skippable = true;
> >>> +                     }
> >>> +                     evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
> >>> +                     evlist__delete(metric_evlist);
> >>> +             }
> >>> +
> >>>               /* Platform specific attrs */
> >>>               if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0)
> >>>                       return -1;
> >>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>> index a85a987128aa..83a65f771666 100644
> >>> --- a/tools/perf/util/evsel.c
> >>> +++ b/tools/perf/util/evsel.c
> >>> @@ -290,6 +290,7 @@ void evsel__init(struct evsel *evsel,
> >>>       evsel->per_pkg_mask  = NULL;
> >>>       evsel->collect_stat  = false;
> >>>       evsel->pmu_name      = NULL;
> >>> +     evsel->skippable     = false;
> >>>  }
> >>>
> >>>  struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
> >>> @@ -1720,9 +1721,13 @@ static int get_group_fd(struct evsel *evsel, int cpu_map_idx, int thread)
> >>>               return -1;
> >>>
> >>>       fd = FD(leader, cpu_map_idx, thread);
> >>> -     BUG_ON(fd == -1);
> >>> +     BUG_ON(fd == -1 && !leader->skippable);
> >>>
> >>> -     return fd;
> >>> +     /*
> >>> +      * When the leader has been skipped, return -2 to distinguish from no
> >>> +      * group leader case.
> >>> +      */
> >>> +     return fd == -1 ? -2 : fd;
> >>>  }
> >>>
> >>>  static void evsel__remove_fd(struct evsel *pos, int nr_cpus, int nr_threads, int thread_idx)
> >>> @@ -2104,6 +2109,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> >>>
> >>>                       group_fd = get_group_fd(evsel, idx, thread);
> >>>
> >>> +                     if (group_fd == -2) {
> >>> +                             pr_debug("broken group leader for %s\n", evsel->name);
> >>> +                             err = -EINVAL;
> >>> +                             goto out_close;
> >>> +                     }
> >>> +
> >>>                       test_attr__ready();
> >>>
> >>>                       /* Debug message used by test scripts */
> >>> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> >>> index 68072ec655ce..98afe3351176 100644
> >>> --- a/tools/perf/util/evsel.h
> >>> +++ b/tools/perf/util/evsel.h
> >>> @@ -95,6 +95,7 @@ struct evsel {
> >>>               bool                    weak_group;
> >>>               bool                    bpf_counter;
> >>>               bool                    use_config_name;
> >>> +             bool                    skippable;
> >>>               int                     bpf_fd;
> >>>               struct bpf_object       *bpf_obj;
> >>>               struct list_head        config_terms;
> >>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> >>> index e6035ecbeee8..6b46bbb3d322 100644
> >>> --- a/tools/perf/util/stat-display.c
> >>> +++ b/tools/perf/util/stat-display.c
> >>> @@ -810,6 +810,10 @@ static bool should_skip_zero_counter(struct perf_stat_config *config,
> >>>       struct perf_cpu cpu;
> >>>       int idx;
> >>>
> >>> +     /* Skip counters that were speculatively/default enabled rather than requested. */
> >>> +     if (counter->skippable)
> >>> +             return true;
> >>> +
> >>>       /*
> >>>        * Skip value 0 when enabling --per-thread globally,
> >>>        * otherwise it will have too many 0 output.