lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fWRy4NEqhB6-b98+m7SV5=oyBOMVuOHwmvKZCJuXcsQEg@mail.gmail.com>
Date:   Mon, 17 Apr 2023 11:13:55 -0700
From:   Ian Rogers <irogers@...gle.com>
To:     "Liang, Kan" <kan.liang@...ux.intel.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Mark Rutland <mark.rutland@....com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...nel.org>,
        Namhyung Kim <namhyung@...nel.org>,
        Adrian Hunter <adrian.hunter@...el.com>,
        Florian Fischer <florian.fischer@...q.space>,
        linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] perf stat: Introduce skippable evsels

On Mon, Apr 17, 2023 at 10:31 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>
>
>
> On 2023-04-17 11:59 a.m., Ian Rogers wrote:
> > On Mon, Apr 17, 2023 at 6:58 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
> >>
> >>
> >>
> >> On 2023-04-14 7:03 p.m., Ian Rogers wrote:
> >>> On Fri, Apr 14, 2023 at 11:07 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2023-04-14 1:19 a.m., Ian Rogers wrote:
> >>>>> Perf stat with no arguments will use default events and metrics. These
> >>>>> events may fail to open even with kernel and hypervisor disabled. When
> >>>>> these fail then the permissions error appears even though they were
> >>>>> implicitly selected. This is particularly a problem with the automatic
> >>>>> selection of the TopdownL1 metric group on certain architectures like
> >>>>> Skylake:
> >>>>>
> >>>>> ```
> >>>>> $ perf stat true
> >>>>> Error:
> >>>>> Access to performance monitoring and observability operations is limited.
> >>>>> Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
> >>>>> access to performance monitoring and observability operations for processes
> >>>>> without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
> >>>>> More information can be found at 'Perf events and tool security' document:
> >>>>> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
> >>>>> perf_event_paranoid setting is 2:
> >>>>>   -1: Allow use of (almost) all events by all users
> >>>>>       Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
> >>>>>> = 0: Disallow raw and ftrace function tracepoint access
> >>>>>> = 1: Disallow CPU event access
> >>>>>> = 2: Disallow kernel profiling
> >>>>> To make the adjusted perf_event_paranoid setting permanent preserve it
> >>>>> in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
> >>>>> ```
> >>>>>
> >>>>> This patch adds skippable evsels that when they fail to open won't
> >>>>> fail and won't appear in output. The TopdownL1 events, from the metric
> >>>>> group, are marked as skippable. This turns the failure above to:
> >>>>>
> >>>>> ```
> >>>>> $ perf stat true
> >>>>>
> >>>>>  Performance counter stats for 'true':
> >>>>>
> >>>>>               1.26 msec task-clock:u                     #    0.328 CPUs utilized
> >>>>>                  0      context-switches:u               #    0.000 /sec
> >>>>>                  0      cpu-migrations:u                 #    0.000 /sec
> >>>>>                 49      page-faults:u                    #   38.930 K/sec
> >>>>>            176,449      cycles:u                         #    0.140 GHz                         (48.99%)
> >>>>
> >>>> Multiplexing?
> >>>>
> >>>> Thanks,
> >>>> Kan
> >>>
> >>> I may have been running a test in the background otherwise I can't
> >>> explain it. Repeating the test yields no multiplexing:
> >>
> >>
> >> The above multiplexing should be on a Skylake (since there is no
> >> topdownL1 printed), but the test which you repeat seems on a Tigerlake
> >> (has topdownL1). Could you please double check on a Skylake?
> >
> > In the best circumstances (ie no EBS_Mode, no other events, nmi
> > watchdog disabled) Skylake has multiplexing for TopdownL1:
> >
>
> No I mean the perf stat default on Skylake.
>
> With this patch, on a Skylake machine, there should be no TopdownL1
> event and no multiplexing.

There are no top down events in:
```
$ perf stat true

 Performance counter stats for 'true':

              1.26 msec task-clock:u                     #    0.328
CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                49      page-faults:u                    #   38.930 K/sec
           176,449      cycles:u                         #    0.140
GHz                         (48.99%)
           122,905      instructions:u                   #    0.70
insn per cycle
            28,264      branches:u                       #   22.456 M/sec
             2,405      branch-misses:u                  #    8.51% of
all branches

       0.003834565 seconds time elapsed

       0.000000000 seconds user
       0.004130000 seconds sys
```
The json TopdownL1 is enabled if present unconditionally for perf stat
default. Enabling it on Skylake has multiplexing as TopdownL1 on
Skylake has multiplexing unrelated to this change - at least on the
machine I was testing on. We can remove the metric group TopdownL1 on
Skylake so that we don't enable it by default, there is still the
group TmaL1. To me, disabling TopdownL1 seems less desirable than
running with multiplexing - previously to get into topdown analysis
there has to be knowledge that "perf stat -M TopdownL1" is the way to
do this.

This doesn't relate to this change which is about making it so that
failing to set up TopdownL1 doesn't cause an early exit. The reason I
showed TigerLake output was that on TigerLake the skip/fallback
approach works while Skylake just needs the events disabled/skipped
unless it has sufficient permissions. Note the :u on the events in:

```
$ perf stat true

 Performance counter stats for 'true':

              0.57 msec task-clock:u                     #    0.385
CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                47      page-faults:u                    #   82.329 K/sec
           287,017      cycles:u                         #    0.503 GHz
           133,318      instructions:u                   #    0.46
insn per cycle
            31,396      branches:u                       #   54.996 M/sec
             2,442      branch-misses:u                  #    7.78% of
all branches
           998,790      TOPDOWN.SLOTS:u                  #     14.5 %
tma_retiring
                                                  #     27.6 %
tma_backend_bound
                                                  #     40.9 %
tma_frontend_bound
                                                  #     17.0 %
tma_bad_speculation
           144,922      topdown-retiring:u
           411,266      topdown-fe-bound:u
           258,510      topdown-be-bound:u
           184,090      topdown-bad-spec:u
             2,585      INT_MISC.UOP_DROPPING:u          #    4.528 M/sec
             3,434      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #
  6.015 M/sec

       0.001480954 seconds time elapsed

       0.000000000 seconds user
       0.001686000 seconds sys
```

> From your test result in the v2 description, we can see that there is no
> TopdownL1, which is good and expected. However, there is a (48.99%) with
> cycles:u event, which implies multiplexing. Could you please check
> what's the problem here?
> Also, if it's because of the backgroud, all the events should be
> multiplexing. But it looks like only cycle:u has multiplexing. Other
> events, instructions:u, branches:u and branch-misses:u work without
> multiplexing. That's very strange.

I wasn't able to reproduce it and suspect it was a transient thing. I
think there are multiplexing things to look into, 2 events on a fixed
counter on Icelake+ will trigger multiplexing on all counters, and
Skylake's 3 fixed and 4 generic should fit TopdownL1.

Thanks,
Ian

> > ```
> > $ sudo perf stat --metric-no-group -M TopdownL1 --metric-no-group -a sleep 1
> >
> > Performance counter stats for 'system wide':
> >
> >       500,145,019      INST_RETIRED.ANY                 #     14.2 %
> > tma_retiring             (71.07%)
> >     2,402,640,337      CPU_CLK_UNHALTED.THREAD_ANY      #     41.1 %
> > tma_frontend_bound
> >                                                  #     36.2 %
> > tma_backend_bound
> >                                                  #      8.4 %
> > tma_bad_speculation      (85.63%)
> >     1,976,769,761      IDQ_UOPS_NOT_DELIVERED.CORE
> >                         (85.81%)
> >       114,069,133      INT_MISC.RECOVERY_CYCLES_ANY
> >                         (85.83%)
> >       684,605,487      UOPS_RETIRED.RETIRE_SLOTS
> >                         (85.83%)
> >        49,695,823      UOPS_RETIRED.MACRO_FUSED
> >                         (85.83%)
> >       860,603,122      UOPS_ISSUED.ANY
> >                         (56.70%)
> >
> >       1.014011174 seconds time elapsed
> > ```
> >
> > but this isn't a regression:
> > https://lore.kernel.org/lkml/20200520072814.128267-1-irogers@google.com/
> >
> > I think this is off-topic for this change.
> >
> > Thanks,
> > Ian
> >
> >> Thanks,
> >> Kan
> >>>
> >>> ```
> >>> $ perf stat true
> >>>
> >>> Performance counter stats for 'true':
> >>>
> >>>              0.78 msec task-clock:u                     #    0.383
> >>> CPUs utilized
> >>>                 0      context-switches:u               #    0.000
> >>> /sec
> >>>                 0      cpu-migrations:u                 #    0.000
> >>> /sec
> >>>                47      page-faults:u                    #   60.174
> >>> K/sec
> >>>           233,420      cycles:u                         #    0.299 GHz
> >>>           133,318      instructions:u                   #    0.57
> >>> insn per cycle
> >>>            31,396      branches:u                       #   40.196
> >>> M/sec
> >>>             2,334      branch-misses:u                  #    7.43% of
> >>> all branches
> >>>         1,167,100      TOPDOWN.SLOTS:u                  #     12.2 %
> >>> tma_retiring
> >>>                                                  #     28.9 %
> >>> tma_backend_bound
> >>>                                                  #     41.0 %
> >>> tma_frontend_bound
> >>>                                                  #     18.0 %
> >>> tma_bad_speculation
> >>>           141,882      topdown-retiring:u
> >>>           480,570      topdown-fe-bound:u
> >>>           320,380      topdown-be-bound:u
> >>>           224,266      topdown-bad-spec:u
> >>>             2,173      INT_MISC.UOP_DROPPING:u          #    2.782
> >>> M/sec
> >>>             3,323      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #
> >>>  4.254 M/sec
> >>>
> >>>
> >>>       0.002036744 seconds time elapsed
> >>>
> >>>       0.002252000 seconds user
> >>>       0.000000000 seconds sys
> >>> ```
> >>>
>
> This is apparently on a TigerLake, which has TopdownL1 with the perf
> stat default. I don't have a problem with this.
>
> Thanks,
> Kan
>
> >>> Thanks,
> >>> Ian
> >>>
> >>>>>            122,905      instructions:u                   #    0.70  insn per cycle
> >>>>>             28,264      branches:u                       #   22.456 M/sec
> >>>>>              2,405      branch-misses:u                  #    8.51% of all branches
> >>>>>
> >>>>>        0.003834565 seconds time elapsed
> >>>>>
> >>>>>        0.000000000 seconds user
> >>>>>        0.004130000 seconds sys
> >>>>> ```
> >>>>>
> >>>>> When the events can have kernel/hypervisor disabled, like on
> >>>>> Tigerlake, then it continues to succeed as:
> >>>>>
> >>>>> ```
> >>>>> $ perf stat true
> >>>>>
> >>>>>  Performance counter stats for 'true':
> >>>>>
> >>>>>               0.57 msec task-clock:u                     #    0.385 CPUs utilized
> >>>>>                  0      context-switches:u               #    0.000 /sec
> >>>>>                  0      cpu-migrations:u                 #    0.000 /sec
> >>>>>                 47      page-faults:u                    #   82.329 K/sec
> >>>>>            287,017      cycles:u                         #    0.503 GHz
> >>>>>            133,318      instructions:u                   #    0.46  insn per cycle
> >>>>>             31,396      branches:u                       #   54.996 M/sec
> >>>>>              2,442      branch-misses:u                  #    7.78% of all branches
> >>>>>            998,790      TOPDOWN.SLOTS:u                  #     14.5 %  tma_retiring
> >>>>>                                                   #     27.6 %  tma_backend_bound
> >>>>>                                                   #     40.9 %  tma_frontend_bound
> >>>>>                                                   #     17.0 %  tma_bad_speculation
> >>>>>            144,922      topdown-retiring:u
> >>>>>            411,266      topdown-fe-bound:u
> >>>>>            258,510      topdown-be-bound:u
> >>>>>            184,090      topdown-bad-spec:u
> >>>>>              2,585      INT_MISC.UOP_DROPPING:u          #    4.528 M/sec
> >>>>>              3,434      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #    6.015 M/sec
> >>>>>
> >>>>>        0.001480954 seconds time elapsed
> >>>>>
> >>>>>        0.000000000 seconds user
> >>>>>        0.001686000 seconds sys
> >>>>> ```
> >>>>>
> >>>>> And this likewise works if paranoia allows or running as root.
> >>>>>
> >>>>> v2. Don't display the skipped events as <not counted> or <not supported>.
> >>>>>
> >>>>> Signed-off-by: Ian Rogers <irogers@...gle.com>
> >>>>> ---
> >>>>>  tools/perf/builtin-stat.c      | 39 ++++++++++++++++++++++++++--------
> >>>>>  tools/perf/util/evsel.c        | 15 +++++++++++--
> >>>>>  tools/perf/util/evsel.h        |  1 +
> >>>>>  tools/perf/util/stat-display.c |  4 ++++
> >>>>>  4 files changed, 48 insertions(+), 11 deletions(-)
> >>>>>
> >>>>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> >>>>> index d3cbee7460fc..7a641a67486d 100644
> >>>>> --- a/tools/perf/builtin-stat.c
> >>>>> +++ b/tools/perf/builtin-stat.c
> >>>>> @@ -667,6 +667,13 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
> >>>>>                       evsel_list->core.threads->err_thread = -1;
> >>>>>                       return COUNTER_RETRY;
> >>>>>               }
> >>>>> +     } else if (counter->skippable) {
> >>>>> +             if (verbose > 0)
> >>>>> +                     ui__warning("skipping event %s that kernel failed to open .\n",
> >>>>> +                                 evsel__name(counter));
> >>>>> +             counter->supported = false;
> >>>>> +             counter->errored = true;
> >>>>> +             return COUNTER_SKIP;
> >>>>>       }
> >>>>>
> >>>>>       evsel__open_strerror(counter, &target, errno, msg, sizeof(msg));
> >>>>> @@ -1885,15 +1892,29 @@ static int add_default_attributes(void)
> >>>>>                * Add TopdownL1 metrics if they exist. To minimize
> >>>>>                * multiplexing, don't request threshold computation.
> >>>>>                */
> >>>>> -             if (metricgroup__has_metric("TopdownL1") &&
> >>>>> -                 metricgroup__parse_groups(evsel_list, "TopdownL1",
> >>>>> -                                         /*metric_no_group=*/false,
> >>>>> -                                         /*metric_no_merge=*/false,
> >>>>> -                                         /*metric_no_threshold=*/true,
> >>>>> -                                         stat_config.user_requested_cpu_list,
> >>>>> -                                         stat_config.system_wide,
> >>>>> -                                         &stat_config.metric_events) < 0)
> >>>>> -                     return -1;
> >>>>> +             if (metricgroup__has_metric("TopdownL1")) {
> >>>>> +                     struct evlist *metric_evlist = evlist__new();
> >>>>> +                     struct evsel *metric_evsel;
> >>>>> +
> >>>>> +                     if (!metric_evlist)
> >>>>> +                             return -1;
> >>>>> +
> >>>>> +                     if (metricgroup__parse_groups(metric_evlist, "TopdownL1",
> >>>>> +                                                     /*metric_no_group=*/false,
> >>>>> +                                                     /*metric_no_merge=*/false,
> >>>>> +                                                     /*metric_no_threshold=*/true,
> >>>>> +                                                     stat_config.user_requested_cpu_list,
> >>>>> +                                                     stat_config.system_wide,
> >>>>> +                                                     &stat_config.metric_events) < 0)
> >>>>> +                             return -1;
> >>>>> +
> >>>>> +                     evlist__for_each_entry(metric_evlist, metric_evsel) {
> >>>>> +                             metric_evsel->skippable = true;
> >>>>> +                     }
> >>>>> +                     evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
> >>>>> +                     evlist__delete(metric_evlist);
> >>>>> +             }
> >>>>> +
> >>>>>               /* Platform specific attrs */
> >>>>>               if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0)
> >>>>>                       return -1;
> >>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>>> index a85a987128aa..83a65f771666 100644
> >>>>> --- a/tools/perf/util/evsel.c
> >>>>> +++ b/tools/perf/util/evsel.c
> >>>>> @@ -290,6 +290,7 @@ void evsel__init(struct evsel *evsel,
> >>>>>       evsel->per_pkg_mask  = NULL;
> >>>>>       evsel->collect_stat  = false;
> >>>>>       evsel->pmu_name      = NULL;
> >>>>> +     evsel->skippable     = false;
> >>>>>  }
> >>>>>
> >>>>>  struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
> >>>>> @@ -1720,9 +1721,13 @@ static int get_group_fd(struct evsel *evsel, int cpu_map_idx, int thread)
> >>>>>               return -1;
> >>>>>
> >>>>>       fd = FD(leader, cpu_map_idx, thread);
> >>>>> -     BUG_ON(fd == -1);
> >>>>> +     BUG_ON(fd == -1 && !leader->skippable);
> >>>>>
> >>>>> -     return fd;
> >>>>> +     /*
> >>>>> +      * When the leader has been skipped, return -2 to distinguish from no
> >>>>> +      * group leader case.
> >>>>> +      */
> >>>>> +     return fd == -1 ? -2 : fd;
> >>>>>  }
> >>>>>
> >>>>>  static void evsel__remove_fd(struct evsel *pos, int nr_cpus, int nr_threads, int thread_idx)
> >>>>> @@ -2104,6 +2109,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> >>>>>
> >>>>>                       group_fd = get_group_fd(evsel, idx, thread);
> >>>>>
> >>>>> +                     if (group_fd == -2) {
> >>>>> +                             pr_debug("broken group leader for %s\n", evsel->name);
> >>>>> +                             err = -EINVAL;
> >>>>> +                             goto out_close;
> >>>>> +                     }
> >>>>> +
> >>>>>                       test_attr__ready();
> >>>>>
> >>>>>                       /* Debug message used by test scripts */
> >>>>> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> >>>>> index 68072ec655ce..98afe3351176 100644
> >>>>> --- a/tools/perf/util/evsel.h
> >>>>> +++ b/tools/perf/util/evsel.h
> >>>>> @@ -95,6 +95,7 @@ struct evsel {
> >>>>>               bool                    weak_group;
> >>>>>               bool                    bpf_counter;
> >>>>>               bool                    use_config_name;
> >>>>> +             bool                    skippable;
> >>>>>               int                     bpf_fd;
> >>>>>               struct bpf_object       *bpf_obj;
> >>>>>               struct list_head        config_terms;
> >>>>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> >>>>> index e6035ecbeee8..6b46bbb3d322 100644
> >>>>> --- a/tools/perf/util/stat-display.c
> >>>>> +++ b/tools/perf/util/stat-display.c
> >>>>> @@ -810,6 +810,10 @@ static bool should_skip_zero_counter(struct perf_stat_config *config,
> >>>>>       struct perf_cpu cpu;
> >>>>>       int idx;
> >>>>>
> >>>>> +     /* Skip counters that were speculatively/default enabled rather than requested. */
> >>>>> +     if (counter->skippable)
> >>>>> +             return true;
> >>>>> +
> >>>>>       /*
> >>>>>        * Skip value 0 when enabling --per-thread globally,
> >>>>>        * otherwise it will have too many 0 output.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ