Open Source and information security mailing list archives

Message-ID: <CAP-5=fUoD=s9yyVPgV7tqGwZsJVQMSmHKd8MV_vJW438AcK9qQ@mail.gmail.com>
Date:   Wed, 6 Dec 2023 08:35:23 -0800
From:   Ian Rogers <irogers@...gle.com>
To:     Arnaldo Carvalho de Melo <acme@...nel.org>
Cc:     Ayush Jain <ayush.jain3@....com>,
        Sandipan Das <sandipan.das@....com>,
        linux-kernel@...r.kernel.org, linux-perf-users@...r.kernel.org,
        peterz@...radead.org, Ingo Molnar <mingo@...nel.org>,
        mark.rutland@....com, alexander.shishkin@...ux.intel.com,
        Jiri Olsa <jolsa@...nel.org>,
        Namhyung Kim <namhyung@...nel.org>,
        Adrian Hunter <adrian.hunter@...el.com>, kjain@...ux.ibm.com,
        atrajeev@...ux.vnet.ibm.com, barnali@...ux.ibm.com,
        ananth.narayan@....com, ravi.bangoria@....com,
        santosh.shukla@....com
Subject: Re: [PATCH] perf test: Retry without grouping for all metrics test

On Wed, Dec 6, 2023 at 5:08 AM Arnaldo Carvalho de Melo <acme@...nel.org> wrote:
>
> Em Wed, Jun 14, 2023 at 05:08:21PM +0530, Ayush Jain escreveu:
> > On 6/14/2023 2:37 PM, Sandipan Das wrote:
> > > There are cases where a metric uses more events than the number of
> > > counters. E.g. AMD Zen, Zen 2 and Zen 3 processors have four data fabric
> > > counters but the "nps1_die_to_dram" metric has eight events. By default,
> > > the constituent events are placed in a group. Since the events cannot be
> > > scheduled at the same time, the metric is not computed. The all metrics
> > > test also fails because of this.
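> > >
> > > The constraint above can be sketched with a toy model (illustrative
> > > only, not perf's actual scheduler): a group counts only if every
> > > member fits on a hardware counter at the same time, while ungrouped
> > > events can be rotated (multiplexed) across timeslices.

```python
# Toy model of the scheduling constraint (not perf internals): a group of
# events only counts if every member fits on a hardware counter at once;
# ungrouped events are instead rotated (multiplexed) across timeslices.
NUM_DF_COUNTERS = 4  # AMD Zen / Zen 2 / Zen 3 data fabric counters

def group_schedulable(events, counters=NUM_DF_COUNTERS):
    """A group schedules only if it needs no more counters than exist."""
    return len(events) <= counters

# The nps1_die_to_dram metric groups eight data fabric events by default.
nps1_events = [f"dram_channel_data_controller_{i}" for i in range(8)]

print(group_schedulable(nps1_events))      # eight events, four counters
print(group_schedulable(nps1_events[:4]))  # a four-event group would fit
```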
>
> Hmm, I'm not able to reproduce the problem here before applying
> this patch:
>
> [root@...e ~]# grep -m1 "model name" /proc/cpuinfo
> model name      : AMD Ryzen 9 5950X 16-Core Processor
> [root@...e ~]# perf test -vvv "perf all metrics test"
> 104: perf all metrics test                                           :
> --- start ---
> test child forked, pid 1379713
> Testing branch_misprediction_ratio
> Testing all_remote_links_outbound
> Testing nps1_die_to_dram
> Testing macro_ops_dispatched
> Testing all_l2_cache_accesses
> Testing all_l2_cache_hits
> Testing all_l2_cache_misses
> Testing ic_fetch_miss_ratio
> Testing l2_cache_accesses_from_l2_hwpf
> Testing l2_cache_misses_from_l2_hwpf
> Testing op_cache_fetch_miss_ratio
> Testing l3_read_miss_latency
> Testing l1_itlb_misses
> test child finished with 0
> ---- end ----
> perf all metrics test: Ok
> [root@...e ~]#

Please don't apply the patch. The patch masks a bug in metrics/PMUs
and the proper fix was:
8d40f74ebf21 perf vendor events amd: Fix large metrics
https://lore.kernel.org/r/20230706063440.54189-1-sandipan.das@amd.com

> [root@...e ~]# perf stat -M nps1_die_to_dram -a sleep 2
>
>  Performance counter stats for 'system wide':
>
>                  0      dram_channel_data_controller_4   #  10885.3 MiB  nps1_die_to_dram       (49.96%)
>         31,334,338      dram_channel_data_controller_1                                          (50.01%)
>                  0      dram_channel_data_controller_6                                          (50.04%)
>         54,679,601      dram_channel_data_controller_3                                          (50.04%)
>         38,420,402      dram_channel_data_controller_0                                          (50.04%)
>                  0      dram_channel_data_controller_5                                          (49.99%)
>         54,012,661      dram_channel_data_controller_2                                          (49.96%)
>                  0      dram_channel_data_controller_7                                          (49.96%)
>
>        2.001465439 seconds time elapsed
>
> [root@...e ~]#
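>
> As a sanity check on the output above, the nps1_die_to_dram value is the
> sum of the eight event counts scaled to MiB. The 6.1e-5 MiB-per-count
> constant below is an assumption taken from the AMD vendor event JSON's
> ScaleUnit (roughly one 64-byte transfer per event, divided by 2^20):

```python
# Reproduce the 10885.3 MiB figure from the per-event counts printed above.
# MIB_PER_COUNT is an assumed ScaleUnit from the AMD vendor event JSON
# (~64 bytes per event / 2**20), not something stated in this thread.
counts = [0, 31_334_338, 0, 54_679_601, 38_420_402, 0, 54_012_661, 0]
MIB_PER_COUNT = 6.1e-5

nps1_die_to_dram = sum(counts) * MIB_PER_COUNT
print(f"{nps1_die_to_dram:.1f} MiB")  # matches the 10885.3 MiB perf printed
```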
>
> [root@...e ~]# perf stat -v -M nps1_die_to_dram -a sleep 2
> Using CPUID AuthenticAMD-25-21-0
> metric expr dram_channel_data_controller_0 + dram_channel_data_controller_1 + dram_channel_data_controller_2 + dram_channel_data_controller_3 + dram_channel_data_controller_4 + dram_channel_data_controller_5 + dram_channel_data_controller_6 + dram_channel_data_controller_7 for nps1_die_to_dram
> found event dram_channel_data_controller_4
> found event dram_channel_data_controller_1
> found event dram_channel_data_controller_6
> found event dram_channel_data_controller_3
> found event dram_channel_data_controller_0
> found event dram_channel_data_controller_5
> found event dram_channel_data_controller_2
> found event dram_channel_data_controller_7
> Parsing metric events 'dram_channel_data_controller_4/metric-id=dram_channel_data_controller_4/,dram_channel_data_controller_1/metric-id=dram_channel_data_controller_1/,dram_channel_data_controller_6/metric-id=dram_channel_data_controller_6/,dram_channel_data_controller_3/metric-id=dram_channel_data_controller_3/,dram_channel_data_controller_0/metric-id=dram_channel_data_controller_0/,dram_channel_data_controller_5/metric-id=dram_channel_data_controller_5/,dram_channel_data_controller_2/metric-id=dram_channel_data_controller_2/,dram_channel_data_controller_7/metric-id=dram_channel_data_controller_7/'
> dram_channel_data_controller_4 -> amd_df/metric-id=dram_channel_data_controller_4,dram_channel_data_controller_4/
> dram_channel_data_controller_1 -> amd_df/metric-id=dram_channel_data_controller_1,dram_channel_data_controller_1/
> Multiple errors dropping message: Cannot find PMU `dram_channel_data_controller_1'. Missing kernel support? (<no help>)
> dram_channel_data_controller_6 -> amd_df/metric-id=dram_channel_data_controller_6,dram_channel_data_controller_6/
> Multiple errors dropping message: Cannot find PMU `dram_channel_data_controller_6'. Missing kernel support? (<no help>)
> dram_channel_data_controller_3 -> amd_df/metric-id=dram_channel_data_controller_3,dram_channel_data_controller_3/
> Multiple errors dropping message: Cannot find PMU `dram_channel_data_controller_3'. Missing kernel support? (<no help>)
> dram_channel_data_controller_0 -> amd_df/metric-id=dram_channel_data_controller_0,dram_channel_data_controller_0/
> Multiple errors dropping message: Cannot find PMU `dram_channel_data_controller_0'. Missing kernel support? (<no help>)
> dram_channel_data_controller_5 -> amd_df/metric-id=dram_channel_data_controller_5,dram_channel_data_controller_5/
> Multiple errors dropping message: Cannot find PMU `dram_channel_data_controller_5'. Missing kernel support? (<no help>)
> dram_channel_data_controller_2 -> amd_df/metric-id=dram_channel_data_controller_2,dram_channel_data_controller_2/
> Multiple errors dropping message: Cannot find PMU `dram_channel_data_controller_2'. Missing kernel support? (<no help>)
> dram_channel_data_controller_7 -> amd_df/metric-id=dram_channel_data_controller_7,dram_channel_data_controller_7/
> Matched metric-id dram_channel_data_controller_4 to dram_channel_data_controller_4
> Matched metric-id dram_channel_data_controller_1 to dram_channel_data_controller_1
> Matched metric-id dram_channel_data_controller_6 to dram_channel_data_controller_6
> Matched metric-id dram_channel_data_controller_3 to dram_channel_data_controller_3
> Matched metric-id dram_channel_data_controller_0 to dram_channel_data_controller_0
> Matched metric-id dram_channel_data_controller_5 to dram_channel_data_controller_5
> Matched metric-id dram_channel_data_controller_2 to dram_channel_data_controller_2
> Matched metric-id dram_channel_data_controller_7 to dram_channel_data_controller_7
> Control descriptor is not initialized
> dram_channel_data_controller_4: 0 2001175127 999996394
> dram_channel_data_controller_1: 32346663 2001169897 1000709803
> dram_channel_data_controller_6: 0 2001168377 1001193443
> dram_channel_data_controller_3: 47551247 2001166947 1001198122
> dram_channel_data_controller_0: 38975242 2001165217 1001182923
> dram_channel_data_controller_5: 0 2001163067 1000464054
> dram_channel_data_controller_2: 49934162 2001160907 999974934
> dram_channel_data_controller_7: 0 2001150317 999968825
>
>  Performance counter stats for 'system wide':
>
>                  0      dram_channel_data_controller_4   #  10297.2 MiB  nps1_die_to_dram       (49.97%)
>         32,346,663      dram_channel_data_controller_1                                          (50.01%)
>                  0      dram_channel_data_controller_6                                          (50.03%)
>         47,551,247      dram_channel_data_controller_3                                          (50.03%)
>         38,975,242      dram_channel_data_controller_0                                          (50.03%)
>                  0      dram_channel_data_controller_5                                          (49.99%)
>         49,934,162      dram_channel_data_controller_2                                          (49.97%)
>                  0      dram_channel_data_controller_7                                          (49.97%)
>
>        2.001196512 seconds time elapsed
>
> [root@...e ~]#
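>
> The percentages in parentheses follow directly from the verbose
> "count / time enabled / time running" triples printed above: each event
> ran for roughly half the interval because eight events share four
> counters. A quick check against two of the triples:

```python
# Derive perf's multiplexing percentages from the "count enabled running"
# debug triples shown by perf stat -v above: pct = running / enabled * 100.
triples = {
    "dram_channel_data_controller_4": (0,        2001175127, 999996394),
    "dram_channel_data_controller_1": (32346663, 2001169897, 1000709803),
}

for name, (count, enabled, running) in triples.items():
    # Matches the (49.97%) and (50.01%) annotations in the stat output.
    print(f"{name}: ({running / enabled * 100:.2f}%)")
```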
>
> What am I missing?
>
> Ian, I also stumbled on this:
>
> [root@...e ~]# perf stat -M dram_channel_data_controller_4
> Cannot find metric or group `dram_channel_data_controller_4'
> ^C
>  Performance counter stats for 'system wide':
>
>         284,908.91 msec cpu-clock                        #   32.002 CPUs utilized
>          6,485,456      context-switches                 #   22.763 K/sec
>                719      cpu-migrations                   #    2.524 /sec
>             32,800      page-faults                      #  115.125 /sec
>    189,779,273,552      cycles                           #    0.666 GHz                         (83.33%)
>      2,893,165,259      stalled-cycles-frontend          #    1.52% frontend cycles idle        (83.33%)
>     24,807,157,349      stalled-cycles-backend           #   13.07% backend cycles idle         (83.33%)
>     99,286,488,807      instructions                     #    0.52  insn per cycle
>                                                   #    0.25  stalled cycles per insn     (83.33%)
>     24,120,737,678      branches                         #   84.661 M/sec                       (83.33%)
>      1,907,540,278      branch-misses                    #    7.91% of all branches             (83.34%)
>
>        8.902784776 seconds time elapsed
>
>
> [root@...e ~]#
> [root@...e ~]# perf stat -e dram_channel_data_controller_4
> ^C
>  Performance counter stats for 'system wide':
>
>                  0      dram_channel_data_controller_4
>
>        1.189638741 seconds time elapsed
>
>
> [root@...e ~]#
>
> I.e. -M should bail out at that point (Cannot find metric or group `dram_channel_data_controller_4'), no?

We could. I suspect the code has always just not bailed out. I'll put
together a patch adding the bail out.

Thanks,
Ian

> - Arnaldo
>
> > > Before announcing failure, the test can try multiple options for each
> > > available metric. After system-wide mode fails, retry once again with
> > > the "--metric-no-group" option.
> > >
> > > E.g.
> > >
> > >    $ sudo perf test -v 100
> > >
> > > Before:
> > >
> > >    100: perf all metrics test                                           :
> > >    --- start ---
> > >    test child forked, pid 672731
> > >    Testing branch_misprediction_ratio
> > >    Testing all_remote_links_outbound
> > >    Testing nps1_die_to_dram
> > >    Metric 'nps1_die_to_dram' not printed in:
> > >    Error:
> > >    Invalid event (dram_channel_data_controller_4) in per-thread mode, enable system wide with '-a'.
> > >    Testing macro_ops_dispatched
> > >    Testing all_l2_cache_accesses
> > >    Testing all_l2_cache_hits
> > >    Testing all_l2_cache_misses
> > >    Testing ic_fetch_miss_ratio
> > >    Testing l2_cache_accesses_from_l2_hwpf
> > >    Testing l2_cache_misses_from_l2_hwpf
> > >    Testing op_cache_fetch_miss_ratio
> > >    Testing l3_read_miss_latency
> > >    Testing l1_itlb_misses
> > >    test child finished with -1
> > >    ---- end ----
> > >    perf all metrics test: FAILED!
> > >
> > > After:
> > >
> > >    100: perf all metrics test                                           :
> > >    --- start ---
> > >    test child forked, pid 672887
> > >    Testing branch_misprediction_ratio
> > >    Testing all_remote_links_outbound
> > >    Testing nps1_die_to_dram
> > >    Testing macro_ops_dispatched
> > >    Testing all_l2_cache_accesses
> > >    Testing all_l2_cache_hits
> > >    Testing all_l2_cache_misses
> > >    Testing ic_fetch_miss_ratio
> > >    Testing l2_cache_accesses_from_l2_hwpf
> > >    Testing l2_cache_misses_from_l2_hwpf
> > >    Testing op_cache_fetch_miss_ratio
> > >    Testing l3_read_miss_latency
> > >    Testing l1_itlb_misses
> > >    test child finished with 0
> > >    ---- end ----
> > >    perf all metrics test: Ok
> > >
> >
> > The issue gets resolved after applying this patch.
> >
> >   $ ./perf test 102 -vvv
> >   102: perf all metrics test                                           :
> >   --- start ---
> >   test child forked, pid 244991
> >   Testing branch_misprediction_ratio
> >   Testing all_remote_links_outbound
> >   Testing nps1_die_to_dram
> >   Testing all_l2_cache_accesses
> >   Testing all_l2_cache_hits
> >   Testing all_l2_cache_misses
> >   Testing ic_fetch_miss_ratio
> >   Testing l2_cache_accesses_from_l2_hwpf
> >   Testing l2_cache_misses_from_l2_hwpf
> >   Testing l3_read_miss_latency
> >   Testing l1_itlb_misses
> >   test child finished with 0
> >   ---- end ----
> >   perf all metrics test: Ok
> >
> > > Reported-by: Ayush Jain <ayush.jain3@....com>
> > > Signed-off-by: Sandipan Das <sandipan.das@....com>
> >
> > Tested-by: Ayush Jain <ayush.jain3@....com>
> >
> > > ---
> > >   tools/perf/tests/shell/stat_all_metrics.sh | 7 +++++++
> > >   1 file changed, 7 insertions(+)
> > >
> > > diff --git a/tools/perf/tests/shell/stat_all_metrics.sh b/tools/perf/tests/shell/stat_all_metrics.sh
> > > index 54774525e18a..1e88ea8c5677 100755
> > > --- a/tools/perf/tests/shell/stat_all_metrics.sh
> > > +++ b/tools/perf/tests/shell/stat_all_metrics.sh
> > > @@ -16,6 +16,13 @@ for m in $(perf list --raw-dump metrics); do
> > >     then
> > >       continue
> > >     fi
> > > +  # Failed again, possibly there are not enough counters so retry system wide
> > > +  # mode but without event grouping.
> > > +  result=$(perf stat -M "$m" --metric-no-group -a sleep 0.01 2>&1)
> > > +  if [[ "$result" =~ ${m:0:50} ]]
> > > +  then
> > > +    continue
> > > +  fi
> > >     # Failed again, possibly the workload was too small so retry with something
> > >     # longer.
> > >     result=$(perf stat -M "$m" perf bench internals synthesize 2>&1)
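> > >
> > > The retry's pass condition hinges on bash's ${m:0:50} substring
> > > expansion and the =~ regex match: the metric counts as printed if the
> > > first 50 characters of its name appear anywhere in perf's output. A
> > > self-contained sketch of that check (the result string below is a
> > > stand-in for real perf stat output, not captured from a live run):

```shell
#!/bin/bash
# Sketch of the test script's pass condition: a metric is considered
# printed when the first 50 characters of its name match somewhere in
# the captured perf stat output.
m="nps1_die_to_dram"
result="          0      dram_channel_data_controller_4  #  10885.3 MiB  nps1_die_to_dram  (49.96%)"

# ${m:0:50} truncates long metric names; =~ treats it as a regex pattern.
if [[ "$result" =~ ${m:0:50} ]]
then
  echo "metric printed: retry succeeded"
else
  echo "metric missing: fall through to next retry"
fi
```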
> >
> > Thanks & Regards,
> > Ayush Jain
>
> --
>
> - Arnaldo
