linux-kernel - Re: [PATCH] perf test: Retry without grouping for all metrics test

Message-ID: <CAP-5=fV9Fx99QmKWSqqDK23vF0dcTS+g-r-9zr6q0A2ZXWmCBw@mail.gmail.com>
Date:   Wed, 14 Jun 2023 09:40:05 -0700
From:   Ian Rogers <irogers@...gle.com>
To:     Sandipan Das <sandipan.das@....com>
Cc:     linux-kernel@...r.kernel.org, linux-perf-users@...r.kernel.org,
        peterz@...radead.org, mingo@...hat.com, acme@...nel.org,
        mark.rutland@....com, alexander.shishkin@...ux.intel.com,
        jolsa@...nel.org, namhyung@...nel.org, adrian.hunter@...el.com,
        kjain@...ux.ibm.com, atrajeev@...ux.vnet.ibm.com,
        barnali@...ux.ibm.com, ayush.jain3@....com, ananth.narayan@....com,
        ravi.bangoria@....com, santosh.shukla@....com
Subject: Re: [PATCH] perf test: Retry without grouping for all metrics test

On Wed, Jun 14, 2023 at 2:07 AM Sandipan Das <sandipan.das@....com> wrote:
>
> There are cases where a metric uses more events than the number of
> counters. E.g. AMD Zen, Zen 2 and Zen 3 processors have four data fabric
> counters but the "nps1_die_to_dram" metric has eight events. By default,
> the constituent events are placed in a group. Since the events cannot be
> scheduled at the same time, the metric is not computed. The all metrics
> test also fails because of this.

Thanks, Sandipan. This exposes a bug in the AMD data fabric PMU
driver. When the events are added, the driver should create a fake PMU,
check that adding the group is valid, and fail if it is not. The tool
picks up that failure and removes the group.

I appreciate that such a fix would need a time machine to work. To
work around the issue in the metrics, add:
"MetricConstraint": "NO_GROUP_EVENTS",
to each affected metric in the json.
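
For illustration only, a metric entry carrying the constraint might look
roughly like the fragment below. Only the metric name and the
"MetricConstraint" line come from this thread; the "MetricExpr" value is a
placeholder rather than the real expression, and the entry's other existing
fields are left out.

```json
{
    "MetricName": "nps1_die_to_dram",
    "MetricExpr": "...",
    "MetricConstraint": "NO_GROUP_EVENTS"
}
```

With a constraint like this in place, the tool would be expected to open the
metric's events without placing them in a group, much as "--metric-no-group"
does on the command line.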

> Before announcing failure, the test can try multiple options for each
> available metric. After system-wide mode fails, retry once again with
> the "--metric-no-group" option.
>
> E.g.
>
>   $ sudo perf test -v 100
>
> Before:
>
>   100: perf all metrics test                                           :
>   --- start ---
>   test child forked, pid 672731
>   Testing branch_misprediction_ratio
>   Testing all_remote_links_outbound
>   Testing nps1_die_to_dram
>   Metric 'nps1_die_to_dram' not printed in:
>   Error:
>   Invalid event (dram_channel_data_controller_4) in per-thread mode, enable system wide with '-a'.

This error doesn't relate to grouping, so I'm confused about having it
in the commit message, aside from the test failure.

Thanks,
Ian

>   Testing macro_ops_dispatched
>   Testing all_l2_cache_accesses
>   Testing all_l2_cache_hits
>   Testing all_l2_cache_misses
>   Testing ic_fetch_miss_ratio
>   Testing l2_cache_accesses_from_l2_hwpf
>   Testing l2_cache_misses_from_l2_hwpf
>   Testing op_cache_fetch_miss_ratio
>   Testing l3_read_miss_latency
>   Testing l1_itlb_misses
>   test child finished with -1
>   ---- end ----
>   perf all metrics test: FAILED!
>
> After:
>
>   100: perf all metrics test                                           :
>   --- start ---
>   test child forked, pid 672887
>   Testing branch_misprediction_ratio
>   Testing all_remote_links_outbound
>   Testing nps1_die_to_dram
>   Testing macro_ops_dispatched
>   Testing all_l2_cache_accesses
>   Testing all_l2_cache_hits
>   Testing all_l2_cache_misses
>   Testing ic_fetch_miss_ratio
>   Testing l2_cache_accesses_from_l2_hwpf
>   Testing l2_cache_misses_from_l2_hwpf
>   Testing op_cache_fetch_miss_ratio
>   Testing l3_read_miss_latency
>   Testing l1_itlb_misses
>   test child finished with 0
>   ---- end ----
>   perf all metrics test: Ok
>
> Reported-by: Ayush Jain <ayush.jain3@....com>
> Signed-off-by: Sandipan Das <sandipan.das@....com>
> ---
>  tools/perf/tests/shell/stat_all_metrics.sh | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/tools/perf/tests/shell/stat_all_metrics.sh b/tools/perf/tests/shell/stat_all_metrics.sh
> index 54774525e18a..1e88ea8c5677 100755
> --- a/tools/perf/tests/shell/stat_all_metrics.sh
> +++ b/tools/perf/tests/shell/stat_all_metrics.sh
> @@ -16,6 +16,13 @@ for m in $(perf list --raw-dump metrics); do
>    then
>      continue
>    fi
> +  # Failed again, possibly there are not enough counters so retry system wide
> +  # mode but without event grouping.
> +  result=$(perf stat -M "$m" --metric-no-group -a sleep 0.01 2>&1)
> +  if [[ "$result" =~ ${m:0:50} ]]
> +  then
> +    continue
> +  fi
>    # Failed again, possibly the workload was too small so retry with something
>    # longer.
>    result=$(perf stat -M "$m" perf bench internals synthesize 2>&1)
> --
> 2.34.1
>