lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ec744c86-b73e-417a-8e3a-c07142bf37d1@linux.intel.com>
Date: Mon, 15 Jul 2024 16:05:27 -0400
From: "Liang, Kan" <kan.liang@...ux.intel.com>
To: kernel test robot <oliver.sang@...el.com>, Ian Rogers <irogers@...gle.com>
Cc: oe-lkp@...ts.linux.dev, lkp@...el.com,
 Linux Memory Management List <linux-mm@...ck.org>,
 Namhyung Kim <namhyung@...nel.org>, Weilin Wang <weilin.wang@...el.com>,
 Caleb Biggers <caleb.biggers@...el.com>,
 Alexandre Torgue <alexandre.torgue@...s.st.com>,
 Maxime Coquelin <mcoquelin.stm32@...il.com>,
 linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [linux-next:master] [perf vendor events] e2641db83f:
 perf-sanity-tests.perf_all_PMU_test.fail

Hi Ian,

On 2024-07-10 12:59 a.m., kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed "perf-sanity-tests.perf_all_PMU_test.fail" on:
> 
> commit: e2641db83f18782f57a0e107c50d2d1731960fb8 ("perf vendor events: Add/update skylake events/metrics")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> [test failed on linux-next/master 82d01fe6ee52086035b201cfa1410a3b04384257]
> 
> in testcase: perf-sanity-tests
> version: 
> with following parameters:
> 
> 	perf_compiler: gcc
> 
> 
> 
> compiler: gcc-13
> test machine: 16 threads 1 sockets Intel(R) Xeon(R) E-2278G CPU @ 3.40GHz (Coffee Lake) with 32G memory
> 
> (please refer to attached dmesg/kmsg for entire log/backtrace)
> 
> 
> we also observed two cases which also failed on parent can pass on this commit.
> FYI.
> 
> 
> caccae3ce7b988b6 e2641db83f18782f57a0e107c50
> ---------------- ---------------------------
>        fail:runs  %reproduction    fail:runs
>            |             |             |
>            :6          100%           6:6     perf-sanity-tests.perf_all_PMU_test.fail
>            :6          100%           6:6     perf-sanity-tests.perf_all_metricgroups_test.pass
>            :6          100%           6:6     perf-sanity-tests.perf_all_metrics_test.pass
> 
> 
> 
> 
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@...el.com>
> | Closes: https://lore.kernel.org/oe-lkp/202407101021.2c8baddb-oliver.sang@intel.com
> 
> 
> 
> 2024-07-09 07:09:53 sudo /usr/src/linux-perf-x86_64-rhel-8.3-bpf-e2641db83f18782f57a0e107c50d2d1731960fb8/tools/perf/perf test 105
> 105: perf all metricgroups test                                      : Ok
> 2024-07-09 07:10:11 sudo /usr/src/linux-perf-x86_64-rhel-8.3-bpf-e2641db83f18782f57a0e107c50d2d1731960fb8/tools/perf/perf test 106
> 106: perf all metrics test                                           : Ok
> 2024-07-09 07:10:23 sudo /usr/src/linux-perf-x86_64-rhel-8.3-bpf-e2641db83f18782f57a0e107c50d2d1731960fb8/tools/perf/perf test 107
> 107: perf all libpfm4 events test                                    : Ok
> 2024-07-09 07:10:47 sudo /usr/src/linux-perf-x86_64-rhel-8.3-bpf-e2641db83f18782f57a0e107c50d2d1731960fb8/tools/perf/perf test 108
> 108: perf all PMU test                                               : FAILED!
>

The failure is caused by the below change in the e2641db83f18.

+    {
+        "BriefDescription": "This 48-bit fixed counter counts the UCLK
cycles",
+        "Counter": "FIXED",
+        "EventCode": "0xff",
+        "EventName": "UNC_CLOCK.SOCKET",
+        "PerPkg": "1",
+        "PublicDescription": "This 48-bit fixed counter counts the UCLK
cycles.",
+        "Unit": "cbox_0"
     }

The other cbox events have the unit name "CBOX", while the fixed counter
has a unit name "cbox_0". So the events_table will maintain separate
entries for cbox and cbox_0.

The perf_pmus__print_pmu_events() calculate the total number of events,
allocate an aliases buffer, store all the events into the buffer, sort,
and print all the aliases one by one.

The problem is that the calculated total number of events doesn't match
the stored events on the SKL machine.

The perf_pmu__num_events() is used to calculate the number of events. It
invokes the pmu_events_table__num_events() to go through the entire
events_table to find all events. Because of the
pmu_uncore_alias_match(), the suffix of uncore PMU will be ignored. So
the events for cbox and cbox_0 are all counted.

When storing events into the aliases buffer, the
perf_pmu__for_each_event() only process the events for cbox.

Since a bigger buffer was allocated, the last entry are all 0.
When printing all the aliases, null will be outputed.

$ perf list pmu

List of pre-defined events (to be used in -e or -M):

  (null)                                             [Kernel PMU event]
  branch-instructions OR cpu/branch-instructions/    [Kernel PMU event]
  branch-misses OR cpu/branch-misses/                [Kernel PMU event]


I'm thinking of two ways to address it.
One is to only print all the stored events. The below patch can fix it.

diff --git a/tools/perf/util/pmus.c b/tools/perf/util/pmus.c
index 3fcabfd8fca1..2b2f5117ff84 100644
--- a/tools/perf/util/pmus.c
+++ b/tools/perf/util/pmus.c
@@ -485,6 +485,7 @@ void perf_pmus__print_pmu_events(const struct
print_callbacks *print_cb, void *p
 		perf_pmu__for_each_event(pmu, skip_duplicate_pmus, &state,
 					perf_pmus__print_pmu_events__callback);
 	}
+	len = state.index;
 	qsort(aliases, len, sizeof(struct sevent), cmp_sevent);
 	for (int j = 0; j < len; j++) {
 		/* Skip duplicates */

The only drawback is that perf list will not show the new cbox_0 event.
(But the event name still works. Users can still apply perf stat -e
unc_clock.socket.)

Since the cbox_0 event is only available on old machines (SKL and
earlier), people should already use the equivalent kernel event. It
doesn't sounds a big issue for me. I prefer this simple fix.

I think the other way would be to modify the perf_pmu__for_each_event()
to go through all the possible PMUs.
It seems complicated and may impact others ARCHs (e.g., S390). I haven't
tried it yet.

What do you think?
Do you see any other ways to address the issue?

Thanks,
Kan

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ