[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <8fbf8627-e7d3-4271-95ce-2c17c5933cc0@amd.com>
Date: Fri, 9 May 2025 14:10:09 +0530
From: Sandipan Das <sandipan.das@....com>
To: Ian Rogers <irogers@...gle.com>
Cc: linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Namhyung Kim <namhyung@...nel.org>, Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Adrian Hunter <adrian.hunter@...el.com>,
Kan Liang <kan.liang@...ux.intel.com>, Stephane Eranian
<eranian@...gle.com>, Ravi Bangoria <ravi.bangoria@....com>,
Ananth Narayan <ananth.narayan@....com>
Subject: Re: [PATCH 0/3] perf vendor events amd: Address event errata
On 5/8/2025 9:04 PM, Ian Rogers wrote:
> On Thu, May 8, 2025 at 3:56 AM Sandipan Das <sandipan.das@....com> wrote:
>>
>> On 5/7/2025 9:26 PM, Ian Rogers wrote:
>>> On Wed, May 7, 2025 at 7:28 AM Sandipan Das <sandipan.das@....com> wrote:
>>>>
>>>> Remove unreliable Zen 5 events and metrics. The following errata from
>>>> the Revision Guide for AMD Family 1Ah Models 00h-0Fh Processors have
>>>> been addressed.
>>>> #1569 PMCx078 Counts Incorrectly in Unpredictable Ways
>>>> #1583 PMCx18E May Overcount Instruction Cache Accesses
>>>> #1587 PMCx188 May Undercount IBS (Instruction Based Sampling) Fetch Events
>>>>
>>>> The document can be downloaded from
>>>> https://bugzilla.kernel.org/attachment.cgi?id=308095
>>>
>>> Hi Sandipan,
>>>
>>> the document is somewhat brief, for example:
>>> ```
>>> 1583 PMCx18E May Overcount Instruction Cache Accesses
>>>
>>> Description
>>> If PMCx18E[IcAccessTypes] is programmed to 18x (Instruction Cache
>>> Miss) or 1Fx (All Instruction Cache Accesses) then the performance
>>> counter may overcount.
>>>
>>> Potential Effect on System
>>> Inaccuracies in performance monitoring software may be experienced.
>>>
>>> Suggested Workaround
>>> None
>>>
>>> Fix Planned
>>> No fix planned
>>> ```
>>> Given being able to count instruction cache accesses (for example) is
>>> a useful feature, would it be possible to change:
>>> ```
>>> - {
>>> - "EventName": "ic_tag_hit_miss.instruction_cache_hit",
>>> - "EventCode": "0x18e",
>>> - "BriefDescription": "Instruction cache hits.",
>>> - "UMask": "0x07"
>>> - },
>>> ...
>>> ```
>>> to be say:
>>> ```
>>> {
>>> "EventName": "ic_tag_hit_miss.instruction_cache_hit",
>>> "EventCode": "0x18e",
>>> "BriefDescription": "Instruction cache hits. Note, this counter is
>>> affected by errata 1583.",
>>> "UMask": "0x07",
>>> "Experimental": "1"
>>> },
>>> ```
>>> That is rather than remove the event, the event is tagged as
>>> experimental (taken to mean accuracy isn't guaranteed) and the errata
>>> is explicitly noted in the description. Currently the Experimental tag
>>> has no impact on what happens in the perf tool, for example, the
>>> "Deprecated" tag hides events in the `perf list` command and is
>>> commonly used when an event is renamed.
>>>
>>
>> I agree that events like IC hits and misses are generally useful and am
>> fine with the idea of keeping them but my concern is that unless users
>> read the event description, there is no way for them to know if the
>> perf output that they are seeing may be unreliable. There is also no
>> guarantee that such events will be fixed in a future uarch. From a
>> quick glance, I couldn't find a mechanism that makes perf stat/report
>> show a warning for named events with known issues.
>
> So I'm forgetting the flow, but rediscovering it. We do have an Errata
> json value as shown in:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/pmu-events/arch/arm64/ampere/ampereone/memory.json?h=perf-tools-next#n2
> ```
> {
> "ArchStdEvent": "LD_RETIRED",
> "Errata": "Errata AC03_CPU_52",
> "BriefDescription": "Instruction architecturally executed,
> condition code check pass, load.
> Impacted by errata -"
> },
> ```
> It doesn't impact perf stat/record but it does get added to the event
> description for perf list:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/pmu-events/jevents.py?h=perf-tools-next#n340
> ```
> if 'Errata' in jd:
> extra_desc += ' Spec update: ' + jd['Errata']
> ```
> which means the perf list description ends up as "Instruction
> architecturally executed, condition code check pass, load. Impacted by
> errata - Spec update: Errata AC03_CPU_52". We could change this so
> that the Errata is distinct in the encoded in perf json and then we
> could display the errata when perf stat/record parses the event. I'd
> be a little worried about this breaking things that parse perf's text
> output, but the impact would be limited to Zen5, Ampere and older
> Intel CPUs. We could also make the errata output conditional on
> passing a verbose flag to perf. Would just `perf list` support work
> for you or would the perf stat/record changes be a requirement for
> keeping these events?
>
Adding "Errata" tags to the affected events is a good start.
We can sort out the perf stat/record changes eventually.
Powered by blists - more mailing lists