netdev - Re: Issue of metrics for multiple uncore PMUs (was Re: [RFC PATCH v2 23/23] perf metricgroup: remove duped metric group events)

Message-ID: <757974b3-62b0-2822-84fb-1e75907c6cc4@huawei.com>
Date:   Mon, 5 Oct 2020 11:03:36 +0100
From:   John Garry <john.garry@...wei.com>
To:     Ian Rogers <irogers@...gle.com>
CC:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Mark Rutland <mark.rutland@....com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...hat.com>,
        Namhyung Kim <namhyung@...nel.org>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        Andrii Nakryiko <andriin@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...omium.org>,
        Kajol Jain <kjain@...ux.ibm.com>,
        Andi Kleen <ak@...ux.intel.com>,
        Jin Yao <yao.jin@...ux.intel.com>,
        Kan Liang <kan.liang@...ux.intel.com>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Kim Phillips <kim.phillips@....com>,
        LKML <linux-kernel@...r.kernel.org>,
        Networking <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        linux-perf-users <linux-perf-users@...r.kernel.org>,
        Stephane Eranian <eranian@...gle.com>
Subject: Re: Issue of metrics for multiple uncore PMUs (was Re: [RFC PATCH v2
 23/23] perf metricgroup: remove duped metric group events)

On 02/10/2020 21:46, Ian Rogers wrote:
> On Fri, Oct 2, 2020 at 5:00 AM John Garry <john.garry@...wei.com> wrote:
>>
>> On 07/05/2020 15:08, Ian Rogers wrote:
>>
>> Hi Ian,
>>
>> I was wondering if you ever tested commit 2440689d62e9 ("perf
>> metricgroup: Remove duped metric group events") for when we have a
>> metric which aliases multiple instances of the same uncore PMU in the
>> system?
> 
> Sorry for this, I hadn't tested such a metric and wasn't aware of how
> the aliasing worked. I sent a fix for this issue here:
> https://lore.kernel.org/lkml/20200917201807.4090224-1-irogers@google.com/
> Could you see if this addresses the issue for you? I don't see the
> change in Arnaldo's trees yet.

Unfortunately this does not seem to fix my issue.

In that patch, you say you fix the metric expression for DRAM_BW_Use,
which is:

{
  "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
  "MetricExpr": "( 64 * ( uncore_imc@..._count_read@ + uncore_imc@..._count_write@ ) / 1000000000 ) / duration_time",
  "MetricGroup": "Memory_BW",
  "MetricName": "DRAM_BW_Use"
},

But this metric expression does not include any alias events; rather, I
think it is just the sum of the cas_count_read and cas_count_write event
counts for PMU uncore_imc, scaled and divided by duration_time.
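
To make the arithmetic concrete, here is a minimal sketch of what that
expression computes. The counter values and the function name are
hypothetical, and this is not perf code; the factor of 64 is taken from
the expression itself and is assumed here to be the bytes moved per CAS
operation.

```python
# Sketch of the DRAM_BW_Use arithmetic. Not perf code; the counts below
# are made up purely for illustration.

# Assumption: each CAS read/write moves 64 bytes, matching the "64 *"
# factor in the metric expression.
BYTES_PER_CAS = 64


def dram_bw_use(cas_count_read, cas_count_write, duration_time):
    """Return the average DRAM bandwidth in GB/s.

    duration_time is the measurement interval in seconds.
    """
    total_bytes = BYTES_PER_CAS * (cas_count_read + cas_count_write)
    return (total_bytes / 1_000_000_000) / duration_time


# Hypothetical counts collected over a 1 second run.
bw = dram_bw_use(cas_count_read=30_000_000,
                 cas_count_write=10_000_000,
                 duration_time=1.0)
print(f"{bw:.2f} GB/s")
```

Note that both operands name the PMU and its event directly, so no alias
lookup is involved in evaluating it.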

By alias I mean the following. As an example, we have this event:

     {
         "BriefDescription": "write requests to memory controller. Derived from unc_m_cas_count.wr",
         "Counter": "0,1,2,3",
         "EventCode": "0x4",
         "EventName": "LLC_MISSES.MEM_WRITE",
         "PerPkg": "1",
         "ScaleUnit": "64Bytes",
         "UMask": "0xC",
         "Unit": "iMC"
     },

And then reference LLC_MISSES.MEM_WRITE in a metric expression:

"MetricExpr": "LLC_MISSES.MEM_WRITE / duration_time",

This is what seems to be broken when the alias matches more than one PMU.
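
As a rough model of the situation, the sketch below expands one alias
name into one event per matching PMU instance, using the uncore_cbox
names and encodings from the verbose output quoted further down. The
dictionaries and function names are hypothetical and are not perf's data
structures; summing the per-instance counts is my assumption about how a
metric would want to consume such an alias.

```python
# Hypothetical model of alias expansion; not perf's implementation.

# Alias name -> event encoding, copied from the verbose perf output
# quoted later in this mail.
ALIASES = {
    "unc_cbo_xsnp_response.miss_eviction": "umask=0x81,event=0x22",
    "unc_cbo_xsnp_response.miss_xcore": "umask=0x41,event=0x22",
}

# PMU instances that the aliases matched in that output.
PMUS = ["uncore_cbox_0", "uncore_cbox_1"]


def expand_alias(alias, pmus=PMUS):
    """Return one event string per PMU instance the alias matches."""
    return [f"{pmu}/{ALIASES[alias]}/" for pmu in pmus]


def alias_total(alias, counts):
    """Sum the counts of every per-PMU event behind one alias.

    Assumption: a metric that references the alias wants this total.
    """
    return sum(counts[event] for event in expand_alias(alias))


xcore = "unc_cbo_xsnp_response.miss_xcore"
events = expand_alias(xcore)
print(len(events), "events for", xcore)

# Made-up per-instance counts, purely for illustration.
counts = {
    "uncore_cbox_0/umask=0x41,event=0x22/": 5,
    "uncore_cbox_1/umask=0x41,event=0x22/": 7,
}
print(alias_total(xcore, counts))
```

The point is only that a single name in the metric expression maps to
several events, so any lookup that expects exactly one event per name
will not match.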

Please check this.

Thanks,
John

> 
> Thanks,
> Ian
> 
>> I have been rebasing some of my arm64 perf work to v5.9-rc7, and find an
>> issue where find_evsel_group() fails for the uncore metrics under the
>> condition mentioned above.
>>
>> Unfortunately I don't have an x86 machine to which this test applies.
>> However, as an experiment, I added a test metric to my broadwell JSON:
>>
>> diff --git a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
>> b/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
>> index 8cdc7c13dc2a..fc6d9adf996a 100644
>> --- a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
>> +++ b/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
>> @@ -348,5 +348,11 @@
>>           "MetricExpr": "(cstate_pkg@c7\\-residency@ / msr@tsc@) * 100",
>>           "MetricGroup": "Power",
>>           "MetricName": "C7_Pkg_Residency"
>> +    },
>> +    {
>> +        "BriefDescription": "test metric",
>> +        "MetricExpr": "UNC_CBO_XSNP_RESPONSE.MISS_XCORE *
>> UNC_CBO_XSNP_RESPONSE.MISS_EVICTION",
>> +        "MetricGroup": "Test",
>> +        "MetricName": "test_metric_inc"
>>       }
>> ]
>>
>>
>> And get this:
>>
>> john@...alhost:~/linux/tools/perf> sudo ./perf stat -v -M
>> test_metric_inc sleep 1
>> Using CPUID GenuineIntel-6-3D-4
>> metric expr unc_cbo_xsnp_response.miss_xcore *
>> unc_cbo_xsnp_response.miss_eviction for test_metric_inc
>> found event unc_cbo_xsnp_response.miss_eviction
>> found event unc_cbo_xsnp_response.miss_xcore
>> adding
>> {unc_cbo_xsnp_response.miss_eviction,unc_cbo_xsnp_response.miss_xcore}:W
>> unc_cbo_xsnp_response.miss_eviction -> uncore_cbox_1/umask=0x81,event=0x22/
>> unc_cbo_xsnp_response.miss_eviction -> uncore_cbox_0/umask=0x81,event=0x22/
>> unc_cbo_xsnp_response.miss_xcore -> uncore_cbox_1/umask=0x41,event=0x22/
>> unc_cbo_xsnp_response.miss_xcore -> uncore_cbox_0/umask=0x41,event=0x22/
>> Cannot resolve test_metric_inc: unc_cbo_xsnp_response.miss_xcore *
>> unc_cbo_xsnp_response.miss_eviction
>> task-clock: 688876 688876 688876
>> context-switches: 2 688876 688876
>> cpu-migrations: 0 688876 688876
>> page-faults: 69 688876 688876
>> cycles: 2101719 695690 695690
>> instructions: 1180534 695690 695690
>> branches: 249450 695690 695690
>> branch-misses: 10815 695690 695690
>>
>> Performance counter stats for 'sleep 1':
>>
>>                0.69 msec task-clock                #    0.001 CPUs
>> utilized
>>                   2      context-switches          #    0.003 M/sec
>>
>>                   0      cpu-migrations            #    0.000 K/sec
>>
>>                  69      page-faults               #    0.100 M/sec
>>
>>           2,101,719      cycles                    #    3.051 GHz
>>
>>           1,180,534      instructions              #    0.56  insn per
>> cycle
>>             249,450      branches                  #  362.112 M/sec
>>
>>              10,815      branch-misses             #    4.34% of all
>> branches
>>
>>         1.001177693 seconds time elapsed
>>
>>         0.001149000 seconds user
>>         0.000000000 seconds sys
>>
>>
>> john@...alhost:~/linux/tools/perf>
>>
>>
>> Any idea what is going wrong here, before I have to dive in? The issue
>> seems to be this named commit.
>>
>> Thanks,
>> John
>>
>>> A metric group contains multiple metrics. These metrics may use the same
>>> events. If metrics use separate events then it leads to more
>>> multiplexing and overall metric counts fail to sum to 100%.
>>> Modify how metrics are associated with events so that if the events in
>>> an earlier group satisfy the current metric, the same events are used.
>>> A record of used events is kept and at the end of processing unnecessary
>>> events are eliminated.
>>>
>>> Before:
> .
>