linux-kernel - Re: [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon PMCU

Message-ID: <2d366bef-a891-6ee7-28bf-2091e0b78dbc@hisilicon.com>
Date:   Fri, 24 Mar 2023 17:32:15 +0800
From:   Jie Zhan <zhanjie9@...ilicon.com>
To:     Jonathan Cameron <Jonathan.Cameron@...wei.com>
CC:     <will@...nel.org>, <mark.rutland@....com>,
        <mathieu.poirier@...aro.org>, <suzuki.poulose@....com>,
        <mike.leach@...aro.org>, <leo.yan@...aro.org>,
        <john.g.garry@...cle.com>, <james.clark@....com>,
        <peterz@...radead.org>, <mingo@...hat.com>, <acme@...nel.org>,
        <corbet@....net>, <zhangshaokun@...ilicon.com>,
        <shenyang39@...wei.com>, <hejunhao3@...wei.com>,
        <yangyicong@...ilicon.com>, <prime.zeng@...wei.com>,
        <suntao25@...wei.com>, <jiazhao4@...ilicon.com>,
        <linuxarm@...wei.com>, <linux-doc@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>,
        <linux-arm-kernel@...ts.infradead.org>,
        <linux-perf-users@...r.kernel.org>
Subject: Re: [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon
 PMCU



On 17/03/2023 21:37, Jonathan Cameron wrote:
> On Mon, 6 Feb 2023 14:51:43 +0800
> Jie Zhan <zhanjie9@...ilicon.com> wrote:
>
>> Document the overview and usage of HiSilicon PMCU.
>>
>> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>> PMU accesses from CPUs, handling the configuration, event switching, and
>> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
>> scheme may lose events or drop sampling frequency. With PMCU, users can
>> reliably obtain the data of up to 240 PMU events with the sample interval
>> of events down to 1ms, while the software overhead of accessing PMUs, as
>> well as its impact on target workloads, is reduced.
>>
>> Signed-off-by: Jie Zhan <zhanjie9@...ilicon.com>
> Nice documentation. I've read this a few times before, but on this read
> through wondered if we could say anything about the skew between capture
> of the counters.  Not that important though so I'm happy to add
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@...wei.com>
>
> though this may of course need updating significantly as the interface
> is refined (the RFC question you raised for example in the cover letter).
>
> Thanks
>
> Jonathan
>
>> ---
>>   Documentation/admin-guide/perf/hisi-pmcu.rst | 183 +++++++++++++++++++
>>   Documentation/admin-guide/perf/index.rst     |   1 +
>>   2 files changed, 184 insertions(+)
>>   create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
>>
>> diff --git a/Documentation/admin-guide/perf/hisi-pmcu.rst b/Documentation/admin-guide/perf/hisi-pmcu.rst
>> new file mode 100644
>> index 000000000000..50d17cbd0049
>> --- /dev/null
>> +++ b/Documentation/admin-guide/perf/hisi-pmcu.rst
>> @@ -0,0 +1,183 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +==========================================
>> +HiSilicon Performance Monitor Control Unit
>> +==========================================
>> +
>> +Introduction
>> +============
>> +
>> +HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>> +PMU accesses from CPUs, handling the configuration, event switching, and
>> +counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>> +and multi-PMU-event CPU profiling, in which scenario the current ``perf``
>> +scheme may lose events or drop sampling frequency. With PMCU, users can
>> +reliably obtain the data of up to 240 PMU events with the sample interval
>> +of events down to 1ms, while the software overhead of accessing PMUs, as
>> +well as its impact on target workloads, is reduced.
>> +
>> +Each CPU die is equipped with a PMCU device. The PMCU driver registers it as a
>> +PMU device, named as ``hisi_pmcu_sccl<N>``, where ``<N>`` is the corresponding
>> +CPU die ID. When triggered, PMCU reads event IDs and passes them to PMUs in all
>> +CPUs on the CPU die it is on. PMCU then starts the counters for counting
>> +events, waits for a time interval, and stops them. The PMU counter readings are
>> +dumped from hardware to memory, i.e. perf AUX buffers, and further copied to
>> +the ``perf.data`` file in the user space. PMCU automatically switches events
>> +(when there are more events than available PMU counters) and completes multiple
>> +rounds of PMU event counting in one trigger.
>> +
>> +Hardware overview
>> +=================
>> +
>> +On Kunpeng SoC, each CPU die is equipped with a PMCU device. PMCU acts like an
>> +assistant to access the core PMUs on its die and move the counter readings to
>> +memory. An overview of PMCU's hardware organization is shown below::
>> +
>> +                                +--------------------+
>> +                                |       Memory       |
>> +                                | +------+ +-------+ |
>> +                   +--------+   | |Events| |Samples| |
>> +                   |  PMCU  |   | +------+ +-------+ |
>> +                   +---|----+   +---------|----------+
>> +                       |                  |
>> +        =======================================================  Bus
>> +                   |                         |               |
>> +        +----------|----------+   +----------|----------+    |
>> +        | +------+ | +------+ |   | +------+ | +------+ |    |
>> +        | |Core 0| | |Core 1| |   | |Core 0| | |Core 1| |    |
>> +        | +--|---+ | +--|---+ |   | +--|---+ | +--|---+ |  (More
>> +        |    +-----+----+     |   |    +-----+----+     |  clusters
>> +        | +--|---+   +--|---+ |   | +--|---+   +--|---+ |  ...)
>> +        | |Core 2|   |Core 3| |   | |Core 2|   |Core 3| |
>> +        | +------+   +------+ |   | +------+   +------+ |
>> +        |    CPU Cluster 0    |   |    CPU Cluster 1    |
>> +        +---------------------+   +---------------------+
>> +
>> +On Kunpeng SoC, a CPU die is formed of several CPU clusters and several
>> +CPUs per cluster. PMCU is able to access the core PMUs in these CPUs.
>> +The main job of PMCU is to fetch PMU event IDs from memory, make PMUs count the
>> +events for a while, and move the counter readings back to memory.
>> +
>> +Once triggered, PMCU performs a number of loops and processes a number of
>> +events in each loop. It fetches ``nr_pmu`` events from memory at a time, where
>> +``nr_pmu`` denotes the number of PMU counters to be used in each CPU. The
>> +``nr_pmu`` events are passed to the PMU counters of all CPUs on the CPU die
>> +where PMCU resides. Then, PMCU starts all the counters, waits for a period,
>> +stops all the counters, and moves the counter readings to memory, before
>> +handling the next ``nr_pmu`` events if there are more events to process in this
>> +loop. The number of loops and ``nr_pmu`` are determined by the driver, whereas
>> +the number of events to process depends on user inputs. The counters are
>> +stopped when PMCU reads counters and switches events, so there is a tiny time
>> +window during which the events are not counted.
> I'm not clear from this description whether there is 'skew' between the counters
> (beyond the normal issues from uarch).  Does the PMCU stop all counters
> then read them all (minimizing skew) or does it stop each CPU's set of counters
> and read those, or stop each individual counter before reading?
>
> My impression is that this feature is meant to be left running over timescales
> much longer than the sampling period so it may not be necessary to align the
> different lines on the resulting graphs perfectly.  Hence maybe this doesn't matter.
>
Thanks for pointing this out.

The PMCU stops all the counters before reading any of them (i.e. the
first case you described).

The basic procedure is:
     start counters -> wait -> stop counters -> read and reset counters
     -> switch events -> start counters -> ...
where each step applies to all CPUs and counters.
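
As a rough illustration of that ordering (a toy model only, not the driver
or hardware logic), the steps could be sketched in Python as below. The
function name, the event IDs and the nr_pmu value are made up for the
example; the only things taken from the description are the step order and
the fact that events are handled nr_pmu at a time.

```python
from typing import Iterator, Sequence


def pmcu_steps(event_ids: Sequence[int], nr_pmu: int, nr_loops: int) -> Iterator[str]:
    """Yield the step order for a toy model of one PMCU trigger.

    Every step is assumed to apply to all CPUs and counters at once.
    """
    if nr_pmu <= 0:
        raise ValueError("nr_pmu must be positive")
    # Split the event list into batches of at most nr_pmu events.
    batches = [list(event_ids[i:i + nr_pmu]) for i in range(0, len(event_ids), nr_pmu)]
    for loop in range(nr_loops):
        for batch in batches:
            yield f"loop {loop}: switch events to {batch}"
            yield f"loop {loop}: start counters"
            yield f"loop {loop}: wait"
            # All counters are stopped before any of them is read.
            yield f"loop {loop}: stop counters"
            yield f"loop {loop}: read and reset counters"


# Hypothetical input: three event IDs, two counters per CPU, one loop.
steps = list(pmcu_steps([0x11, 0x08, 0x03], nr_pmu=2, nr_loops=1))
for step in steps:
    print(step)
```

With three events and nr_pmu = 2, the model runs two rounds per loop, and
"stop counters" always precedes "read and reset counters".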

The counters don't count during the tiny stop-start window.
I guess a small improvement would be to keep the counters running and do:
     reset -> read -> switch -> reset -> ...
but we would still lose some event counts between read and reset, so there
is no fundamental difference.
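
To put that in rough numbers, here is a back-of-the-envelope Python sketch.
It assumes events arrive at a constant rate, which real workloads will not
do, and the window lengths are invented for illustration rather than
measured on hardware. Under that assumption both schemes have a blind
window (stop-start in one case, read-reset in the other), and the lost
share depends only on how long it is relative to the counting window.

```python
def lost_fraction(count_window_us: int, blind_window_us: int) -> float:
    """Share of a constant-rate event stream that falls in the blind window."""
    if count_window_us < 0 or blind_window_us < 0:
        raise ValueError("window lengths must be non-negative")
    period_us = count_window_us + blind_window_us
    if period_us == 0:
        raise ValueError("at least one window must be non-zero")
    return blind_window_us / period_us


# Hypothetical values: 1 ms counting window, 1 us blind window.
print(lost_fraction(1000, 1))
```

Shrinking the blind window reduces the lost share but never removes it,
which is the point made above.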

Regards,
Jie