[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <26103329-9d00-226f-6b85-386766814618@hisilicon.com>
Date: Sat, 25 Mar 2023 10:48:29 +0800
From: Jie Zhan <zhanjie9@...ilicon.com>
To: Jonathan Cameron <Jonathan.Cameron@...wei.com>
CC: <will@...nel.org>, <mark.rutland@....com>,
<mathieu.poirier@...aro.org>, <suzuki.poulose@....com>,
<mike.leach@...aro.org>, <leo.yan@...aro.org>,
<john.g.garry@...cle.com>, <james.clark@....com>,
<peterz@...radead.org>, <mingo@...hat.com>, <acme@...nel.org>,
<corbet@....net>, <shenyang39@...wei.com>, <hejunhao3@...wei.com>,
<yangyicong@...ilicon.com>, <prime.zeng@...wei.com>,
<suntao25@...wei.com>, <jiazhao4@...ilicon.com>,
<linuxarm@...wei.com>, <linux-doc@...r.kernel.org>,
<linux-kernel@...r.kernel.org>,
<linux-arm-kernel@...ts.infradead.org>,
<linux-perf-users@...r.kernel.org>
Subject: Re: [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon
PMCU
On 24/03/2023 20:14, Jonathan Cameron wrote:
> On Fri, 24 Mar 2023 17:32:15 +0800
> Jie Zhan <zhanjie9@...ilicon.com> wrote:
>
>> On 17/03/2023 21:37, Jonathan Cameron wrote:
>>> On Mon, 6 Feb 2023 14:51:43 +0800
>>> Jie Zhan <zhanjie9@...ilicon.com> wrote:
>>>
>>>> Document the overview and usage of HiSilicon PMCU.
>>>>
>>>> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>>>> PMU accesses from CPUs, handling the configuration, event switching, and
>>>> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>>>> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
>>>> scheme may lose events or drop sampling frequency. With PMCU, users can
>>>> reliably obtain the data of up to 240 PMU events with the sample interval
>>>> of events down to 1ms, while the software overhead of accessing PMUs, as
>>>> well as its impact on target workloads, is reduced.
>>>>
>>>> Signed-off-by: Jie Zhan <zhanjie9@...ilicon.com>
>>> Nice documentation. I've read this a few times before, but on this read
>>> through wondered if we could say anything about the skew between capture
>>> of the counters. Not that important though so I'm happy to add
>>>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@...wei.com>
>>>
>>> though this may of course need updating significantly as the interface
>>> is refined (the RFC question you raised for example in the cover letter).
>>>
>>> Thanks
>>>
>>> Jonathan
>>>
>>>> ---
>>>> Documentation/admin-guide/perf/hisi-pmcu.rst | 183 +++++++++++++++++++
>>>> Documentation/admin-guide/perf/index.rst | 1 +
>>>> 2 files changed, 184 insertions(+)
>>>> create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
>>>>
>>>> diff --git a/Documentation/admin-guide/perf/hisi-pmcu.rst b/Documentation/admin-guide/perf/hisi-pmcu.rst
>>>> new file mode 100644
>>>> index 000000000000..50d17cbd0049
>>>> --- /dev/null
>>>> +++ b/Documentation/admin-guide/perf/hisi-pmcu.rst
>>>> @@ -0,0 +1,183 @@
>>>> +.. SPDX-License-Identifier: GPL-2.0
>>>> +
>>>> +==========================================
>>>> +HiSilicon Performance Monitor Control Unit
>>>> +==========================================
>>>> +
>>>> +Introduction
>>>> +============
>>>> +
>>>> +HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>>>> +PMU accesses from CPUs, handling the configuration, event switching, and
>>>> +counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>>>> +and multi-PMU-event CPU profiling, in which scenario the current ``perf``
>>>> +scheme may lose events or drop sampling frequency. With PMCU, users can
>>>> +reliably obtain the data of up to 240 PMU events with the sample interval
>>>> +of events down to 1ms, while the software overhead of accessing PMUs, as
>>>> +well as its impact on target workloads, is reduced.
>>>> +
>>>> +Each CPU die is equipped with a PMCU device. The PMCU driver registers it as a
>>>> +PMU device, named as ``hisi_pmcu_sccl<N>``, where ``<N>`` is the corresponding
>>>> +CPU die ID. When triggered, PMCU reads event IDs and pass them to PMUs in all
>>>> +CPUs on the CPU die it is on. PMCU then starts the counters for counting
>>>> +events, waits for a time interval, and stops them. The PMU counter readings are
>>>> +dumped from hardware to memory, i.e. perf AUX buffers, and further copied to
>>>> +the ``perf.data`` file in the user space. PMCU automatically switches events
>>>> +(when there are more events than available PMU counters) and completes multiple
>>>> +rounds of PMU event counting in one trigger.
>>>> +
>>>> +Hardware overview
>>>> +=================
>>>> +
>>>> +On Kunpeng SoC, each CPU die is equipped with a PMCU device. PMCU acts like an
>>>> +assistant to access the core PMUs on its die and move the counter readings to
>>>> +memory. An overview of PMCU's hardware organization is shown below::
>>>> +
>>>> + +--------------------+
>>>> + | Memory |
>>>> + | +------+ +-------+ |
>>>> + +--------+ | |Events| |Samples| |
>>>> + | PMCU | | +------+ +-------+ |
>>>> + +---|----+ +---------|----------+
>>>> + | |
>>>> + ======================================================= Bus
>>>> + | | |
>>>> + +----------|----------+ +----------|----------+ |
>>>> + | +------+ | +------+ | | +------+ | +------+ | |
>>>> + | |Core 0| | |Core 1| | | |Core 0| | |Core 1| | |
>>>> + | +--|---+ | +--|---+ | | +--|---+ | +--|---+ | (More
>>>> + | +-----+----+ | | +-----+----+ | clusters
>>>> + | +--|---+ +--|---+ | | +--|---+ +--|---+ | ...)
>>>> + | |Core 2| |Core 3| | | |Core 2| |Core 3| |
>>>> + | +------+ +------+ | | +------+ +------+ |
>>>> + | CPU Cluster 0 | | CPU Cluster 1 |
>>>> + +---------------------+ +---------------------+
>>>> +
>>>> +On Kunpeng SoC, a CPU die is formed of several CPU clusters and several
>>>> +CPUs per cluster. PMCU is able to access the core PMUs in these CPUs.
>>>> +The main job of PMCU is to fetch PMU event IDs from memory, make PMUs count the
>>>> +events for a while, and move the counter readings back to memory.
>>>> +
>>>> +Once triggered, PMCU performs a number of loops and processes a number of
>>>> +events in each loop. It fetches ``nr_pmu`` events from memory at a time, where
>>>> +``nr_pmu`` denotes the number of PMU counters to be used in each CPU. The
>>>> +``nr_pmu`` events are passed to the PMU counters of all CPUs on the CPU die
>>>> +where PMCU resides. Then, PMCU starts all the counters, waits for a period,
>>>> +stops all the counters, and moves the counter readings to memory, before
>>>> +handling the next ``nr_pmu`` events if there are more events to process in this
>>>> +loop. The number of loops and ``nr_pmu`` are determined by the driver, whereas
>>>> +the number of events to process depends on user inputs. The counters are
>>>> +stopped when PMCU reads counters and switches events, so there is a tiny time
>>>> +window during which the events are not counted.
>>> I'm not clear from this description whether there is 'skew' between the counters
>>> (beyond the normal issues from uarch). Does the PMCU stop all counters
>>> then read them all (minimizing skew) or does it stop each CPUs set of counters
>>> and read those, or stop each individual counter before reading?
>>>
>>> My impression is that this feature is meant to be left running over timescales
>>> much longer than the sampling period so it may not be necessary to align the
>>> different lines on the resulting graphs perfectly. Hence maybe this doesn't matter.
>>>
>> Thanks for pointing this out.
>>
>> The PMCU stops all the counters before reading any counters (i.e. the
>> first case you said).
>>
>> The basic procedure is:
>> start counters -> wait -> stop counters -> read and reset counters
>> -> switch events -> start counters -> ...
>> where each step applys to all CPUs and counters.
> Great. So this is across all cores on a die so skew should be minimized
> (at a cost of missing more events than a skew heavy approach).
>
>> The counters don't count during the tiny stop-start window.
>> I guess a small improvement would be: reset -> read -> switch -> reset
>> -> ..., while the counters keep running,
>> but we still lose some event counts between read and reset, and thus, no
>> fundamental differrence.
> Lots of ways to reduce both skew and missed counts, but I think you are
> right in that none of them matter for the intended long term monitoring
> usecase.
>
> Jonathan
Yeah it focuses more on general workload characteristics than
time-senstive and
precise program analysis.
Jie
Powered by blists - more mailing lists