Message-ID: <d72fc2bd-6200-1ee6-31d3-d81d7b669453@os.amperecomputing.com>
Date: Thu, 29 Jan 2026 14:20:03 -0800 (PST)
From: Ilkka Koskinen <ilkka@...amperecomputing.com>
To: Besar Wicaksono <bwicaksono@...dia.com>
cc: will@...nel.org, suzuki.poulose@....com, robin.murphy@....com,
ilkka@...amperecomputing.com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-tegra@...r.kernel.org,
mark.rutland@....com, treding@...dia.com, jonathanh@...dia.com,
vsethi@...dia.com, rwiley@...dia.com, sdonthineni@...dia.com,
skelley@...dia.com, ywan@...dia.com, mochs@...dia.com, nirmoyd@...dia.com
Subject: Re: [PATCH 2/8] perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU
Hi Besar,
On Mon, 26 Jan 2026, Besar Wicaksono wrote:
> Add Unified Coherence Fabric PMU support to the Tegra410 SoC.
>
> Signed-off-by: Besar Wicaksono <bwicaksono@...dia.com>
Looks good to me
Reviewed-by: Ilkka Koskinen <ilkka@...amperecomputing.com>
--Ilkka
> ---
> Documentation/admin-guide/perf/index.rst | 1 +
> .../admin-guide/perf/nvidia-tegra410-pmu.rst | 106 ++++++++++++++++++
> drivers/perf/arm_cspmu/nvidia_cspmu.c | 90 ++++++++++++++-
> 3 files changed, 196 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
>
> diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
> index c407bb44b08e..aa12708ddb96 100644
> --- a/Documentation/admin-guide/perf/index.rst
> +++ b/Documentation/admin-guide/perf/index.rst
> @@ -25,6 +25,7 @@ Performance monitor support
> alibaba_pmu
> dwc_pcie_pmu
> nvidia-tegra241-pmu
> + nvidia-tegra410-pmu
> meson-ddr-pmu
> cxl
> ampere_cspmu
> diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> new file mode 100644
> index 000000000000..7b7ba5700ca1
> --- /dev/null
> +++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> @@ -0,0 +1,106 @@
> +=====================================================================
> +NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
> +=====================================================================
> +
> +The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
> +metrics like memory bandwidth, latency, and utilization:
> +
> +* Unified Coherence Fabric (UCF)
> +
> +PMU Driver
> +----------
> +
> +The PMU driver describes the available events and configuration of each PMU in
> +sysfs; see the sections below for the sysfs path of each PMU. Like other
> +uncore PMU drivers, the driver provides a "cpumask" sysfs attribute showing
> +the CPU id used to handle the PMU events, and an "associated_cpus" sysfs
> +attribute containing the list of CPUs associated with the PMU instance.
> +
> +UCF PMU
> +-------
> +
> +The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
> +distributed last-level cache for CPU and CXL memory, and as a cache-coherent
> +interconnect providing hardware coherence across multiple coherently caching
> +agents, including:
> +
> + * CPU clusters
> + * GPU
> + * PCIe Ordering Controller Unit (OCU)
> + * Other IO-coherent requesters
> +
> +The events and configuration options of this PMU device are described in sysfs,
> +see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.
> +
> +Some of the events available in this PMU can be used to measure bandwidth and
> +utilization:
> +
> + * slc_access_rd: count the number of read requests to SLC.
> + * slc_access_wr: count the number of write requests to SLC.
> + * slc_bytes_rd: count the number of bytes transferred by slc_access_rd.
> + * slc_bytes_wr: count the number of bytes transferred by slc_access_wr.
> + * mem_access_rd: count the number of read requests to local or remote memory.
> + * mem_access_wr: count the number of write requests to local or remote memory.
> + * mem_bytes_rd: count the number of bytes transferred by mem_access_rd.
> + * mem_bytes_wr: count the number of bytes transferred by mem_access_wr.
> + * cycles: count the number of UCF cycles.
> +
> +The average bandwidth is calculated as::
> +
> + AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
> + AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
> + AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
> + AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS
> +
> +The average request rate is calculated as::
> +
> + AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
> + AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
> + AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
> + AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES
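
As a quick aside for readers, the arithmetic in the two formula blocks above can
be sketched in a few lines of Python; the counter values here are made up,
purely for illustration (bytes/ns is numerically equal to GB/s):

```python
# Hypothetical counter values from a 1-second perf stat run (illustrative only).
slc_bytes_rd = 12_000_000_000   # bytes counted by slc_bytes_rd
slc_access_rd = 500_000_000     # read requests counted by slc_access_rd
cycles = 2_000_000_000          # UCF cycles
elapsed_ns = 1_000_000_000      # elapsed time: 1 s in nanoseconds

# Average bandwidth: bytes / ns == GB/s.
avg_slc_read_bw_gbps = slc_bytes_rd / elapsed_ns

# Average request rate: requests per UCF cycle.
avg_slc_read_req_rate = slc_access_rd / cycles

print(avg_slc_read_bw_gbps)   # 12.0 GB/s
print(avg_slc_read_req_rate)  # 0.25 requests/cycle
```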
> +
> +More details about the other available events can be found in the Tegra410 SoC
> +Technical Reference Manual.
> +
> +The events can be filtered based on source or destination. The source filter
> +indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or
> +remote socket. The destination filter specifies the destination memory type,
> +e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The
> +local/remote classification of the destination filter is based on the home
> +socket of the address, not where the data actually resides. The available
> +filters are described in
> +/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.
> +
> +The list of UCF PMU event filters:
> +
> +* Source filter:
> +
> + * src_loc_cpu: if set, count events from local CPU
> + * src_loc_noncpu: if set, count events from local non-CPU device
> + * src_rem: if set, count events from CPU, GPU, and PCIe devices of the remote socket
> +
> +* Destination filter:
> +
> + * dst_loc_cmem: if set, count events to local system memory (CMEM) address
> + * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
> + * dst_loc_other: if set, count events to local CXL memory address
> + * dst_rem: if set, count events to CPU, GPU, and CXL memory address of remote socket
> +
> +If the source is not specified, the PMU will count events from all sources. If
> +the destination is not specified, the PMU will count events to all destinations.
> +
> +Example usage:
> +
> +* Count event id 0x0 in socket 0 from all sources and to all destinations::
> +
> + perf stat -a -e nvidia_ucf_pmu_0/event=0x0/
> +
> +* Count event id 0x0 in socket 0 with source filter = local CPU and destination
> + filter = local system memory (CMEM)::
> +
> + perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/
> +
> +* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
> + destination filter = remote memory::
> +
> + perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
> diff --git a/drivers/perf/arm_cspmu/nvidia_cspmu.c b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> index e06a06d3407b..c67667097a3c 100644
> --- a/drivers/perf/arm_cspmu/nvidia_cspmu.c
> +++ b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> @@ -1,6 +1,6 @@
> // SPDX-License-Identifier: GPL-2.0
> /*
> - * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> + * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> *
> */
>
> @@ -21,6 +21,13 @@
> #define NV_CNVL_PORT_COUNT 4ULL
> #define NV_CNVL_FILTER_ID_MASK GENMASK_ULL(NV_CNVL_PORT_COUNT - 1, 0)
>
> +#define NV_UCF_SRC_COUNT 3ULL
> +#define NV_UCF_DST_COUNT 4ULL
> +#define NV_UCF_FILTER_ID_MASK GENMASK_ULL(11, 0)
> +#define NV_UCF_FILTER_SRC GENMASK_ULL(2, 0)
> +#define NV_UCF_FILTER_DST GENMASK_ULL(11, 8)
> +#define NV_UCF_FILTER_DEFAULT (NV_UCF_FILTER_SRC | NV_UCF_FILTER_DST)
> +
> #define NV_GENERIC_FILTER_ID_MASK GENMASK_ULL(31, 0)
>
> #define NV_PRODID_MASK (PMIIDR_PRODUCTID | PMIIDR_VARIANT | PMIIDR_REVISION)
> @@ -124,6 +131,37 @@ static struct attribute *mcf_pmu_event_attrs[] = {
> NULL,
> };
>
> +static struct attribute *ucf_pmu_event_attrs[] = {
> + ARM_CSPMU_EVENT_ATTR(bus_cycles, 0x1D),
> +
> + ARM_CSPMU_EVENT_ATTR(slc_allocate, 0xF0),
> + ARM_CSPMU_EVENT_ATTR(slc_wb, 0xF3),
> + ARM_CSPMU_EVENT_ATTR(slc_refill_rd, 0x109),
> + ARM_CSPMU_EVENT_ATTR(slc_refill_wr, 0x10A),
> + ARM_CSPMU_EVENT_ATTR(slc_hit_rd, 0x119),
> +
> + ARM_CSPMU_EVENT_ATTR(slc_access_dataless, 0x183),
> + ARM_CSPMU_EVENT_ATTR(slc_access_atomic, 0x184),
> +
> + ARM_CSPMU_EVENT_ATTR(slc_access, 0xF2),
> + ARM_CSPMU_EVENT_ATTR(slc_access_rd, 0x111),
> + ARM_CSPMU_EVENT_ATTR(slc_access_wr, 0x112),
> + ARM_CSPMU_EVENT_ATTR(slc_bytes_rd, 0x113),
> + ARM_CSPMU_EVENT_ATTR(slc_bytes_wr, 0x114),
> +
> + ARM_CSPMU_EVENT_ATTR(mem_access_rd, 0x121),
> + ARM_CSPMU_EVENT_ATTR(mem_access_wr, 0x122),
> + ARM_CSPMU_EVENT_ATTR(mem_bytes_rd, 0x123),
> + ARM_CSPMU_EVENT_ATTR(mem_bytes_wr, 0x124),
> +
> + ARM_CSPMU_EVENT_ATTR(local_snoop, 0x180),
> + ARM_CSPMU_EVENT_ATTR(ext_snp_access, 0x181),
> + ARM_CSPMU_EVENT_ATTR(ext_snp_evict, 0x182),
> +
> + ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
> + NULL,
> +};
> +
> static struct attribute *generic_pmu_event_attrs[] = {
> ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
> NULL,
> @@ -152,6 +190,18 @@ static struct attribute *cnvlink_pmu_format_attrs[] = {
> NULL,
> };
>
> +static struct attribute *ucf_pmu_format_attrs[] = {
> + ARM_CSPMU_FORMAT_EVENT_ATTR,
> + ARM_CSPMU_FORMAT_ATTR(src_loc_noncpu, "config1:0"),
> + ARM_CSPMU_FORMAT_ATTR(src_loc_cpu, "config1:1"),
> + ARM_CSPMU_FORMAT_ATTR(src_rem, "config1:2"),
> + ARM_CSPMU_FORMAT_ATTR(dst_loc_cmem, "config1:8"),
> + ARM_CSPMU_FORMAT_ATTR(dst_loc_gmem, "config1:9"),
> + ARM_CSPMU_FORMAT_ATTR(dst_loc_other, "config1:10"),
> + ARM_CSPMU_FORMAT_ATTR(dst_rem, "config1:11"),
> + NULL,
> +};
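
For readers following along: given the bit positions in the format attributes
above, the config1 value perf encodes for, say, src_loc_cpu + dst_loc_cmem can
be computed by hand. A small Python sketch (the constant names below just
mirror the format attributes, they are not driver identifiers):

```python
# Bit positions taken from the config1 format attributes above.
SRC_LOC_NONCPU = 1 << 0   # config1:0
SRC_LOC_CPU    = 1 << 1   # config1:1
SRC_REM        = 1 << 2   # config1:2
DST_LOC_CMEM   = 1 << 8   # config1:8
DST_LOC_GMEM   = 1 << 9   # config1:9
DST_LOC_OTHER  = 1 << 10  # config1:10
DST_REM        = 1 << 11  # config1:11

# src_loc_cpu=0x1,dst_loc_cmem=0x1 as in the documented perf examples.
config1 = SRC_LOC_CPU | DST_LOC_CMEM
print(hex(config1))  # 0x102
```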
> +
> static struct attribute *generic_pmu_format_attrs[] = {
> ARM_CSPMU_FORMAT_EVENT_ATTR,
> ARM_CSPMU_FORMAT_FILTER_ATTR,
> @@ -236,6 +286,27 @@ static void nv_cspmu_set_cc_filter(struct arm_cspmu *cspmu,
> writel(filter, cspmu->base0 + PMCCFILTR);
> }
>
> +static u32 ucf_pmu_event_filter(const struct perf_event *event)
> +{
> + u32 ret, filter, src, dst;
> +
> + filter = nv_cspmu_event_filter(event);
> +
> + /* Monitor all sources if none is selected. */
> + src = FIELD_GET(NV_UCF_FILTER_SRC, filter);
> + if (src == 0)
> + src = GENMASK_ULL(NV_UCF_SRC_COUNT - 1, 0);
> +
> + /* Monitor all destinations if none is selected. */
> + dst = FIELD_GET(NV_UCF_FILTER_DST, filter);
> + if (dst == 0)
> + dst = GENMASK_ULL(NV_UCF_DST_COUNT - 1, 0);
> +
> + ret = FIELD_PREP(NV_UCF_FILTER_SRC, src);
> + ret |= FIELD_PREP(NV_UCF_FILTER_DST, dst);
> +
> + return ret;
> +}
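
The default-filter expansion above can be modelled in a few lines of Python
(masks copied from the NV_UCF_FILTER_* definitions earlier in the patch; this
is only a sketch of the logic, not the driver code):

```python
SRC_MASK = 0x007   # NV_UCF_FILTER_SRC, bits 2:0 (3 sources)
DST_MASK = 0xF00   # NV_UCF_FILTER_DST, bits 11:8 (4 destinations)

def ucf_filter(raw):
    """Expand an empty source/destination field to 'match all', like
    ucf_pmu_event_filter() does with FIELD_GET/FIELD_PREP."""
    src = raw & SRC_MASK
    dst = (raw & DST_MASK) >> 8
    if src == 0:
        src = 0x7          # monitor all sources
    if dst == 0:
        dst = 0xF          # monitor all destinations
    return src | (dst << 8)

print(hex(ucf_filter(0x0)))    # 0xf07: all sources, all destinations
print(hex(ucf_filter(0x102)))  # 0x102: explicit filter passes through unchanged
```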
>
> enum nv_cspmu_name_fmt {
> NAME_FMT_GENERIC,
> @@ -342,6 +413,23 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
> .init_data = NULL
> },
> },
> + {
> + .prodid = 0x2CF20000,
> + .prodid_mask = NV_PRODID_MASK,
> + .name_pattern = "nvidia_ucf_pmu_%u",
> + .name_fmt = NAME_FMT_SOCKET,
> + .template_ctx = {
> + .event_attr = ucf_pmu_event_attrs,
> + .format_attr = ucf_pmu_format_attrs,
> + .filter_mask = NV_UCF_FILTER_ID_MASK,
> + .filter_default_val = NV_UCF_FILTER_DEFAULT,
> + .filter2_mask = 0x0,
> + .filter2_default_val = 0x0,
> + .get_filter = ucf_pmu_event_filter,
> + .get_filter2 = NULL,
> + .init_data = NULL
> + },
> + },
> {
> .prodid = 0,
> .prodid_mask = 0,
> --
> 2.43.0
>
>