Message-ID: <d72fc2bd-6200-1ee6-31d3-d81d7b669453@os.amperecomputing.com>
Date: Thu, 29 Jan 2026 14:20:03 -0800 (PST)
From: Ilkka Koskinen <ilkka@...amperecomputing.com>
To: Besar Wicaksono <bwicaksono@...dia.com>
cc: will@...nel.org, suzuki.poulose@....com, robin.murphy@....com,
ilkka@...amperecomputing.com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-tegra@...r.kernel.org,
mark.rutland@....com, treding@...dia.com, jonathanh@...dia.com,
vsethi@...dia.com, rwiley@...dia.com, sdonthineni@...dia.com,
skelley@...dia.com, ywan@...dia.com, mochs@...dia.com, nirmoyd@...dia.com
Subject: Re: [PATCH 2/8] perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU
Hi Besar,
On Mon, 26 Jan 2026, Besar Wicaksono wrote:
> Add Unified Coherence Fabric PMU support to the Tegra410 SoC.
>
> Signed-off-by: Besar Wicaksono <bwicaksono@...dia.com>
Looks good to me
Reviewed-by: Ilkka Koskinen <ilkka@...amperecomputing.com>
--Ilkka
> ---
> Documentation/admin-guide/perf/index.rst | 1 +
> .../admin-guide/perf/nvidia-tegra410-pmu.rst | 106 ++++++++++++++++++
> drivers/perf/arm_cspmu/nvidia_cspmu.c | 90 ++++++++++++++-
> 3 files changed, 196 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
>
> diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
> index c407bb44b08e..aa12708ddb96 100644
> --- a/Documentation/admin-guide/perf/index.rst
> +++ b/Documentation/admin-guide/perf/index.rst
> @@ -25,6 +25,7 @@ Performance monitor support
> alibaba_pmu
> dwc_pcie_pmu
> nvidia-tegra241-pmu
> + nvidia-tegra410-pmu
> meson-ddr-pmu
> cxl
> ampere_cspmu
> diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> new file mode 100644
> index 000000000000..7b7ba5700ca1
> --- /dev/null
> +++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> @@ -0,0 +1,106 @@
> +=====================================================================
> +NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
> +=====================================================================
> +
> +The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
> +metrics like memory bandwidth, latency, and utilization:
> +
> +* Unified Coherence Fabric (UCF)
> +
> +PMU Driver
> +----------
> +
> +The PMU driver describes the available events and configuration of each PMU in
> +sysfs; see the sections below for the sysfs path of each PMU. Like other
> +uncore PMU drivers, the driver provides a "cpumask" sysfs attribute showing
> +the CPU id used to handle the PMU events, and an "associated_cpus" sysfs
> +attribute containing the list of CPUs associated with the PMU instance.
> +
> +UCF PMU
> +-------
> +
> +The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
> +distributed last-level cache for CPU and CXL memory, and as a cache-coherent
> +interconnect providing hardware coherence across multiple coherently caching
> +agents, including:
> +
> + * CPU clusters
> + * GPU
> + * PCIe Ordering Controller Unit (OCU)
> + * Other IO-coherent requesters
> +
> +The events and configuration options of this PMU device are described in sysfs,
> +see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.
> +
> +Some of the events available in this PMU can be used to measure bandwidth and
> +utilization:
> +
> + * slc_access_rd: count the number of read requests to SLC.
> + * slc_access_wr: count the number of write requests to SLC.
> + * slc_bytes_rd: count the number of bytes transferred by slc_access_rd.
> + * slc_bytes_wr: count the number of bytes transferred by slc_access_wr.
> + * mem_access_rd: count the number of read requests to local or remote memory.
> + * mem_access_wr: count the number of write requests to local or remote memory.
> + * mem_bytes_rd: count the number of bytes transferred by mem_access_rd.
> + * mem_bytes_wr: count the number of bytes transferred by mem_access_wr.
> + * cycles: count the number of UCF cycles.
> +
> +The average bandwidth is calculated as::
> +
> + AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
> + AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
> + AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
> + AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS
> +
> +The average request rate is calculated as::
> +
> + AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
> + AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
> + AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
> + AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES
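
As a quick aside for readers, the arithmetic in the two formula blocks above can
be sketched in a few lines of Python; the counter values here are made up,
purely for illustration (bytes/ns is numerically equal to GB/s):

```python
# Hypothetical counter values from a 1-second perf stat run (illustrative only).
slc_bytes_rd = 12_000_000_000   # bytes counted by slc_bytes_rd
slc_access_rd = 500_000_000     # read requests counted by slc_access_rd
cycles = 2_000_000_000          # UCF cycles
elapsed_ns = 1_000_000_000      # elapsed time: 1 s in nanoseconds

# Average bandwidth: bytes / ns == GB/s.
avg_slc_read_bw_gbps = slc_bytes_rd / elapsed_ns

# Average request rate: requests per UCF cycle.
avg_slc_read_req_rate = slc_access_rd / cycles

print(avg_slc_read_bw_gbps)   # 12.0 GB/s
print(avg_slc_read_req_rate)  # 0.25 requests/cycle
```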
> +
> +More details about the other available events can be found in the Tegra410 SoC
> +Technical Reference Manual.
> +
> +The events can be filtered based on source or destination. The source filter
> +indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or
> +remote socket. The destination filter specifies the destination memory type,
> +e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The
> +local/remote classification of the destination filter is based on the home
> +socket of the address, not where the data actually resides. The available
> +filters are described in
> +/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.
> +
> +The list of UCF PMU event filters:
> +
> +* Source filter:
> +
> + * src_loc_cpu: if set, count events from local CPU
> + * src_loc_noncpu: if set, count events from local non-CPU device
> + * src_rem: if set, count events from CPU, GPU, and PCIe devices of the remote socket
> +
> +* Destination filter:
> +
> + * dst_loc_cmem: if set, count events to local system memory (CMEM) address
> + * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
> + * dst_loc_other: if set, count events to local CXL memory address
> + * dst_rem: if set, count events to CPU, GPU, and CXL memory address of remote socket
> +
> +If the source is not specified, the PMU will count events from all sources. If
> +the destination is not specified, the PMU will count events to all destinations.
> +
> +Example usage:
> +
> +* Count event id 0x0 in socket 0 from all sources and to all destinations::
> +
> + perf stat -a -e nvidia_ucf_pmu_0/event=0x0/
> +
> +* Count event id 0x0 in socket 0 with source filter = local CPU and destination
> + filter = local system memory (CMEM)::
> +
> + perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/
> +
> +* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
> + destination filter = remote memory::
> +
> + perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
> diff --git a/drivers/perf/arm_cspmu/nvidia_cspmu.c b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> index e06a06d3407b..c67667097a3c 100644
> --- a/drivers/perf/arm_cspmu/nvidia_cspmu.c
> +++ b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> @@ -1,6 +1,6 @@
> // SPDX-License-Identifier: GPL-2.0
> /*
> - * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> + * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> *
> */
>
> @@ -21,6 +21,13 @@
> #define NV_CNVL_PORT_COUNT 4ULL
> #define NV_CNVL_FILTER_ID_MASK GENMASK_ULL(NV_CNVL_PORT_COUNT - 1, 0)
>
> +#define NV_UCF_SRC_COUNT 3ULL
> +#define NV_UCF_DST_COUNT 4ULL
> +#define NV_UCF_FILTER_ID_MASK GENMASK_ULL(11, 0)
> +#define NV_UCF_FILTER_SRC GENMASK_ULL(2, 0)
> +#define NV_UCF_FILTER_DST GENMASK_ULL(11, 8)
> +#define NV_UCF_FILTER_DEFAULT (NV_UCF_FILTER_SRC | NV_UCF_FILTER_DST)
> +
> #define NV_GENERIC_FILTER_ID_MASK GENMASK_ULL(31, 0)
>
> #define NV_PRODID_MASK (PMIIDR_PRODUCTID | PMIIDR_VARIANT | PMIIDR_REVISION)
> @@ -124,6 +131,37 @@ static struct attribute *mcf_pmu_event_attrs[] = {
> NULL,
> };
>
> +static struct attribute *ucf_pmu_event_attrs[] = {
> + ARM_CSPMU_EVENT_ATTR(bus_cycles, 0x1D),
> +
> + ARM_CSPMU_EVENT_ATTR(slc_allocate, 0xF0),
> + ARM_CSPMU_EVENT_ATTR(slc_wb, 0xF3),
> + ARM_CSPMU_EVENT_ATTR(slc_refill_rd, 0x109),
> + ARM_CSPMU_EVENT_ATTR(slc_refill_wr, 0x10A),
> + ARM_CSPMU_EVENT_ATTR(slc_hit_rd, 0x119),
> +
> + ARM_CSPMU_EVENT_ATTR(slc_access_dataless, 0x183),
> + ARM_CSPMU_EVENT_ATTR(slc_access_atomic, 0x184),
> +
> + ARM_CSPMU_EVENT_ATTR(slc_access, 0xF2),
> + ARM_CSPMU_EVENT_ATTR(slc_access_rd, 0x111),
> + ARM_CSPMU_EVENT_ATTR(slc_access_wr, 0x112),
> + ARM_CSPMU_EVENT_ATTR(slc_bytes_rd, 0x113),
> + ARM_CSPMU_EVENT_ATTR(slc_bytes_wr, 0x114),
> +
> + ARM_CSPMU_EVENT_ATTR(mem_access_rd, 0x121),
> + ARM_CSPMU_EVENT_ATTR(mem_access_wr, 0x122),
> + ARM_CSPMU_EVENT_ATTR(mem_bytes_rd, 0x123),
> + ARM_CSPMU_EVENT_ATTR(mem_bytes_wr, 0x124),
> +
> + ARM_CSPMU_EVENT_ATTR(local_snoop, 0x180),
> + ARM_CSPMU_EVENT_ATTR(ext_snp_access, 0x181),
> + ARM_CSPMU_EVENT_ATTR(ext_snp_evict, 0x182),
> +
> + ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
> + NULL,
> +};
> +
> static struct attribute *generic_pmu_event_attrs[] = {
> ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
> NULL,
> @@ -152,6 +190,18 @@ static struct attribute *cnvlink_pmu_format_attrs[] = {
> NULL,
> };
>
> +static struct attribute *ucf_pmu_format_attrs[] = {
> + ARM_CSPMU_FORMAT_EVENT_ATTR,
> + ARM_CSPMU_FORMAT_ATTR(src_loc_noncpu, "config1:0"),
> + ARM_CSPMU_FORMAT_ATTR(src_loc_cpu, "config1:1"),
> + ARM_CSPMU_FORMAT_ATTR(src_rem, "config1:2"),
> + ARM_CSPMU_FORMAT_ATTR(dst_loc_cmem, "config1:8"),
> + ARM_CSPMU_FORMAT_ATTR(dst_loc_gmem, "config1:9"),
> + ARM_CSPMU_FORMAT_ATTR(dst_loc_other, "config1:10"),
> + ARM_CSPMU_FORMAT_ATTR(dst_rem, "config1:11"),
> + NULL,
> +};
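
For readers following along: given the bit positions in the format attributes
above, the config1 value perf encodes for, say, src_loc_cpu + dst_loc_cmem can
be computed by hand. A small Python sketch (the constant names below just
mirror the format attributes, they are not driver identifiers):

```python
# Bit positions taken from the config1 format attributes above.
SRC_LOC_NONCPU = 1 << 0   # config1:0
SRC_LOC_CPU    = 1 << 1   # config1:1
SRC_REM        = 1 << 2   # config1:2
DST_LOC_CMEM   = 1 << 8   # config1:8
DST_LOC_GMEM   = 1 << 9   # config1:9
DST_LOC_OTHER  = 1 << 10  # config1:10
DST_REM        = 1 << 11  # config1:11

# src_loc_cpu=0x1,dst_loc_cmem=0x1 as in the documented perf examples.
config1 = SRC_LOC_CPU | DST_LOC_CMEM
print(hex(config1))  # 0x102
```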
> +
> static struct attribute *generic_pmu_format_attrs[] = {
> ARM_CSPMU_FORMAT_EVENT_ATTR,
> ARM_CSPMU_FORMAT_FILTER_ATTR,
> @@ -236,6 +286,27 @@ static void nv_cspmu_set_cc_filter(struct arm_cspmu *cspmu,
> writel(filter, cspmu->base0 + PMCCFILTR);
> }
>
> +static u32 ucf_pmu_event_filter(const struct perf_event *event)
> +{
> + u32 ret, filter, src, dst;
> +
> + filter = nv_cspmu_event_filter(event);
> +
> + /* Monitor all sources if none is selected. */
> + src = FIELD_GET(NV_UCF_FILTER_SRC, filter);
> + if (src == 0)
> + src = GENMASK_ULL(NV_UCF_SRC_COUNT - 1, 0);
> +
> + /* Monitor all destinations if none is selected. */
> + dst = FIELD_GET(NV_UCF_FILTER_DST, filter);
> + if (dst == 0)
> + dst = GENMASK_ULL(NV_UCF_DST_COUNT - 1, 0);
> +
> + ret = FIELD_PREP(NV_UCF_FILTER_SRC, src);
> + ret |= FIELD_PREP(NV_UCF_FILTER_DST, dst);
> +
> + return ret;
> +}
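
The default-filter expansion above can be modelled in a few lines of Python
(masks copied from the NV_UCF_FILTER_* definitions earlier in the patch; this
is only a sketch of the logic, not the driver code):

```python
SRC_MASK = 0x007   # NV_UCF_FILTER_SRC, bits 2:0 (3 sources)
DST_MASK = 0xF00   # NV_UCF_FILTER_DST, bits 11:8 (4 destinations)

def ucf_filter(raw):
    """Expand an empty source/destination field to 'match all', like
    ucf_pmu_event_filter() does with FIELD_GET/FIELD_PREP."""
    src = raw & SRC_MASK
    dst = (raw & DST_MASK) >> 8
    if src == 0:
        src = 0x7          # monitor all sources
    if dst == 0:
        dst = 0xF          # monitor all destinations
    return src | (dst << 8)

print(hex(ucf_filter(0x0)))    # 0xf07: all sources, all destinations
print(hex(ucf_filter(0x102)))  # 0x102: explicit filter passes through unchanged
```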
>
> enum nv_cspmu_name_fmt {
> NAME_FMT_GENERIC,
> @@ -342,6 +413,23 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
> .init_data = NULL
> },
> },
> + {
> + .prodid = 0x2CF20000,
> + .prodid_mask = NV_PRODID_MASK,
> + .name_pattern = "nvidia_ucf_pmu_%u",
> + .name_fmt = NAME_FMT_SOCKET,
> + .template_ctx = {
> + .event_attr = ucf_pmu_event_attrs,
> + .format_attr = ucf_pmu_format_attrs,
> + .filter_mask = NV_UCF_FILTER_ID_MASK,
> + .filter_default_val = NV_UCF_FILTER_DEFAULT,
> + .filter2_mask = 0x0,
> + .filter2_default_val = 0x0,
> + .get_filter = ucf_pmu_event_filter,
> + .get_filter2 = NULL,
> + .init_data = NULL
> + },
> + },
> {
> .prodid = 0,
> .prodid_mask = 0,
> --
> 2.43.0
>
>