lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220922165820.000017b6@huawei.com>
Date:   Thu, 22 Sep 2022 16:58:20 +0100
From:   Jonathan Cameron <Jonathan.Cameron@...wei.com>
To:     Shuai Xue <xueshuai@...ux.alibaba.com>
CC:     <will@...nel.org>, <linux-arm-kernel@...ts.infradead.org>,
        <linux-kernel@...r.kernel.org>, <rdunlap@...radead.org>,
        <robin.murphy@....com>, <mark.rutland@....com>,
        <baolin.wang@...ux.alibaba.com>, <zhuo.song@...ux.alibaba.com>,
        <linux-pci@...r.kernel.org>, Bjorn Helgaas <helgaas@...nel.org>
Subject: Re: [PATCH v1 2/3] drivers/perf: add DesignWare PCIe PMU driver

On Sat, 17 Sep 2022 20:10:35 +0800
Shuai Xue <xueshuai@...ux.alibaba.com> wrote:

> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
> Core controller IP which provides statistics feature. The PMU is not a PCIe
> Root Complex integrated End Point(RCiEP) device but only register counters
> provided by each PCIe Root Port.
> 
> To facilitate collection of statistics the controller provides the
> following two features for each Root Port:
> 
> - Time Based Analysis (RX/TX data throughput and time spent in each
>   low-power LTSSM state)
> - Event counters (Error and Non-Error for lanes)
> 
> Note, only one counter for each type.
> 
> This driver add PMU devices for each PCIe Root Port. And the PMU device is
> named based the BDF of Root Port. For example,
> 
>     10:00.0 PCI bridge: Device 1ded:8000 (rev 01)
> 
> the PMU device name for this Root Port is pcie_bdf_100000.
> 
> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
> 
>     $# perf stat -a -e pcie_bdf_200/Rx_PCIe_TLP_Data_Payload/
> 
> average RX bandwidth can be calculated like this:
> 
>     PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
> 
> Signed-off-by: Shuai Xue <xueshuai@...ux.alibaba.com>

+CC linux-pci list and Bjorn.

Question in here which I've been meaning to address for other reasons
around how to register 'extra features' on pci ports.

This particular PMU is in config space in a Vendor Specific Extended
Capability.

I've focused on that aspect for this review rather than the perf parts.
We'll need to figure that story out first as doing this from a bus walk
makes triggered of a platform driver is not the way I'd expect to see
this work.


> diff --git a/drivers/perf/dwc_pcie_pmu.c b/drivers/perf/dwc_pcie_pmu.c
> new file mode 100644
> index 000000000000..81e534be13fa
> --- /dev/null
> +++ b/drivers/perf/dwc_pcie_pmu.c
> @@ -0,0 +1,976 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Synopsys DesignWare PCIe PMU driver
> + *
> + * Copyright (C) 2021, 2022 Alibaba Inc.
> + */
> + 
> +#include <linux/pci.h>
> +#include <linux/bitfield.h>
> +#include <linux/bitops.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/cpumask.h>
> +#include <linux/device.h>
> +#include <linux/errno.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/perf_event.h>
> +#include <linux/platform_device.h>
> +#include <linux/smp.h>
> +#include <linux/sysfs.h>
> +#include <linux/types.h>
> +
> +#define DRV_NAME				"dwc_pcie_pmu"
> +#define DEV_NAME				"dwc_pcie_pmu"
Put these strings where they are used.  That's where people will look for them...

> +#define RP_NUM_MAX				32 /* 2die * 4RC * 4Ctrol */

This driver is 'almost' generic. So if you an avoid defines based on a particular
platform that's definitely good!

> +#define ATTRI_NAME_MAX_SIZE			32
> +
> +#define DWC_PCIE_VSEC_ID			0x02
> +#define DWC_PCIE_VSEC_REV			0x04

I wouldn't define the REV like this. Put the number inline so we
can clearly see this is revision 4.  VSEC_ID won't change so a
define for that is fine.


> +
> +#define DWC_PCIE_LINK_CAPABILITIES_REG		0xC
This is PCIE spec defined.  Put these in a common header.
> +#define DWC_PCIE_LANE_SHIFT			4
> +#define DWC_PCIE_LANE_MASK			GENMASK(9, 4)
> +
> +#define DWC_PCIE_EVENT_CNT_CTRL			0x8
> +#define DWC_PCIE__CNT_EVENT_SELECT_SHIFT	16

Why double __?  If point is to separate register from fields, then
naming works better
DWC_PCIE_EVENT_CNT_CTRL_REG
DWC_PCIE_EVENT_CNT_CTRL_EV_SELECT_MSK etc


> +#define DWC_PCIE__CNT_EVENT_SELECT_MASK		GENMASK(27, 16)
> +#define DWC_PCIE__CNT_LANE_SELECT_SHIFT		8
> +#define DWC_PCIE__CNT_LANE_SELECT_MASK		GENMASK(11, 8)
> +#define DWC_PCIE__CNT_STATUS_SHIFT		7
> +#define DWC_PCIE__CNT_STATUS_MASK		BIT(7)
> +#define DWC_PCIE__CNT_ENABLE_SHIFT		2

With FIELD_PREP() / FIELD_GET() you should never need to define the shifts.
They will be extracted from the masks as needed.

> +#define DWC_PCIE__CNT_ENABLE_MASK		GENMASK(4, 2)
> +#define DWC_PCIE_PER_EVENT_OFF			(0x1 << DWC_PCIE__CNT_ENABLE_SHIFT)
FIELD_PREP() / FIELD_GET() combined with defines for the values.

#define DWC_PCIE_CNT_ENABLE_MASK ...

> +#define DWC_PCIE_PER_EVENT_ON			(0x3 << DWC_PCIE__CNT_ENABLE_SHIFT)
> +#define DWC_PCIE_EVENT_CLEAR_MASK		GENMASK(1, 0)
> +
> +#define DWC_PCIE_EVENT_CNT_DATA			0xC
> +
> +#define DWC_PCIE_TIME_BASED_ANALYSIS_CTRL	0x10
> +#define DWC_PCIE__TIME_BASED_REPORT_SELECT_SHIFT	24
> +#define DWC_PCIE__TIME_BASED_REPORT_SELECT_MASK	GENMASK(31, 24)
> +#define DWC_PCIE__TIME_BASED_DURATION_SHIFT	8
> +#define DWC_PCIE__TIME_BASED_DURATION_SELECT	GENMASK(15, 8)
> +#define DWC_PCIE_DURATION_MANUAL_CTRL		0x0
> +#define DWC_PCIE_DURATION_1MS			0x1
> +#define DWC_PCIE_DURATION_10MS			0x2
> +#define DWC_PCIE_DURATION_100MS			0x3
> +#define DWC_PCIE_DURATION_1S			0x4
> +#define DWC_PCIE_DURATION_2S			0x5
> +#define DWC_PCIE_DURATION_4S			0x6
> +#define DWC_PCIE_DURATION_4US			0xff
> +#define DWC_PCIE__TIME_BASED_COUNTER_ENABLE	1
> +
> +#define DWC_PCIE_TIME_BASED_ANALYSIS_DATA_REG_LOW	0x14
> +#define DWC_PCIE_TIME_BASED_ANALYSIS_DATA_REG_HIGH	0x18
> +
> +/* Event attributes */
> +#define DWC_PCIE_CONFIG_EVENTID			GENMASK(15, 0)
> +#define DWC_PCIE_CONFIG_TYPE			GENMASK(19, 16)
> +#define DWC_PCIE_CONFIG_LANE			GENMASK(27, 20)
> +
> +#define DWC_PCIE_EVENT_ID(event)	FIELD_GET(DWC_PCIE_CONFIG_EVENTID, (event)->attr.config)
> +#define DWC_PCIE_EVENT_TYPE(event)	FIELD_GET(DWC_PCIE_CONFIG_TYPE, (event)->attr.config)
> +#define DWC_PCIE_EVENT_LANE(event)	FIELD_GET(DWC_PCIE_CONFIG_LANE, (event)->attr.config)
> +
> +#define DWC_PCIE_PMU_HAS_REGISTER		1
> +
> +enum dwc_pcie_event_type {
> +	DWC_PCIE_TYPE_INVALID,
> +	DWC_PCIE_TIME_BASE_EVENT,
> +	DWC_PCIE_LANE_EVENT,
> +};
> +
> +struct dwc_event_counters {
> +	const char name[32];
> +	u32 event_id;
> +};
> +
> +struct dwc_pcie_pmu {
> +	struct hlist_node node;
> +	unsigned int on_cpu;
> +	struct pmu pmu;
> +	struct device *dev;
> +};
> +
> +struct dwc_pcie_info_table {
> +	u32 bdf;
> +	u32 cap_pos;
> +	u32 num_lanes;
> +	struct pci_dev *pdev;
> +	struct dwc_pcie_pmu pcie_pmu;
> +	u8 pmu_is_register;
> +	struct perf_event *event;
> +
> +	struct dwc_pcie_event_attr *lane_event_attrs;
> +	struct attribute **pcie_pmu_event_attrs;
> +	struct attribute_group pcie_pmu_event_attrs_group;
> +	const struct attribute_group *pcie_pmu_attr_groups[4];
> +};
> +
> +struct dwc_pcie_pmu_priv {
> +	struct device *dev;
> +	u32 pcie_ctrl_num;
> +	struct dwc_pcie_info_table *pcie_table;
> +};
> +
> +#define DWC_PCIE_CREATE_BDF(seg, bus, dev, func)	\
> +	(((seg) << 24) | (((bus) & 0xFF) << 16) | (((dev) & 0xFF) << 8) | (func))

Superficially this looks pretty standard.  Why is is DWC specific?

> +#define to_pcie_pmu(p) (container_of(p, struct dwc_pcie_pmu, pmu))

Prefix that name.  I'm hopeful we'll have a PCI SIG defined PMU one
day and when we do that macro belongs to that!
to_dwc_pcie_pmu() is possibly fine.


> +
> +static struct platform_device *dwc_pcie_pmu_dev;
> +static char *event_attr_name = "events";
> +


...

> +
> +static int dwc_pcie_find_ras_des_cap_position(struct pci_dev *pdev, int *pos)
> +{
> +	u32 header;
> +	int vsec = 0;
> +
> +	while ((vsec = pci_find_next_ext_capability(pdev, vsec,
> +						    PCI_EXT_CAP_ID_VNDR))) {

This probably belongs in the PCI core in a similar fashion to the DVSEC
helper.

> +		pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &header);
> +		/* Is the device part of a DesignWare Cores PCIe Controller ? */

Good question... This code doesn't check that.  VSEC ID is matched only with
the Vendor ID of the devices - unlike DVSEC where this would all be nice
and local.

> +		if (PCI_VNDR_HEADER_ID(header) == DWC_PCIE_VSEC_ID &&
> +		    PCI_VNDR_HEADER_REV(header) == DWC_PCIE_VSEC_REV) {
> +			*pos = vsec;
> +			return 0;
> +		}
> +	}
> +
> +	return -ENODEV;
> +}
> +
> +static int dwc_pcie_pmu_discover(struct dwc_pcie_pmu_priv *priv)
> +{
> +	int val, where, index = 0;
> +	struct pci_dev *pdev = NULL;
> +	struct dwc_pcie_info_table *pcie_info;
> +
> +	priv->pcie_table =
> +	    devm_kcalloc(priv->dev, RP_NUM_MAX, sizeof(*pcie_info), GFP_KERNEL);
> +	if (!priv->pcie_table)
> +		return -EINVAL;
> +
> +	pcie_info = priv->pcie_table;
> +	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev)) != NULL &&
> +	       index < RP_NUM_MAX) {

This having a driver than then walks the pci topology to find root ports and add
extra stuff to them is not a clean solution.

The probing should be driven from the existing PCI driver topology.
There are a bunch of new features we need to add to ports in the near future
anyway - this would just be another one.
Same problem exists for CXL CPMU perf devices - so far we only support those
on end points, partly because we need a clean way to probe them on pci ports.

Whatever we come up with there will apply here as well.


> +		if (!pci_dev_is_rootport(pdev))
> +			continue;
> +
> +		pcie_info[index].bdf = dwc_pcie_get_bdf(pdev);
> +		pcie_info[index].pdev = pdev;
Probably want a sanity check this has a vendor ID appropriate the VSEC you are about
to look for.

> +
> +		if (dwc_pcie_find_ras_des_cap_position(pdev, &where))
> +			continue;
> +
> +		pcie_info[index].cap_pos = where;
> +
> +		pci_read_config_dword(pdev,
> +				pdev->pcie_cap + DWC_PCIE_LINK_CAPABILITIES_REG,
> +				&val);
> +		pcie_info[index].num_lanes =
> +			(val & DWC_PCIE_LANE_MASK) >> DWC_PCIE_LANE_SHIFT;

FIELD_GET()

> +		index++;
> +	}
> +
> +	if (!index)
> +		return -ENODEV;
> +
> +	priv->pcie_ctrl_num = index;
> +
> +	return 0;
> +}
> +
> +static inline int dwc_pcie_pmu_read_dword(struct dwc_pcie_info_table *pcie_info,
> +					  u32 reg, u32 *val)
> +{
> +	return pci_read_config_dword(pcie_info->pdev, pcie_info->cap_pos + reg,
> +				     val);
> +}
> +
> +static inline int dwc_pcie_pmu_write_dword(struct dwc_pcie_info_table
> +					   *pcie_info, u32 reg, u32 val)
> +{
> +	return pci_write_config_dword(pcie_info->pdev, pcie_info->cap_pos + reg,
> +				      val);
> +}

These two wrappers don't add a lot so I would drop them.

> +
> +static int dwc_pcie_pmu_set_event_id(struct dwc_pcie_info_table *pcie_info,
> +				     int event_id)
> +{
> +	int ret;
> +	u32 val;
> +
> +	ret = dwc_pcie_pmu_read_dword(pcie_info, DWC_PCIE_EVENT_CNT_CTRL, &val);
> +	if (ret) {
> +		pci_err(pcie_info->pdev, "PCIe read fail\n");
> +		return ret;
> +	}
> +
> +	val &= ~DWC_PCIE__CNT_ENABLE_MASK;
> +	val &= ~DWC_PCIE__CNT_EVENT_SELECT_MASK;
> +	val |= event_id << DWC_PCIE__CNT_EVENT_SELECT_SHIFT;

FIELD_PREP()

> +
> +	ret = dwc_pcie_pmu_write_dword(pcie_info, DWC_PCIE_EVENT_CNT_CTRL, val);
> +	if (ret)
> +		pci_err(pcie_info->pdev, "PCIe write fail\n");
> +
> +	return ret;
> +}

...

> +
> +static int dwc_pcie_pmu_read_base_time_counter(struct dwc_pcie_info_table
> +					       *pcie_info, u64 *counter)
> +{
> +	u32 ret, val;
> +
> +	ret = dwc_pcie_pmu_read_dword(pcie_info,
> +				      DWC_PCIE_TIME_BASED_ANALYSIS_DATA_REG_HIGH,
> +				      &val);
> +	if (ret) {
> +		pci_err(pcie_info->pdev, "PCIe read fail\n");
> +		return ret;
> +	}
> +
> +	*counter = val;
> +	*counter <<= 32;

This looks like you could get ripping between the upper and lower dwords.
What prevents that? Perhaps a comment to say why that's not a problem?

> +
> +	ret = dwc_pcie_pmu_read_dword(pcie_info,
> +				      DWC_PCIE_TIME_BASED_ANALYSIS_DATA_REG_LOW,
> +				      &val);
> +	if (ret) {
> +		pci_err(pcie_info->pdev, "PCIe read fail\n");
> +		return ret;
> +	}
> +
> +	*counter += val;
> +
> +	return ret;
> +}
...


> +static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv,
> +				struct dwc_pcie_info_table *pcie_info)
> +{
> +	int ret;
> +	char *name;
> +	struct dwc_pcie_pmu *pcie_pmu;
> +	struct device *dev;
> +
> +	if (!pcie_info || !pcie_info->pdev) {
> +		pci_err(pcie_info->pdev, "Input parameter is invalid\n");
> +		return -EINVAL;
> +	}
> +
> +	pcie_pmu = &pcie_info->pcie_pmu;
> +	dev = &pcie_info->pdev->dev;
> +
> +	ret = dwc_pcie_pmu_attr_init(priv, pcie_info);
> +	if (ret) {
> +		pci_err(pcie_info->pdev, "PMU attr init fail ret=%d\n", ret);
> +		return ret;
> +	}
> +
> +	pcie_pmu->dev = dev;
> +	pcie_pmu->pmu = (struct pmu) {
> +		.module		= THIS_MODULE,
> +		.task_ctx_nr	= perf_invalid_context,
> +		.pmu_enable	= NULL,
> +		.pmu_disable	= NULL,
> +		.event_init	= dwc_pcie_pmu_event_init,
> +		.add		= dwc_pcie_pmu_event_add,
> +		.del		= dwc_pcie_pmu_event_del,
> +		.start		= dwc_pcie_pmu_event_start,
> +		.stop		= dwc_pcie_pmu_event_stop,
> +		.read		= dwc_pcie_pmu_event_read,
> +		.attr_groups	= pcie_info->pcie_pmu_attr_groups,
> +		.capabilities	= PERF_PMU_CAP_NO_EXCLUDE,
> +	};
> +
> +	name = devm_kasprintf(priv->dev, GFP_KERNEL, "pcie_bdf_%x",
> +			      pcie_info->bdf);
> +	if (!name)
> +		return -ENOMEM;
> +
> +	/* Pick one CPU to be the preferred one to use */
> +	pcie_pmu->on_cpu = raw_smp_processor_id();
Above there are references to multiple dies.  Maybe at least make sure you
are on a near by die? (I'm guessing at topology!)
> +
> +	ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
> +	if (ret) {
> +		pci_err(pcie_info->pdev, "Error %d registering PMU @%x\n", ret,
> +				 pcie_info->bdf);
> +		return ret;
> +	}
> +
> +	pcie_info->pmu_is_register = DWC_PCIE_PMU_HAS_REGISTER;

As below. I think you can drop this state info.

> +
> +	return ret;
> +}
> +
> +static int dwc_pcie_pmu_remove(struct platform_device *pdev)
> +{
> +	struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
> +	int index;
> +	struct dwc_pcie_pmu *pcie_pmu;
> +
> +	for (index = 0; index < priv->pcie_ctrl_num; index++)
> +		if (priv->pcie_table[index].pmu_is_register) {
> +			pcie_pmu = &priv->pcie_table[index].pcie_pmu;
> +			perf_pmu_unregister(&pcie_pmu->pmu);
> +		}
> +	return 0;
> +}
> +
> +static int dwc_pcie_pmu_probe(struct platform_device *pdev)
> +{
> +	int ret = 0;

Initialized in all paths where it is used. Compiler should be able to tell
that so I doubt you need this to be set to 0 here.

> +	int pcie_index;
> +	struct dwc_pcie_pmu_priv *priv;
> +
> +	priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
> +	if (!priv)
> +		return -ENOMEM;
> +	priv->dev = &pdev->dev;
> +	platform_set_drvdata(pdev, priv);
> +
> +	/* If PMU is not support on current platform, keep slient */
> +	if (dwc_pcie_pmu_discover(priv))
> +		return 0;
> +
> +	for (pcie_index = 0; pcie_index < priv->pcie_ctrl_num; pcie_index++) {
> +		struct pci_dev *rp = priv->pcie_table[pcie_index].pdev;
> +
> +		ret = __dwc_pcie_pmu_probe(priv, &priv->pcie_table[pcie_index]);
> +		if (ret) {
> +			dev_err(&rp->dev, "PCIe PMU probe fail\n");
> +			goto pmu_unregister;
> +		}
> +	}
> +	dev_info(&pdev->dev, "PCIe PMUs registered\n");

Noise in the logs.  There are lots of ways to know if we reached this point
so this adds no value.

> +
> +	return 0;
> +
> +pmu_unregister:
> +	dwc_pcie_pmu_remove(pdev);

I'd much rather see the unwind here directly so we can clearly see that it undoes
the result of errors in this function.  That removes the need to use the
is_registered flag in the remove() function simplifying that flow as well.


> +
> +	return ret;
> +}
> +
> +static struct platform_driver dwc_pcie_pmu_driver = {
> +	.probe = dwc_pcie_pmu_probe,
> +	.remove = dwc_pcie_pmu_remove,
> +	.driver = {.name = DRV_NAME,},
More common to format as
	.driver = {
		.name = "dwc_pcie_pmu",
	},
};
Note use of string here.  Using a define just forces people to
look for this in the wrong place.

> +};
> +
> +static int __init dwc_pcie_pmu_init(void)
> +{
> +	int ret;
> +
> +	ret = platform_driver_register(&dwc_pcie_pmu_driver);
> +
> +	if (ret)
> +		return ret;
> +
> +	dwc_pcie_pmu_dev =
> +	    platform_device_register_simple(DEV_NAME, -1, NULL, 0);

I'd normally expect to see the device created as a result of firmware
description (ACPI DSDT / or Device tree)
It is unusual to create a 'real' device directly in the driver
init - that's normally reserved for various fake / software devices.


> +	if (IS_ERR(dwc_pcie_pmu_dev)) {
> +		platform_driver_unregister(&dwc_pcie_pmu_driver);
> +		return PTR_ERR(dwc_pcie_pmu_dev);
> +	}
> +
> +	return 0;
> +}
> +
> +static void __exit dwc_pcie_pmu_exit(void)
> +{
> +	platform_device_unregister(dwc_pcie_pmu_dev);
> +	platform_driver_unregister(&dwc_pcie_pmu_driver);
> +}
> +
> +module_init(dwc_pcie_pmu_init);
> +module_exit(dwc_pcie_pmu_exit);
> +
> +MODULE_DESCRIPTION("PMU driver for DesignWare Cores PCI Express Controller");
> +MODULE_AUTHOR("xueshuai@...ux.alibaba.com");
> +MODULE_AUTHOR("yinxuan_cw@...ux.alibaba.com");
> +MODULE_LICENSE("GPL v2");

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ