[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <89efd20f-65f2-c082-1eb4-4e308957ff59@linux.alibaba.com>
Date: Mon, 26 Sep 2022 21:31:34 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: Jonathan Cameron <Jonathan.Cameron@...wei.com>,
Bjorn Helgaas <helgaas@...nel.org>
Cc: will@...nel.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, rdunlap@...radead.org,
robin.murphy@....com, mark.rutland@....com,
baolin.wang@...ux.alibaba.com, zhuo.song@...ux.alibaba.com,
linux-pci@...r.kernel.org
Subject: Re: [PATCH v1 2/3] drivers/perf: add DesignWare PCIe PMU driver
+ Bjorn Helgaas
在 2022/9/23 PM11:54, Jonathan Cameron 写道:
>
>>
>>>
>>>> +#define RP_NUM_MAX 32 /* 2die * 4RC * 4Ctrol */
>>>
>>> This driver is 'almost' generic. So if you an avoid defines based on a particular
>>> platform that's definitely good!
>>
>> Good idea. How about defining RP_NUM_MAX as 64? As fars as I know,
>> some platfrom use 2 sockets, 2 die per socket.
>> Then 2 sockets * 2 dies * 4 Root Complex * 4 root port.
>
> Setting a reasonable maximum is fine - but make sure the code then fails with
> a suitable error message if there are more!
OK, I will add a discovery logic here and count PMU number at runtime.
>
>
>>>> +#define DWC_PCIE_LANE_SHIFT 4
>>>> +#define DWC_PCIE_LANE_MASK GENMASK(9, 4)
>>>> +
>>>> +#define DWC_PCIE_EVENT_CNT_CTRL 0x8
>>>> +#define DWC_PCIE__CNT_EVENT_SELECT_SHIFT 16
>>>
>>> Why double __? If point is , then
>>> naming works better
>>> DWC_PCIE_EVENT_CNT_CTRL_REG
>>> DWC_PCIE_EVENT_CNT_CTRL_EV_SELECT_MSK etc
>>
>> Yes, I point to use double `__` to indicate it is a field of register,
>> as CMN and CCN drivers do. I also considered naming with REG explicitly,
>> but the macro is so long that I often have to wrap code into multilines.
>> Any way, it's fine to rename if you still suggest to do so.
>
> I don't particularly mind. This convention was new to me.
Haha, then I will leave the double `__` as CMN and CCN drivers do.
>>>> +struct dwc_pcie_pmu_priv {
>>>> + struct device *dev;
>>>> + u32 pcie_ctrl_num;
>>>> + struct dwc_pcie_info_table *pcie_table;
>>>> +};
>>>> +
>>>> +#define DWC_PCIE_CREATE_BDF(seg, bus, dev, func) \
>>>> + (((seg) << 24) | (((bus) & 0xFF) << 16) | (((dev) & 0xFF) << 8) | (func))
>>>
>>> Superficially this looks pretty standard. Why is is DWC specific?
>>
>> You are right, it is not DWC specific.
>>
>> I found a similar definition in arch/ia64/pci/pci.c .
>>
>> #define PCI_SAL_ADDRESS(seg, bus, devfn, reg) \
>> (((u64) seg << 24) | (bus << 16) | (devfn << 8) | (reg))
>>
>> Should we move it into a common header first?
>
> Maybe. The bus, devfn, reg part is standard bdf, but I don't think
> the PCI 6.0 spec defined a version with the seg in the upper bits.
> I'm not sure if we want to adopt that in LInux.
I found lots of code use seg,bus,devfn,reg with format "%04x:%02x:%02x.%x",
I am not quite familiar with PCIe spec. What do you think about it, Bjorn?
>
>>>
>>>> + pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &header);
>>>> + /* Is the device part of a DesignWare Cores PCIe Controller ? */
>>>
>>> Good question... This code doesn't check that. VSEC ID is matched only with
>>> the Vendor ID of the devices - unlike DVSEC where this would all be nice
>>> and local.
>>
>> I think a similar fashion is
>>
>> u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap)
>>
>> As you see, I don't want to limit this driver to a specific vendor, like
>> Alibaba (0x1ded), because this driver is generic to all DesignWare Cores PCIe
>> Controller. Therefore, dwc_pcie_find_ras_des_cap_position does not check vendor
>> like pci_find_vsec_capability.
>
> You can't do that because another vendor could use the same VSEC ID for
> an entirely different purpose. They are only valid in combination with the device VID.
It make sense to me.
>
> The only way this can work is with a list of specific vendor ID / VSEC pairs for
> known devices.
>
>>
>> Do you mean to use DVSEC instead? I try to read out DVSEC with lspci:
>>
>> # lspci -vvv
>> b0:00.0 PCI bridge: Alibaba (China) Co., Ltd. M1 Root Port (rev 01) (prog-if 00 [Normal decode])
>> [...snip...]
>> Capabilities: [374 v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
>> Capabilities: [474 v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
>> Capabilities: [4ac v1] Data Link Feature <?>
>> Capabilities: [4b8 v1] Designated Vendor-Specific: Vendor=0001 ID=0000 Rev=1 Len=64 <?>
>> Capabilities: [4fc v1] Vendor Specific Information: ID=0005 Rev=1 Len=018 <?>
>>
>> How can we tell it's a DesignWare Cores PCIe Controller?
>
> Gah. This is what DVSEC was defined to solve. It lets you have a common
> vendor defined extended capability defined by a vendor, independent of the
> VID of a given device. With a VSEC you can't write generic code.
>
Got it. But I don't see any description about RAS_DES_CAP register relate to DVSEC
in PCIe Controller TRM. I will check this later.
>>
>>>> + if (PCI_VNDR_HEADER_ID(header) == DWC_PCIE_VSEC_ID &&
>>>> + PCI_VNDR_HEADER_REV(header) == DWC_PCIE_VSEC_REV) {
>>>> + *pos = vsec;
>>>> + return 0;
>>>> + }
>>>> + }
>>>> +
>>>> + return -ENODEV;
>>>> +}
>>>> +
>>>> +static int dwc_pcie_pmu_discover(struct dwc_pcie_pmu_priv *priv)
>>>> +{
>>>> + int val, where, index = 0;
>>>> + struct pci_dev *pdev = NULL;
>>>> + struct dwc_pcie_info_table *pcie_info;
>>>> +
>>>> + priv->pcie_table =
>>>> + devm_kcalloc(priv->dev, RP_NUM_MAX, sizeof(*pcie_info), GFP_KERNEL);
>>>> + if (!priv->pcie_table)
>>>> + return -EINVAL;
>>>> +
>>>> + pcie_info = priv->pcie_table;
>>>> + while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev)) != NULL &&
>>>> + index < RP_NUM_MAX) {
>>>
>>> This having a driver than then walks the pci topology to find root ports and add
>>> extra stuff to them is not a clean solution.
>>>
>>> The probing should be driven from the existing PCI driver topology.
>>> There are a bunch of new features we need to add to ports in the near future
>>> anyway - this would just be another one.
>>> Same problem exists for CXL CPMU perf devices - so far we only support those
>>> on end points, partly because we need a clean way to probe them on pci ports.
>>>
>>> Whatever we come up with there will apply here as well.
>>
>> I see your point. Any link to reference?
>
> No, though hopefully we'll get to some sort of plan in the branch of this thread
> that Bjorn comment in.
>
OK.
>>
>>>
>>>> + if (!pci_dev_is_rootport(pdev))
>>>> + continue;
>>>> +
>>>> + pcie_info[index].bdf = dwc_pcie_get_bdf(pdev);
>>>> + pcie_info[index].pdev = pdev;
>>> Probably want a sanity check this has a vendor ID appropriate the VSEC you are about
>>> to look for.
>>
>> If I check the vendor ID here or in dwc_pcie_find_ras_des_cap_position, this driver
>> will only work for Alibaba as I mentioned before.
>
> Agreed. Unfortunately that's all you can do safely as VSEC IDs are not a global
> namespace.
Should we add a sanity check with a vendor list in dwc_pcie_find_ras_des_cap_position?
>>
>>>> +
>>>> + ret = dwc_pcie_pmu_write_dword(pcie_info, DWC_PCIE_EVENT_CNT_CTRL, val);
>>>> + if (ret)
>>>> + pci_err(pcie_info->pdev, "PCIe write fail\n");
>>>> +
>>>> + return ret;
>>>> +}
>>>
>>> ...
>>>
>>>> +
>>>> +static int dwc_pcie_pmu_read_base_time_counter(struct dwc_pcie_info_table
>>>> + *pcie_info, u64 *counter)
>>>> +{
>>>> + u32 ret, val;
>>>> +
>>>> + ret = dwc_pcie_pmu_read_dword(pcie_info,
>>>> + DWC_PCIE_TIME_BASED_ANALYSIS_DATA_REG_HIGH,
>>>> + &val);
>>>> + if (ret) {
>>>> + pci_err(pcie_info->pdev, "PCIe read fail\n");
>>>> + return ret;
>>>> + }
>>>> +
>>>> + *counter = val;
>>>> + *counter <<= 32;
>>>
>>> This looks like you could get ripping between the upper and lower dwords.
>>> What prevents that? Perhaps a comment to say why that's not a problem?
>>
>> The Time-based Analysis Data which contains the measurement results of
>> RX/TX data throughput and time spent in each low-power LTSSM state is 64 bit.
>> The data is provided by two 32 bit registers so I rip them together. I will
>> add a comment here in next verison.
>
> If I understand correctly the only safe way to read this is in a try / retry loop.
> Read the upper part, then the lower part, then reread the upper part.
> If the upper part is unchanged you did not get ripping across the two registers.
> If it changes, try again.
It make sence to me, I will fix it in next version.
>
>>
>>>
>>>> +
>>>> + ret = dwc_pcie_pmu_read_dword(pcie_info,
>>>> + DWC_PCIE_TIME_BASED_ANALYSIS_DATA_REG_LOW,
>>>> + &val);
>>>> + if (ret) {
>>>> + pci_err(pcie_info->pdev, "PCIe read fail\n");
>>>> + return ret;
>>>> + }
>>>> +
>>>> + *counter += val;
>>>> +
>>>> + return ret;
>>>> +}
>>> ...
>>>
>>>> +
>>>> + ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
>>>> + if (ret) {
>>>> + pci_err(pcie_info->pdev, "Error %d registering PMU @%x\n", ret,
>>>> + pcie_info->bdf);
>>>> + return ret;
>>>> + }
>>>> +
>>>> + pcie_info->pmu_is_register = DWC_PCIE_PMU_HAS_REGISTER;
>>>
>>> As below. I think you can drop this state info.
>>
>> Please see my confusion bellow.
>>
>>>
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static int dwc_pcie_pmu_remove(struct platform_device *pdev)
>>>> +{
>>>> + struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
>>>> + int index;
>>>> + struct dwc_pcie_pmu *pcie_pmu;
>>>> +
>>>> + for (index = 0; index < priv->pcie_ctrl_num; index++)
>>>> + if (priv->pcie_table[index].pmu_is_register) {
>>>> + pcie_pmu = &priv->pcie_table[index].pcie_pmu;
>>>> + perf_pmu_unregister(&pcie_pmu->pmu);
>>>> + }
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static int dwc_pcie_pmu_probe(struct platform_device *pdev)
>>>> +{
>>>> + int ret = 0;
>>>
>>> Initialized in all paths where it is used. Compiler should be able to tell
>>> that so I doubt you need this to be set to 0 here.
>>
>> Agree, will leave it as uninitialized.
>>
>>>
>>>> + int pcie_index;
>>>> + struct dwc_pcie_pmu_priv *priv;
>>>> +
>>>> + priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
>>>> + if (!priv)
>>>> + return -ENOMEM;
>>>> + priv->dev = &pdev->dev;
>>>> + platform_set_drvdata(pdev, priv);
>>>> +
>>>> + /* If PMU is not support on current platform, keep slient */
>>>> + if (dwc_pcie_pmu_discover(priv))
>>>> + return 0;
>>>> +
>>>> + for (pcie_index = 0; pcie_index < priv->pcie_ctrl_num; pcie_index++) {
>>>> + struct pci_dev *rp = priv->pcie_table[pcie_index].pdev;
>>>> +
>>>> + ret = __dwc_pcie_pmu_probe(priv, &priv->pcie_table[pcie_index]);
>>>> + if (ret) {
>>>> + dev_err(&rp->dev, "PCIe PMU probe fail\n");
>>>> + goto pmu_unregister;
>>>> + }
>>>> + }
>>>> + dev_info(&pdev->dev, "PCIe PMUs registered\n");
>>>
>>> Noise in the logs. There are lots of ways to know if we reached this point
>>> so this adds no value.
>>
>> Got it, will drop this out in next version.
>>
>>>
>>>> +
>>>> + return 0;
>>>> +
>>>> +pmu_unregister:
>>>> + dwc_pcie_pmu_remove(pdev);
>>>
>>> I'd much rather see the unwind here directly so we can clearly see that it undoes
>>> the result of errors in this function. That removes the need to use the
>>> is_registered flag in the remove() function simplifying that flow as well.
>>
>> Do you mean that if perf_pmu_register fails, then jump to pmu_unregister lable directly?
>> How can we tell which PMU diveice fails to reigister?
>
> pcie_index will be set to the index of the PMU device that failed - so loops backwards
> from that removing them.
Good idea. I will fix it in next version.
>>
> .
>>
>>>
>>>> +};
>>>> +
>>>> +static int __init dwc_pcie_pmu_init(void)
>>>> +{
>>>> + int ret;
>>>> +
>>>> + ret = platform_driver_register(&dwc_pcie_pmu_driver);
>>>> +
>>>> + if (ret)
>>>> + return ret;
>>>> +
>>>> + dwc_pcie_pmu_dev =
>>>> + platform_device_register_simple(DEV_NAME, -1, NULL, 0);
>>>
>>> I'd normally expect to see the device created as a result of firmware
>>> description (ACPI DSDT / or Device tree)
>>> It is unusual to create a 'real' device directly in the driver
>>> init - that's normally reserved for various fake / software devices.
>>
>> I see your concerns. You mentioned that
>>
>> > The probing should be driven from the existing PCI driver topology.
>>
>> Should we add a fake device in firmware or drive from PCI driver topology?
>
> Ah. I was reviewing backwards so when I wrote this hadn't realized you walk
> the PCI topology. PCI driver topology is the right solution here.
I see, I will use PCI driver topology instead.
>
>>
>> Thank you.
>>
>> Best Regards,
>> Shuai
>>
Powered by blists - more mailing lists