lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 7 Jan 2021 22:54:38 -0500
From:   Don Dutile <ddutile@...hat.com>
To:     Bjorn Helgaas <helgaas@...nel.org>,
        Leon Romanovsky <leon@...nel.org>
Cc:     Bjorn Helgaas <bhelgaas@...gle.com>,
        Saeed Mahameed <saeedm@...dia.com>,
        Leon Romanovsky <leonro@...dia.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Jakub Kicinski <kuba@...nel.org>, linux-pci@...r.kernel.org,
        linux-rdma@...r.kernel.org, netdev@...r.kernel.org,
        Alex Williamson <alex.williamson@...hat.com>
Subject: Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for
 SR-IOV VFs

On 1/7/21 7:57 PM, Bjorn Helgaas wrote:
> [+cc Alex, Don]
>
> This patch does not actually *configure* the number of vectors, so the
> subject is not quite accurate.  IIUC, this patch adds a sysfs file
> that can be used to configure the number of vectors.  The subject
> should mention the sysfs connection.
>
> On Sun, Jan 03, 2021 at 10:24:37AM +0200, Leon Romanovsky wrote:
>> From: Leon Romanovsky <leonro@...dia.com>
>>
>> This function is applicable for SR-IOV VFs because such devices allocate
>> their MSI-X table before they will run on the targeted hardware and they
>> can't guess the right amount of vectors.
> This sentence doesn't quite have enough context to make sense to me.
> Per PCIe r5.0, sec 9.5.1.2, I think PFs and VFs have independent MSI-X
> Capabilities.  What is the connection between the PF MSI-X and the VF
> MSI-X?
+1... strip this commit log section and write it with correct, technical content.
PFs & VF's have indep MSIX caps.

Q: is this an issue where (some) mlx5's have a large msi-x capability (per VF) that may overwhelm a system's, (pci-(sub)-tree) MSI / intr capability,
and this is a sysfs-based tuning knob to reduce the max number on such 'challenged' systems?
-- ah; reading further below, it's based on some information gleemed from the VM's capability for intr. support.
     -- or maybe IOMMU (intr) support on the host system, and the VF can't exceed it or config failure in VM... whatever... its some VM cap that's being accomodated.
> The MSI-X table sizes should be determined by the Table Size in the
> Message Control register.  Apparently we write a VF's Table Size
> before a driver is bound to the VF?  Where does that happen?
>
> "Before they run on the targeted hardware" -- do you mean before the
> VF is passed through to a guest virtual machine?  You mention "target
> VM" below, which makes more sense to me.  VFs don't "run"; they're not
> software.  I apologize for not being an expert in the use of VFs.
>
> Please mention the sysfs path in the commit log.
>
>> Signed-off-by: Leon Romanovsky <leonro@...dia.com>
>> ---
>>   Documentation/ABI/testing/sysfs-bus-pci | 16 +++++++
>>   drivers/pci/iov.c                       | 57 +++++++++++++++++++++++++
>>   drivers/pci/msi.c                       | 30 +++++++++++++
>>   drivers/pci/pci-sysfs.c                 |  1 +
>>   drivers/pci/pci.h                       |  1 +
>>   include/linux/pci.h                     |  8 ++++
>>   6 files changed, 113 insertions(+)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
>> index 25c9c39770c6..30720a9e1386 100644
>> --- a/Documentation/ABI/testing/sysfs-bus-pci
>> +++ b/Documentation/ABI/testing/sysfs-bus-pci
>> @@ -375,3 +375,19 @@ Description:
>>   		The value comes from the PCI kernel device state and can be one
>>   		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
>>   		The file is read only.
>> +
>> +What:		/sys/bus/pci/devices/.../vf_msix_vec
>> +Date:		December 2020
>> +Contact:	Leon Romanovsky <leonro@...dia.com>
>> +Description:
>> +		This file is associated with the SR-IOV VFs. It allows overwrite
>> +		the amount of MSI-X vectors for that VF. This is needed to optimize
>> +		performance of newly bounded devices by allocating the number of
>> +		vectors based on the internal knowledge of targeted VM.
> s/allows overwrite/allows configuration of/
> s/for that/for the/
> s/amount of/number of/
> s/bounded/bound/
>
> What "internal knowledge" is this?  AFAICT this would have to be some
> user-space administration knowledge, not anything internal to the
> kernel.
Correct; likely a libvirt VM (section of its) description;

>
>> +		The values accepted are:
>> +		 * > 0 - this will be number reported by the PCI VF's PCIe MSI-X capability.
> s/PCI// (it's obvious we're talking about PCI here)
> s/PCIe// (MSI-X is not PCIe-specific, and there's no need to mention
> it at all)
>
>> +		 * < 0 - not valid
>> +		 * = 0 - will reset to the device default value
>> +
>> +		The file is writable if no driver is bounded.
>  From the code, it looks more like this:
>
>    The file is writable if the PF is bound to a driver that supports
>    the ->sriov_set_msix_vec_count() callback and there is no driver
>    bound to the VF.
>
> Please wrap all of this to fit in 80 columns like the rest of the file.
>
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index 4afd4ee4f7f0..0f8c570361fc 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
>>   	return (dev->devfn + dev->sriov->offset +
>>   		dev->sriov->stride * vf_id) & 0xff;
>>   }
>> +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
>>
>>   /*
>>    * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
>> @@ -426,6 +427,62 @@ const struct attribute_group sriov_dev_attr_group = {
>>   	.is_visible = sriov_attrs_are_visible,
>>   };
>>
>> +#ifdef CONFIG_PCI_MSI
>> +static ssize_t vf_msix_vec_show(struct device *dev,
>> +				struct device_attribute *attr, char *buf)
>> +{
>> +	struct pci_dev *pdev = to_pci_dev(dev);
>> +	int numb = pci_msix_vec_count(pdev);
>> +
>> +	if (numb < 0)
>> +		return numb;
>> +
>> +	return sprintf(buf, "%d\n", numb);
>> +}
>> +
>> +static ssize_t vf_msix_vec_store(struct device *dev,
>> +				 struct device_attribute *attr, const char *buf,
>> +				 size_t count)
>> +{
>> +	struct pci_dev *vf_dev = to_pci_dev(dev);
>> +	int val, ret;
>> +
>> +	ret = kstrtoint(buf, 0, &val);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = pci_set_msix_vec_count(vf_dev, val);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return count;
>> +}
>> +static DEVICE_ATTR_RW(vf_msix_vec);
>> +#endif
>> +
>> +static struct attribute *sriov_vf_dev_attrs[] = {
>> +#ifdef CONFIG_PCI_MSI
>> +	&dev_attr_vf_msix_vec.attr,
>> +#endif
>> +	NULL,
>> +};
>> +
>> +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
>> +					  struct attribute *a, int n)
>> +{
>> +	struct device *dev = kobj_to_dev(kobj);
>> +
>> +	if (dev_is_pf(dev))
>> +		return 0;
>> +
>> +	return a->mode;
>> +}
>> +
>> +const struct attribute_group sriov_vf_dev_attr_group = {
>> +	.attrs = sriov_vf_dev_attrs,
>> +	.is_visible = sriov_vf_attrs_are_visible,
>> +};
>> +
>>   int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>>   {
>>   	return 0;
>> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
>> index 3162f88fe940..0bcd705487d9 100644
>> --- a/drivers/pci/msi.c
>> +++ b/drivers/pci/msi.c
>> @@ -991,6 +991,36 @@ int pci_msix_vec_count(struct pci_dev *dev)
>>   }
>>   EXPORT_SYMBOL(pci_msix_vec_count);
>>
>> +/**
>> + * pci_set_msix_vec_count - change the reported number of MSI-X vectors.
> Drop period at end, as other kernel doc in this file does.
>
>> + * This function is applicable for SR-IOV VFs because such devices allocate
>> + * their MSI-X table before they will run on the targeted hardware and they
>> + * can't guess the right amount of vectors.
>> + * @dev: VF device that is going to be changed.
>> + * @numb: amount of MSI-X vectors.
> Rewrite the "such devices allocate..." part based on the questions in
> the commit log.  Same with "targeted hardware."
>
> s/amount of/number of/
> Drop periods at end of parameter descriptions.
>
>> + **/
>> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
>> +{
>> +	struct pci_dev *pdev = pci_physfn(dev);
>> +
>> +	if (!dev->msix_cap || !pdev->msix_cap)
>> +		return -EINVAL;
>> +
>> +	if (dev->driver || !pdev->driver ||
>> +	    !pdev->driver->sriov_set_msix_vec_count)
>> +		return -EOPNOTSUPP;
>> +
>> +	if (numb < 0)
>> +		/*
>> +		 * We don't support negative numbers for now,
>> +		 * but maybe in the future it will make sense.
>> +		 */
>> +		return -EINVAL;
>> +
>> +	return pdev->driver->sriov_set_msix_vec_count(dev, numb);
> So we write to a VF sysfs file, get here and look up the PF, call a PF
> driver callback with the VF as an argument, the callback (at least for
> mlx5) looks up the PF from the VF, then does some mlx5-specific magic
> to the PF that influences the VF somehow?
There's no PF lookup above.... it's just checking if a pdev has a driver with the desired msix-cap setting(reduction) feature.

> Help me connect the dots here.  Is this required because of something
> peculiar to mlx5, or is something like this required for all SR-IOV
> devices because of the way the PCIe spec is written?
So, overall, I'm guessing the mlx5 device can have 1000's of MSIX -- say, one per send/receive/completion queue.
This device capability may exceed the max number MSIX a VM can have/support (depending on guestos).
So, a sysfs tunable is used to set the max MSIX available, and thus, the device puts >1 send/rcv/completion queue intr on a given MSIX.

ok, time for Leon to better state what this patch does,
and why it's needed on mlx5 (and may be applicable to other/future high-MSIX devices assigned to VMs (NVME?)).
Hmmm, now that I said it, why is it SRIOV-centric and not pci-device centric (can pass a PF w/high number of MSIX to a VM).

-Don
>> +}
>> +EXPORT_SYMBOL(pci_set_msix_vec_count);
>> +
>>   static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
>>   			     int nvec, struct irq_affinity *affd, int flags)
>>   {
>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>> index fb072f4b3176..0af2222643c2 100644
>> --- a/drivers/pci/pci-sysfs.c
>> +++ b/drivers/pci/pci-sysfs.c
>> @@ -1557,6 +1557,7 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
>>   	&pci_dev_hp_attr_group,
>>   #ifdef CONFIG_PCI_IOV
>>   	&sriov_dev_attr_group,
>> +	&sriov_vf_dev_attr_group,
>>   #endif
>>   	&pci_bridge_attr_group,
>>   	&pcie_dev_attr_group,
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 5c59365092fa..46396a5da2d9 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -502,6 +502,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
>>   void pci_restore_iov_state(struct pci_dev *dev);
>>   int pci_iov_bus_range(struct pci_bus *bus);
>>   extern const struct attribute_group sriov_dev_attr_group;
>> +extern const struct attribute_group sriov_vf_dev_attr_group;
>>   #else
>>   static inline int pci_iov_init(struct pci_dev *dev)
>>   {
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index b32126d26997..1acba40a1b1b 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -856,6 +856,8 @@ struct module;
>>    *		e.g. drivers/net/e100.c.
>>    * @sriov_configure: Optional driver callback to allow configuration of
>>    *		number of VFs to enable via sysfs "sriov_numvfs" file.
>> + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
>> + *              exposed by the sysfs "vf_msix_vec" entry.
>>    * @err_handler: See Documentation/PCI/pci-error-recovery.rst
>>    * @groups:	Sysfs attribute groups.
>>    * @driver:	Driver model structure.
>> @@ -871,6 +873,7 @@ struct pci_driver {
>>   	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
>>   	void (*shutdown)(struct pci_dev *dev);
>>   	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
>> +	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
>>   	const struct pci_error_handlers *err_handler;
>>   	const struct attribute_group **groups;
>>   	struct device_driver	driver;
>> @@ -1464,6 +1467,7 @@ struct msix_entry {
>>   int pci_msi_vec_count(struct pci_dev *dev);
>>   void pci_disable_msi(struct pci_dev *dev);
>>   int pci_msix_vec_count(struct pci_dev *dev);
>> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb);
> This patch adds the pci_set_msix_vec_count() definition in pci/msi.c
> and a call in pci/iov.c.  It doesn't need to be declared in
> include/linux/pci.h or exported.  It can be declared in
> drivers/pci/pci.h.
>
>>   void pci_disable_msix(struct pci_dev *dev);
>>   void pci_restore_msi_state(struct pci_dev *dev);
>>   int pci_msi_enabled(void);
>> @@ -2402,6 +2406,10 @@ static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
>>   void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
>>   #endif
>>
>> +#ifdef CONFIG_PCI_IOV
>> +int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id);
>> +#endif
> pci_iov_virtfn_devfn() is already declared in this file.
>
>>   /* Provide the legacy pci_dma_* API */
>>   #include <linux/pci-dma-compat.h>
>>
>> --
>> 2.29.2
>>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ