[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210114172620.GC944463@unreal>
Date: Thu, 14 Jan 2021 19:26:20 +0200
From: Leon Romanovsky <leon@...nel.org>
To: Alexander Duyck <alexander.duyck@...il.com>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>,
Saeed Mahameed <saeedm@...dia.com>,
Jason Gunthorpe <jgg@...dia.com>,
Jakub Kicinski <kuba@...nel.org>,
linux-pci <linux-pci@...r.kernel.org>,
linux-rdma@...r.kernel.org, Netdev <netdev@...r.kernel.org>,
Don Dutile <ddutile@...hat.com>,
Alex Williamson <alex.williamson@...hat.com>
Subject: Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read
number of MSI-X vectors
On Thu, Jan 14, 2021 at 08:40:14AM -0800, Alexander Duyck wrote:
> On Wed, Jan 13, 2021 at 10:50 PM Leon Romanovsky <leon@...nel.org> wrote:
> >
> > On Wed, Jan 13, 2021 at 02:44:45PM -0800, Alexander Duyck wrote:
> > > On Tue, Jan 12, 2021 at 10:19 PM Leon Romanovsky <leon@...nel.org> wrote:
> > > >
> > > > On Tue, Jan 12, 2021 at 01:34:50PM -0800, Alexander Duyck wrote:
> > > > > On Mon, Jan 11, 2021 at 10:56 PM Leon Romanovsky <leon@...nel.org> wrote:
> > > > > >
> > > > > > On Mon, Jan 11, 2021 at 11:30:39AM -0800, Alexander Duyck wrote:
> > > > > > > On Sun, Jan 10, 2021 at 7:10 AM Leon Romanovsky <leon@...nel.org> wrote:
> > > > > > > >
> > > > > > > > From: Leon Romanovsky <leonro@...dia.com>
> > > > > > > >
> > > > > > > > Some SR-IOV capable devices provide an ability to configure specific
> > > > > > > > number of MSI-X vectors on their VF prior driver is probed on that VF.
> > > > > > > >
> > > > > > > > In order to make management easy, provide new read-only sysfs file that
> > > > > > > > returns a total number of possible to configure MSI-X vectors.
> > > > > > > >
> > > > > > > > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> > > > > > > > = 0 - feature is not supported
> > > > > > > > > 0 - total number of MSI-X vectors to consume by the VFs
> > > > > > > >
> > > > > > > > Signed-off-by: Leon Romanovsky <leonro@...dia.com>
> > > > > > > > ---
> > > > > > > > Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
> > > > > > > > drivers/pci/iov.c | 31 +++++++++++++++++++++++++
> > > > > > > > drivers/pci/pci.h | 3 +++
> > > > > > > > include/linux/pci.h | 2 ++
> > > > > > > > 4 files changed, 50 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > > > index 05e26e5da54e..64e9b700acc9 100644
> > > > > > > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > > > @@ -395,3 +395,17 @@ Description:
> > > > > > > > The file is writable if the PF is bound to a driver that
> > > > > > > > supports the ->sriov_set_msix_vec_count() callback and there
> > > > > > > > is no driver bound to the VF.
> > > > > > > > +
> > > > > > > > +What: /sys/bus/pci/devices/.../sriov_vf_total_msix
> > > > > > >
> > > > > > > In this case I would drop the "vf" and just go with sriov_total_msix
> > > > > > > since now you are referring to a global value instead of a per VF
> > > > > > > value.
> > > > > >
> > > > > > This field indicates the amount of MSI-X available for VFs, it doesn't
> > > > > > include PFs. The missing "_vf_" will mislead users who will believe that
> > > > > > it is all MSI-X vectors available for this device. They will need to take
> > > > > > into consideration amount of PF MSI-X in order to calculate the VF distribution.
> > > > > >
> > > > > > So I would leave "_vf_" here.
> > > > >
> > > > > The problem is you aren't indicating how many are available for an
> > > > > individual VF though, you are indicating how many are available for
> > > > > use by SR-IOV to give to the VFs. The fact that you are dealing with a
> > > > > pool makes things confusing in my opinion. For example sriov_vf_device
> > > > > describes the device ID that will be given to each VF.
> > > >
> > > > sriov_vf_device is different and is implemented accordingly to the PCI
> > > > spec, 9.3.3.11 VF Device ID (Offset 1Ah)
> > > > "This field contains the Device ID that should be presented for every VF
> > > > to the SI."
> > > >
> > > > It is one ID for all VFs.
> > >
> > > Yes, but that is what I am getting at. It is also what the device
> > > configuration will be for one VF. So when I read sriov_vf_total_msix
> > > it reads as the total for a single VF, not all of the the VFs. That is
> > > why I think dropping the "vf_" part of the name would make sense, as
> > > what you are describing is the total number of MSI-X vectors for use
> > > by SR-IOV VFs.
> >
> > I can change to anything as long as new name will give clear indication
> > that this total number is for VFs and doesn't include SR-IOV PF MSI-X.
>
> It is interesting that you make that distinction.
>
> So in the case of the Intel hardware we had one pool of MSI-X
> interrupts that was available for the entire port, both PF and VF.
> When we enabled SR-IOV we had to repartition that pool in order to
> assign interrupts to devices. So it sounds like in your case you don't
> do that and instead the PF is static and the VFs are the only piece
> that is flexible. Do I have that correct?
It is partially correct. The mlx5 devices have ability to change MSI-X
vectors of PF too, but to do so, you will need driver reload and much
more complex user interface. So we (SW) leave it as static.
>
> > >
> > > > >
> > > > > > >
> > > > > > > > +Date: January 2021
> > > > > > > > +Contact: Leon Romanovsky <leonro@...dia.com>
> > > > > > > > +Description:
> > > > > > > > + This file is associated with the SR-IOV PFs.
> > > > > > > > + It returns a total number of possible to configure MSI-X
> > > > > > > > + vectors on the enabled VFs.
> > > > > > > > +
> > > > > > > > + The values returned are:
> > > > > > > > + * > 0 - this will be total number possible to consume by VFs,
> > > > > > > > + * = 0 - feature is not supported
> > > > > > > > +
> > > > > > > > + If no SR-IOV VFs are enabled, this value will return 0.
> > > > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > > > > > > index 42c0df4158d1..0a6ddf3230fd 100644
> > > > > > > > --- a/drivers/pci/iov.c
> > > > > > > > +++ b/drivers/pci/iov.c
> > > > > > > > @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
> > > > > > > > return count;
> > > > > > > > }
> > > > > > > >
> > > > > > > > +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> > > > > > > > + struct device_attribute *attr,
> > > > > > > > + char *buf)
> > > > > > > > +{
> > > > > > > > + struct pci_dev *pdev = to_pci_dev(dev);
> > > > > > > > +
> > > > > > > > + return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
> > > > > > > > +}
> > > > > > > > +
> > > > > > >
> > > > > > > You display it as a signed value, but unsigned values are not
> > > > > > > supported, correct?
> > > > > >
> > > > > > Right, I made it similar to the vf_msix_set. I can change.
> > > > > >
> > > > > > >
> > > > > > > > static DEVICE_ATTR_RO(sriov_totalvfs);
> > > > > > > > static DEVICE_ATTR_RW(sriov_numvfs);
> > > > > > > > static DEVICE_ATTR_RO(sriov_offset);
> > > > > > > > static DEVICE_ATTR_RO(sriov_stride);
> > > > > > > > static DEVICE_ATTR_RO(sriov_vf_device);
> > > > > > > > static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
> > > > > > > > +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> > > > > > > >
> > > > > > > > static struct attribute *sriov_dev_attrs[] = {
> > > > > > > > &dev_attr_sriov_totalvfs.attr,
> > > > > > > > @@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
> > > > > > > > &dev_attr_sriov_stride.attr,
> > > > > > > > &dev_attr_sriov_vf_device.attr,
> > > > > > > > &dev_attr_sriov_drivers_autoprobe.attr,
> > > > > > > > + &dev_attr_sriov_vf_total_msix.attr,
> > > > > > > > NULL,
> > > > > > > > };
> > > > > > > >
> > > > > > > > @@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev)
> > > > > > > > sysfs_remove_link(&dev->dev.kobj, "dep_link");
> > > > > > > >
> > > > > > > > iov->num_VFs = 0;
> > > > > > > > + iov->vf_total_msix = 0;
> > > > > > > > pci_iov_set_numvfs(dev, 0);
> > > > > > > > }
> > > > > > > >
> > > > > > > > @@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
> > > > > > > > }
> > > > > > > > EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> > > > > > > >
> > > > > > > > +/**
> > > > > > > > + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> > > > > > > > + * @dev: the PCI PF device
> > > > > > > > + * @numb: the total number of MSI-X vector to consume by the VFs
> > > > > > > > + *
> > > > > > > > + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> > > > > > > > + * This interface is complimentary part of the pci_set_msix_vec_count()
> > > > > > > > + * that will be used to configure the required number on the VF.
> > > > > > > > + */
> > > > > > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb)
> > > > > > > > +{
> > > > > > > > + if (!dev->is_physfn || !dev->driver ||
> > > > > > > > + !dev->driver->sriov_set_msix_vec_count)
> > > > > > > > + return;
> > > > > > > > +
> > > > > > > > + dev->sriov->vf_total_msix = numb;
> > > > > > > > +}
> > > > > > > > +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> > > > > > > > +
> > > > > > >
> > > > > > > This seems broken. What validation is being done on the numb value?
> > > > > > > You pass it as int, and your documentation all refers to tests for >=
> > > > > > > 0, but isn't a signed input a possibility as well? Also "numb" doesn't
> > > > > > > make for a good abbreviation as it is already a word of its own. It
> > > > > > > might make more sense to use count or something like that rather than
> > > > > > > trying to abbreviate number.
> > > > > >
> > > > > > "Broken" is a nice word to describe misunderstanding.
> > > > >
> > > > > Would you prefer "lacking input validation".
> > > > >
> > > > > I see all this code in there checking for is_physfn and driver and
> > > > > sriov_set_msix_vec_count before allowing the setting of vf_total_msix.
> > > > > It just seems like a lot of validation is taking place on the wrong
> > > > > things if you are just going to be setting a value reporting the total
> > > > > number of MSI-X vectors in use for SR-IOV.
> > > >
> > > > All those checks are in place to ensure that we are not overwriting the
> > > > default value, which is 0.
> > >
> > > Okay, so what you really have is surplus interrupts that you are
> > > wanting to give out to VF devices. So when we indicate 0 here as the
> > > default it really means we have no additional interrupts to give out.
> > > Am I understanding that correctly?
> >
> > The vf_total_msix is static value and shouldn't be recalculated after
> > every MSI-X vector number change. So 0 means that driver doesn't support
> > at all this feature. The operator is responsible to make proper assignment
> > calculations, because he is already doing it for the CPUs and netdev queues.
>
> Honestly that makes things even uglier. So basically this is a feature
> where if it isn't supported it will make it look like the SR-IOV
> device doesn't support MSI-X. I realize it is just the way it is
> worded but it isn't very pretty.
I'm not native English speaker, we can work together later on the
Documentation to make it more clear.
>
> > >
> > > The problem is this is very vendor specific so I am doing my best to
> > > understand it as it is different then the other NICs I have worked
> > > with.
> >
> > There is nothing vendor specific here. There are two types of devices:
> > 1. Support this feature. - The vf_total_msix will be greater than 0 for them
> > and their FW will do sanity checks when user overwrites their default number
> > that they sat in the VF creation stage.
> > 2. Doesn't support this feature - The vf_total_msix will be 0.
> >
> > It is PCI spec, so those "other NICs" that didn't implement the PCI spec
> > will stay with option #2. It is not different from current situation.
>
> Where in the spec is this?
>
> I know in the PCI spec it says that the MSI-X table size is read-only
> and is not supposed to be written by system software. That is what is
> being overwritten right now by your patches that has me concerned.
My patches are not over-writting anything, they are asking from FW to
set more appropriate value. The field stays read-only.
<...>
> >
> > I remind you that this feature is applicable to all SR-IOV devices, we have
> > RDMA, NVMe, crypto, FPGA and netdev VFs. The devlink is not supported
> > outside of netdev world and implementation there will make this feature
> > is far from being usable.
>
> To quote the documentation:
> "devlink is an API to expose device information and resources not directly
> related to any device class, such as chip-wide/switch-ASIC-wide configuration."
There is a great world outside of netdev and it doesn't include devlink. :)
>
> > Right now, the configuration of PCI/core is done through sysfs, so let's
> > review this feature from PCI/core perspective and not from netdev where
> > sysfs is not widely used.
>
> The problem is what you are configuring is not necessarily PCI device
> specific. You are configuring the firmware which operates at a level
> above the PCI device. You just have it manifesting as a PCI behavior
> as you are editing a read-only configuration space field.
In our devices, PCI config space is managed by FW and it represents HW.
>
> Also as I mentioned a few times now the approach you are taking
> violates the PCIe spec by essentially providing a means of making a
> read-only field writable.
AS you said, we see the same picture differently.
Thansk
Powered by blists - more mailing lists