[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210401012340.GA1423690@bjorn-Precision-5520>
Date: Wed, 31 Mar 2021 20:23:40 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: Leon Romanovsky <leon@...nel.org>
Cc: Jason Gunthorpe <jgg@...pe.ca>,
Alexander Duyck <alexander.duyck@...il.com>,
Keith Busch <kbusch@...nel.org>,
Bjorn Helgaas <bhelgaas@...gle.com>,
Saeed Mahameed <saeedm@...dia.com>,
Jakub Kicinski <kuba@...nel.org>,
linux-pci <linux-pci@...r.kernel.org>,
linux-rdma@...r.kernel.org, Netdev <netdev@...r.kernel.org>,
Don Dutile <ddutile@...hat.com>,
Alex Williamson <alex.williamson@...hat.com>,
"David S . Miller" <davem@...emloft.net>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"Rafael J. Wysocki" <rafael@...nel.org>
Subject: Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
[+cc Rafael, in case you're interested in the driver core issue here]
On Wed, Mar 31, 2021 at 07:08:07AM +0300, Leon Romanovsky wrote:
> On Tue, Mar 30, 2021 at 03:41:41PM -0500, Bjorn Helgaas wrote:
> > On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> > > > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > > > >
> > > > > > I think I misunderstood Greg's subdirectory comment. We already have
> > > > > > directories like this:
> > > > >
> > > > > Yes, IIRC, Greg's remark applies if you have to start creating
> > > > > directories with manual kobjects.
> > > > >
> > > > > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > > > > attributes. So I think we could do something like this:
> > > > > >
> > > > > > /sys/bus/pci/devices/0000:01:00.0/ # PF directory
> > > > > > sriov/ # SR-IOV related stuff
> > > > > > vf_total_msix
> > > > > > vf_msix_count_BB:DD.F # includes bus/dev/fn of first VF
> > > > > > ...
> > > > > > vf_msix_count_BB:DD.F # includes bus/dev/fn of last VF
> > > > >
> > > > > It looks a bit odd that it isn't a subdirectory, but this seems
> > > > > reasonable.
> > > >
> > > > Sorry, I missed your point; you'll have to lay it out more explicitly.
> > > > I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> > > > directory. The full paths would be:
> > > >
> > > > /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
> > > > /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
> > > > ...
> > >
> > > Sorry, I was meaning what you first proposed:
> > >
> > > /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> > >
> > > Which has the extra sub directory to organize the child VFs.
> > >
> > > Keep in mind there is going to be alot of VFs here, > 1k - so this
> > > will be a huge directory.
> >
> > With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain
> > 1 + 1K files ("vf_total_msix" + 1 per VF).
> >
> > With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> > 1 file and 1K subdirectories.
>
> This is racy by design, in order to add new file and create BB:DD.F
> directory, the VF will need to do it after or during it's creation.
> During PF creation it is unknown to PF those BB:DD.F values.
>
> The race here is due to the events of PF,VF directory already sent
> but new directory structure is not ready yet.
>
> From code perspective, we will need to add something similar to
> pci_iov_sysfs_link() with the code that you didn't like in previous
> variants (the one that messes with sysfs_create_file API).
>
> It looks not good for large SR-IOV systems with >1K VFs with
> gazillion subdirectories inside PF, while the more natural is to see
> them in VF.
>
> So I'm completely puzzled why you want to do these files on PF and
> not on VF as v0, v7 and v8 proposed.
On both mlx5 and NVMe, the "assign vectors to VF" functionality is
implemented by the PF, so I think it's reasonable to explore the idea
of "associate the vector assignment sysfs file with the PF."
Assume 1K VFs. Either way we have >1K subdirectories of
/sys/devices/pci0000:00/. I think we should avoid an extra
subdirectory level, so I think the choices on the table are:
Associate "vf_msix_count" with the PF:
- /sys/.../<PF>/sriov/vf_total_msix # all on PF
- /sys/.../<PF>/sriov/vf_msix_count_BB:DD.F (1K of these). Greg
says the number of these is not a problem.
- The "vf_total_msix" and "vf_msix_count_*" files are all related
and are grouped together in PF/sriov/.
- The "vf_msix_count_*" files operate directly on the PF. Lock the
PF for serialization, lookup and lock the VF to ensure no VF
driver, call PF driver callback to assign vectors.
- Requires special sysfs code to create/remove "vf_msix_count_*"
files when setting/clearing VF Enable. This code could create
them only when the PF driver actually supports vector assignment.
Unavoidable sysfs/uevent race, see below.
Associate "vf_msix_count" with the VF:
- /sys/.../<PF>/sriov_vf_total_msix # on PF
- /sys/.../<VF>/sriov_vf_msix_count # on each VF
- The "sriov_vf_msix_count" files enter via the VF. Lock the VF to
ensure no VF driver, lookup and lock the PF for serialization,
call PF driver callback to assign vectors.
- Can be done with static sysfs attributes. This means creating
"sriov_vf_msix_count" *always*, even if PF driver doesn't support
vector assignment.
IIUC, putting "vf_msix_count_*" under the PF involves a race. When we
call device_add() for each new VF, it creates the VF sysfs directory
and emits the KOBJ_ADD uevent, but the "vf_msix_count_*" file doesn't
exist yet. It can't be created before device_add() because the sysfs
directory doesn't exist. If we create it after device_add(), the "add
VF" uevent has already been emitted, so userspace may consume it
before "vf_msix_count_*" is created.
sriov_enable
<set VF Enable> <-- VFs created on PCI
sriov_add_vfs
for (i = 0; i < num_vfs; i++) {
pci_iov_add_virtfn
pci_device_add
device_initialize
device_add
device_add_attrs <-- add VF sysfs attrs
kobject_uevent(KOBJ_ADD) <-- emit uevent
<-- add "vf_msix_count_*" sysfs attr
pci_iov_sysfs_link
pci_bus_add_device
pci_create_sysfs_dev_files
device_attach
}
Conceptually, I like having the "vf_total_msix" and "vf_msix_count_*"
files associated directly with the PF. I think that's more natural
because they both operate directly on the PF.
But I don't like the race, and using static attributes seems much
cleaner implementation-wise.
Bjorn
Powered by blists - more mailing lists