[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191112145542.GA6619@C02YVCJELVCG>
Date: Tue, 12 Nov 2019 09:55:42 -0500
From: Andy Gospodarek <andrew.gospodarek@...adcom.com>
To: Yuval Avnery <yuvalav@...lanox.com>
Cc: Jakub Kicinski <jakub.kicinski@...ronome.com>,
Jiri Pirko <jiri@...lanox.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Saeed Mahameed <saeedm@...lanox.com>,
"leon@...nel.org" <leon@...nel.org>,
"davem@...emloft.net" <davem@...emloft.net>,
"shuah@...nel.org" <shuah@...nel.org>,
Daniel Jurgens <danielj@...lanox.com>,
Parav Pandit <parav@...lanox.com>,
"andrew.gospodarek@...adcom.com" <andrew.gospodarek@...adcom.com>,
"michael.chan@...adcom.com" <michael.chan@...adcom.com>
Subject: Re: [PATCH net-next v2 00/10] devlink subdev
On Mon, Nov 11, 2019 at 06:46:52PM +0000, Yuval Avnery wrote:
>
>
> > -----Original Message-----
> > From: Jakub Kicinski <jakub.kicinski@...ronome.com>
> > Sent: Monday, November 11, 2019 10:00 AM
> > To: Yuval Avnery <yuvalav@...lanox.com>; Jiri Pirko <jiri@...lanox.com>
> > Cc: netdev@...r.kernel.org; Saeed Mahameed <saeedm@...lanox.com>;
> > leon@...nel.org; davem@...emloft.net; shuah@...nel.org; Daniel Jurgens
> > <danielj@...lanox.com>; Parav Pandit <parav@...lanox.com>;
> > andrew.gospodarek@...adcom.com; michael.chan@...adcom.com
> > Subject: Re: [PATCH net-next v2 00/10] devlink subdev
> >
> > On Fri, 8 Nov 2019 18:18:36 +0200, Yuval Avnery wrote:
> > > This patchset introduces devlink subdev.
> > >
> > > Currently, legacy tools do not provide a comprehensive solution that
> > > can be used in both SmartNic and non-SmartNic mode.
> > > Subdev represents a device that exists on the ASIC but is not
> > > necessarily visible to the kernel.
> > >
> > > Using devlink ports is not suitable because:
> > >
> > > 1. Those devices aren't necessarily network devices (such as NVMe devices)
> > > and doesn’t have E-switch representation. Therefore, there is need for
> > > more generic representation of PCI VF.
> > > 2. Some attributes are not necessarily pure port attributes
> > > (number of MSIX vectors)
> >
> > PCIe attrs will require persistence. Where is that in this design?
> >
> > Also MSIX vectors are configuration of the devlink port (ASIC side), the only
> > reason you're putting them in subdev is because some of the subdevs don't
> > have a port, muddying up the meaning of things.
>
> The way I see this, they are both on the ASIC side.
> But port has nothing to do with MSIX so subdev makes more sense.
>
> >
> > > 3. It creates a confusing devlink topology, with multiple port flavours
> > > and indices.
> > >
> > > Subdev will be created along with flavour and attributes.
> > > Some network subdevs may be linked with a devlink port.
> > >
> > > This is also aimed to replace "ip link vf" commands as they are
> > > strongly linked to the PCI topology and allow access only to enabled VFs.
> > > Even though current patchset and example is limited to MAC address of
> > > the VF, this interface will allow to manage PF, VF, mdev in SmartNic
> > > and non SmartNic modes, in unified way for networking and
> > > non-networking devices via devlink instance.
> > >
> > > Use case example:
> > > An example system view of a networking ASIC (aka SmartNIC), can be
> > > seen in below diagram, where devlink eswitch instance and PCI PF
> > > and/or VFs are situated on two different CPU subsystems:
> > >
> > >
> > > +------------------------------+
> > > | |
> > > | HOST |
> > > | |
> > > | +----+-----+-----+-----+ |
> > > | | PF | VF0 | VF1 | VF2 | |
> > > +---+----+-----------+-----+---+
> > > PCI1|
> > > +---+------------+
> > > |
> > > +----------------------------------------+
> > > | | SmartNic |
> > > | +----+-------------------------+ |
> > > | | | |
> > > | | NIC | |
> > > | | | |
> > > | +---------------------+--------+ |
> > > | | PCI2 |
> > > | +-----+---------+--+ |
> > > | | |
> > > | +-----+--+--+--------------+ |
> > > | | | PF | | |
> > > | | +-----+ | |
> > > | | Embedded CPU | |
> > > | | | |
> > > | +--------------------------+ |
> > > | |
> > > +----------------------------------------+
> > >
> > > The below diagram shows an example devlink subdev topology where
> > some
> > > subdevs are connected to devlink ports::
> > >
> > >
> > >
> > > (PF0) (VF0) (VF1) (NVME VF2)
> > > +--------------------------+ +--------+
> > > | devlink| devlink| devlink| | devlink|
> > > | subdev | subdev | subdev | | subdev |
> > > | 0 | 1 | 2 | | 3 |
> > > +--------------------------+ +--------+
> > > | | |
> >
> > What is this NVME VF2 connected to? It's gotta get traffic from somewhere?
> > Frames come in from the uplink, then what?
>
> The whole point here is that this is not a network device
> So it has nothing to do with the network.
> It is simply a PCI device exposed by the NIC to the host.
> devlink subdev lets us configure attributes for this device.
>
> >
> > > | | |
> > > | | |
> > > +----------------------------------+
> > > | | devlink| devlink| devlink| |
> > > | | port | port | port | |
> > > | | 0 | 1 | 2 | |
> > > | +--------------------------+ |
> > > | |
> > > | |
> > > | E-switch |
> > > | |
> > > | |
> > > | +--------+ |
> > > | | uplink | |
> > > | | devlink| |
> > > | | port | |
> > > +----------------------------------+
> > >
> > > Devlink command example:
> > >
> > > A privileged user wants to configure a VF's hw_addr, before the VF is
> > > enabled.
> > >
> > > $ devlink subdev set pci/0000:03:00.0/1 hw_addr 10:22:33:44:55:66
> > >
> > > $ devlink subdev show pci/0000:03:00.0/1
> > > pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr
> > > 10:22:33:44:55:66
> > >
> > > $ devlink subdev show pci/0000:03:00.0/1 -jp {
> > > "subdev": {
> > > "pci/0000:03:00.0/1": {
> > > "flavour": "pcivf",
> >
> > If the flavour is pcivf what differentiates this (Ethernet) VF from a NVME
> > one?
> >
>
> This describes the bus not the device purpose.
>
> > > "pf": 0,
> > > "vf": 0,
> > > "port_index": 1,
> > > "hw_addr": "10:22:33:44:55:66"
> >
> > Since you're messing with the "hw_addr", you should at least provision the
> > uAPI for adding multiple addresses. Intel guys were asking for this long time
> > ago.
>
> This is the "permanent" address. The one the subdev boots with.
>
> >
> > Have you considered implementing some compat code so drivers don't have
> > to implement the legacy ndos if they support subdevs?
>
> Will consider
>
> >
> > > }
> > > }
> > > }
> >
> > Okay, so you added two diagrams. I guess I was naive in thinking that "you
> > thought this all through in detail and have more documentation and design
> > docs internally".
> >
> > I don't like how unconstrained this is, the only implemented use case is
> > weak. But since you're not seeing this you probably never will, so seems like
> > a waste of time to fight it.
>
> What am I missing? How would you constrain this?
It feels a bit 'unconstrained' to me as well. As Jakub said you added
some documentation, but the direction of this long-term is not clear.
What seems to happen too often is that we skip creating better infra for
existing devices and create it only for the newest shiniest object.
These changes are what garners the most attention from those who grant
permission to push things upstream (*cough* managers *cough*), but it's
not the right choice in this case. I'm not sure if that is part of what
bothers Jakub, but it is one thing that bothers me and why this feels
incomplete.
The thing that has been bouncing around in my head about this (and until
I was in front of a good text-based MUA I didn't dare respond) is that
we seem to have an overlap in functionality between what you are
proposing and existing virtual device configuration, but you are
completely ignoring improving upon the existing creation methods.
Whether dealing with a SmartNIC (which I used to describe a NIC with
general purpose processors that could be running Linux and I think you
do, too) or a regular NIC it seems like we should use devlink to create
and configure a variety of devices these could be:
1. Number of PFs (there are cases where more than 1 PF per uplink port
is required for command/control of the SmartNIC or where a single
PCI b/d/f may have many uplink ports that need to be addressed
separately)
2. Device-specific SR-IOV VFs visible to server
3. mdev devices that _may_ have representers on the embedded cores of
the NIC
4. Hardware VirtIO-net devices supported
5. Other non-network devices (storage given as the first example)
6. ...
We cannot get rid of the methods for creating hardware VFs via sysfs,
but now that we are seeing lots of different flavors of devices that
might be created we should start by making sure at a minimum we can
create existing hardware devices (VFs) with this devlink interface and
move from there. Is there a plan to use this interface to create
subdevs that are VFs or just new subdev flavors? I could start to get
behind an interface like this if the patches showed that devlink would
be the one place where device creation and control could be done for all
types of subdevs, but that isn't clear that it is your direction based
on the patches.
So just to make sure this is clear, what I'm proposing that devlink is
used to create and configure all types of devices that it may be
possible to create, configure, or address from a PF and the starting
place should be standard, hardware VFs. If that can be done right and
we can address some of the issues with the current implementation (more
than one hw addr is a _great_ example) then adding new flavours for the
server devices and and adding new flavors for SmartNIC-based devices
should be easy and feel natural.
Powered by blists - more mailing lists