Message-ID: <20191025145808.GA20298@C02YVCJELVCG.dhcp.broadcom.net>
Date: Fri, 25 Oct 2019 10:58:08 -0400
From: Andy Gospodarek <andrew.gospodarek@...adcom.com>
To: Jakub Kicinski <jakub.kicinski@...ronome.com>
Cc: Yuval Avnery <yuvalav@...lanox.com>, Jiri Pirko <jiri@...nulli.us>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Jiri Pirko <jiri@...lanox.com>,
Saeed Mahameed <saeedm@...lanox.com>,
"leon@...nel.org" <leon@...nel.org>,
"davem@...emloft.net" <davem@...emloft.net>,
"shuah@...nel.org" <shuah@...nel.org>,
Daniel Jurgens <danielj@...lanox.com>,
andrew.gospodarek@...adcom.com,
Michael Chan <michael.chan@...adcom.com>
Subject: Re: [PATCH net-next 0/9] devlink vdev
On Wed, Oct 23, 2019 at 07:51:41PM -0700, Jakub Kicinski wrote:
> On Thu, 24 Oct 2019 00:11:48 +0000, Yuval Avnery wrote:
> > >>> We need some proper ontology and decisions about what goes where.
> > >>> We have half of the port attributes duplicated here, and hw_addr,
> > >>> which honestly makes more sense in a port (since a port is more of
> > >>> a networking construct, why would ep storage have a hw_addr?). Then
> > >>> you say you're going to dump more PCI stuff in here :(
> > >> Well basically, what this "vdev" is is the "port peer" we discussed
> > >> a couple of months ago. It provides a way for the user on bare metal
> > >> to configure things for the VF - for example.
> > >>
> > >> Regarding hw_addr vs. port - it is not correct to make that a devlink
> > >> port attribute. It is not the port's hw_addr, but the port peer's
> > >> hw_addr.
> > > Yeah, I remember us arguing with others that "the other side of the
> > > wire" should not be a port.
> > >
> > >>> "vdev" sounds entirely meaningless, and has a high chance of becoming
> > >>> a dumping ground for attributes.
> > >> Sure, it is a made-up name. If you have a better name, please share.
> > > IDK. I think I started the "peer" stuff, so it made sense to me.
> > > Now it sounds like you'd like to kill a lot of problems with this
> > > one stone. For PCIe "vdev" is definitely wrong because some of the
> > > config will be for the PF (which is not virtual). Also for PCIe the
> > > config has to be done with permanence in mind from day 1; PCI often
> > > requires a HW reset to reconfigure.
> >
> > The PF is "virtual" from the SmartNic embedded CPU point of view.
>
> We also want to configure PCIe on the local host through this in the
> non-SmartNIC case; having "virtual" in the name would be confusing
> there.
>
> > Maybe gdev is better? (generic)
>
> Let's focus on the scope and semantics of the object we are modelling
> first. Can we talk goals, requirements, user scenarios etc.?
>
> IMHO the hw_addr use case is kind of weak; clouds usually do tunnelling,
> so nobody cares which MAC the customer has assigned in the overlay.
>
> CCing Andy and Michael from Broadcom for their perspective and
> requirements.

Thanks, Jakub. I'm happy to chime in based on our deployment experience.

We definitely understand the desire to be able to configure properties
of devices on the SmartNIC (the kind with general-purpose cores, not
the kind with only flow offload) from the server side.

In addition to addressing NVMe devices, I'd also like to be able to
create virtual or real serial ports, as there is interest in
*sometimes* gaining direct access to the SmartNIC console, not just a
shell via ssh. So my point is that there are multiple use-cases.

Arm are also _extremely_ interested in developing a method to enable
some form of SmartNIC discovery, and while lots of ideas have been
thrown around, discovery via devlink is a reasonable option. So while
doing all this will be much more work than simply handling the case
where we set the peer or local MAC for a vdev, I think it will be worth
it to make this more usable for all^W more types of devices. I also
agree that not everything on the other side of the wire should be a
port.

So if we agree on addressing this device as a PCIe device, then it
feels like we would be better served by querying device capabilities
and, depending on what capabilities exist, configuring properties for
those. In an ideal world, I could query a device using devlink
('devlink info'?) and it would show me the different devices that are
available for configuration on the SmartNIC and would also give me a
way to address them. So while I like the idea of being able to address
and set parameters as shown in patch 05 of this series, I would like to
see a bit more flexibility to define what type of device is available
and how it might be configured.

So if we took the devlink info command as an example (whether it's the
proper place for this or not), it could look _like_ this:

$ devlink dev info pci/0000:03:00.0
pci/0000:03:00.0:
    driver foo
    serial_number 8675309
    versions:
        [...]
    capabilities:
        storage 0
        console 1
        mdev 1024
        [something else] [limit]

(Additionally, rather than putting this as part of 'info', the device
capabilities and limits could be part of the 'resource' section, which
frankly may make more sense.)
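
To sketch what that alternative could look like (this output is made
up purely for illustration; the capability names and whether they fit
the resource model at all are open questions):

$ devlink resource show pci/0000:03:00.0
pci/0000:03:00.0:
    name console size 1
    name mdev size 1024
    name storage size 0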

Those capabilities would then be something that could be set using the
'vdev' (or whatever-it-is-named) interface:

# devlink vdev show pci/0000:03:00.0
pci/0000:03:00.0/console/0: speed 115200 device /dev/ttySNIC0
pci/0000:03:00.0/mdev/0: hw_addr 02:00:00:00:00:00
[...]
pci/0000:03:00.0/mdev/1023: hw_addr 02:00:00:00:03:ff

# devlink vdev set pci/0000:03:00.0/mdev/0 hw_addr 00:22:33:44:55:00
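
To illustrate the kind of flexibility I'm after (again, made-up syntax
rather than a concrete interface proposal), a non-networking capability
like the console could be configured through the same addressing
scheme:

# devlink vdev set pci/0000:03:00.0/console/0 speed 9600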

Since these Arm/RISC-V based SmartNICs are going to be used in a
variety of different ways and will have a variety of different
personalities (not just the different SKUs that vendors will offer,
but the different ways in which they will be deployed), I think it's
critical that we consider more than just the mdev/representor case
from the start.

> > >> Basically it is something that represents a VF/mdev - the other side
> > >> of a devlink port. But in some cases, like NVMe, there is no
> > >> associated devlink port - that is why "devlink port peer" would not
> > >> work here.
> > > What are the NVMe parameters we'd configure here? Queues etc. or some
> > > IDs? Presumably there will be an NVMe-specific way to configure
> > > things? Something has to point the NVMe VF to a backend, right?
> > >
> > > (I haven't looked much into NVMe myself in case that's not obvious ;))