Message-ID: <fa7b6255-a20e-1a7d-2b9f-a2f409c9acac@mellanox.com>
Date: Mon, 28 Oct 2019 19:04:37 +0000
From: Yuval Avnery <yuvalav@...lanox.com>
To: Andy Gospodarek <andrew.gospodarek@...adcom.com>,
Jakub Kicinski <jakub.kicinski@...ronome.com>
CC: Jiri Pirko <jiri@...nulli.us>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Jiri Pirko <jiri@...lanox.com>,
Saeed Mahameed <saeedm@...lanox.com>,
"leon@...nel.org" <leon@...nel.org>,
"davem@...emloft.net" <davem@...emloft.net>,
"shuah@...nel.org" <shuah@...nel.org>,
Daniel Jurgens <danielj@...lanox.com>,
Michael Chan <michael.chan@...adcom.com>
Subject: Re: [PATCH net-next 0/9] devlink vdev
On 2019-10-25 7:58 a.m., Andy Gospodarek wrote:
> On Wed, Oct 23, 2019 at 07:51:41PM -0700, Jakub Kicinski wrote:
>> On Thu, 24 Oct 2019 00:11:48 +0000, Yuval Avnery wrote:
>>>>>> We need some proper ontology and decisions about what goes where.
>>>>>> We have half of the port attributes duplicated here, and hw_addr, which honestly
>>>>>> makes more sense in a port (since port is more of a networking
>>>>>> construct, why would ep storage have a hw_addr?). Then you say you're
>>>>>> going to dump more PCI stuff in here :(
>>>>> Well, basically, what this "vdev" is, is the "port peer" we discussed
>>>>> a couple of months ago. It provides the possibility for the user on
>>>>> bare metal to configure things for the VF, for example.
>>>>>
>>>>> Regarding hw_addr vs. port - it is not correct to make that a devlink
>>>>> port attribute. It is not the port's hw_addr, but the port peer's hw_addr.
>>>> Yeah, I remember us arguing with others that "the other side of the
>>>> wire" should not be a port.
>>>>
>>>>>> "vdev" sounds entirely meaningless, and has a high chance of becoming
>>>>>> a dumping ground for attributes.
>>>>> Sure, it is a made-up name. If you have a better name, please share.
>>>> IDK. I think I started the "peer" stuff, so it made sense to me.
>>>> Now it sounds like you'd like to kill a lot of problems with this
>>>> one stone. For PCIe, "vdev" is definitely wrong because some of the
>>>> config will be for the PF (which is not virtual). Also, for PCIe the
>>>> config has to be done with permanence in mind from day 1, since PCI
>>>> often requires a HW reset to reconfigure.
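>>>> (To illustrate with something that exists today: devlink params
>>>> already carry a notion of permanence via cmode, where a "permanent"
>>>> value is stored in NVM and only takes effect after a HW reset.
>>>> A hypothetical invocation, device address made up:
>>>>
>>>>   $ devlink dev param set pci/0000:03:00.0 name enable_sriov \
>>>>         value true cmode permanent
>>>>
>>>> Any vdev config would need a similar notion from day 1.)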
>>> The PF is "virtual" from the SmartNIC embedded CPU point of view.
>> We also want to configure PCIe on the local host through this in the
>> non-SmartNIC case; having "virtual" in the name would be confusing there.
>>
>>> Maybe gdev is better? (generic)
>> Let's focus on the scope and semantics of the object we are modelling
>> first. Can we talk goals, requirements, user scenarios etc.?
>>
>> IMHO the hw_addr use case is kind of weak; clouds usually do tunnelling,
>> so nobody cares which MAC the customer has assigned in the overlay.
>>
>> CCing Andy and Michael from Broadcom for their perspective and
>> requirements.
> Thanks, Jakub. I'm happy to chime in based on our deployment experience.
> We definitely understand the desire to be able to configure properties
> of devices on the SmartNIC (the kind with general-purpose cores, not the
> kind with only flow offload) from the server side.
>
> In addition to addressing NVMe devices, I'd also like to be able to
> create virtual or real serial ports, as there is an interest in
> *sometimes* being able to gain direct access to the SmartNIC console,
> not just a shell via ssh. So my point is that there are multiple use-cases.
>
> Arm is also _extremely_ interested in developing some form of SmartNIC
> discovery mechanism, and while lots of ideas have been thrown around,
> discovery via devlink is a reasonable option. So while
> doing all this will be much more work than simply handling this case
> where we set the peer or local MAC for a vdev, I think it will be worth
> it to make this more usable for all^W more types of devices. I also
> agree that not everything on the other side of the wire should be a
> port.
>
> So if we agree on addressing this device as a PCIe device, then it
> feels like we would be better served to query device capabilities and,
> depending on what capabilities exist, configure properties for those.
> In an ideal world, I could query a device using
> devlink ('devlink info'?) and it would show me different devices that
> are available for configuration on the SmartNIC and would also give me a
> way to address them. So while I like the idea of being able to address
> and set parameters as shown in patch 05 of this series, I would like to
> see a bit more flexibility to define what type of device is available
> and how it might be configured.
>
> So if we took the devlink info command as an example (whether it's the
> proper place for this or not), it could look _like_ this:
>
> $ devlink dev info pci/0000:03:00.0
> pci/0000:03:00.0:
>   driver foo
>   serial_number 8675309
>   versions:
>     [...]
>   capabilities:
>     storage 0
>     console 1
>     mdev 1024
>     [something else] [limit]
>
> (Additionally, rather than putting this as part of 'info', the device
> capabilities and limits could be part of the 'resource' section, which
> frankly may make more sense.)
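>
> Purely as a hypothetical sketch of how that could render under the
> existing 'resource' output - the resource names and limits below are
> made up:
>
>   $ devlink resource show pci/0000:03:00.0
>   pci/0000:03:00.0:
>     name console size 1 unit entry size_min 0 size_max 1
>     name mdev size 1024 unit entry size_min 0 size_max 1024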
>
> and then those capabilities would be something that could be set using the
> 'vdev' or whatever-it-is-named interface:
>
> # devlink vdev show pci/0000:03:00.0
> pci/0000:03:00.0/console/0: speed 115200 device /dev/ttySNIC0
> pci/0000:03:00.0/mdev/0: hw_addr 02:00:00:00:00:00
> [...]
> pci/0000:03:00.0/mdev/1023: hw_addr 02:00:00:00:03:ff
>
> # devlink vdev set pci/0000:03:00.0/mdev/0 hw_addr 00:22:33:44:55:00
I believe the flexibility you are looking for is already there.
The driver is free to attach any attribute to classify the vdev.
I think a simple vdev index plus attributes leaves more flexibility:
# devlink vdev show pci/0000:03:00.0
pci/0000:03:00.0/0 console 0 speed 115200 device /dev/ttySNIC0
pci/0000:03:00.0/1 mdev 0 hw_addr 02:00:00:00:00:00
[...]
pci/0000:03:00.0/1024 mdev 1023 hw_addr 02:00:00:00:03:ff
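A corresponding set would then address the vdev purely by its index
(hypothetical, mirroring the listing above):
# devlink vdev set pci/0000:03:00.0/1 hw_addr 00:22:33:44:55:00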
So besides that point, I think we are on the same page here?
> Since these Arm/RISC-V based SmartNICs are going to be used in a variety
> of different ways and will have a variety of different personalities
> (not just different SKUs that vendors will offer but different ways in
> which these will be deployed), I think it's critical that we consider
> more than just the mdev/representor case from the start.
>
>>>>> Basically, it is something that represents a VF/mdev - the other side
>>>>> of a devlink port. But in some cases, like NVMe, there is no associated
>>>>> devlink port - that is why "devlink port peer" would not work here.
>>>> What are the NVMe parameters we'd configure here? Queues etc. or some
>>>> IDs? Presumably there will be an NVMe-specific way to configure things?
>>>> Something has to point the NVMe VF to a backend, right?
>>>>
>>>> (I haven't looked much into NVMe myself in case that's not obvious ;))