netdev - Re: [RFC] current devlink extension plan for NICs

Message-ID: <50c0f739-592e-77a4-4872-878f99cc8b93@mellanox.com>
Date:   Tue, 31 Mar 2020 07:45:51 +0000
From:   Parav Pandit <parav@...lanox.com>
To:     Jakub Kicinski <kuba@...nel.org>
CC:     Jiri Pirko <jiri@...nulli.us>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "davem@...emloft.net" <davem@...emloft.net>,
        Yuval Avnery <yuvalav@...lanox.com>,
        "jgg@...pe.ca" <jgg@...pe.ca>,
        Saeed Mahameed <saeedm@...lanox.com>,
        "leon@...nel.org" <leon@...nel.org>,
        "andrew.gospodarek@...adcom.com" <andrew.gospodarek@...adcom.com>,
        "michael.chan@...adcom.com" <michael.chan@...adcom.com>,
        Moshe Shemesh <moshe@...lanox.com>,
        Aya Levin <ayal@...lanox.com>,
        Eran Ben Elisha <eranbe@...lanox.com>,
        Vlad Buslov <vladbu@...lanox.com>,
        Yevgeny Kliteynik <kliteyn@...lanox.com>,
        "dchickles@...vell.com" <dchickles@...vell.com>,
        "sburla@...vell.com" <sburla@...vell.com>,
        "fmanlunas@...vell.com" <fmanlunas@...vell.com>,
        Tariq Toukan <tariqt@...lanox.com>,
        "oss-drivers@...ronome.com" <oss-drivers@...ronome.com>,
        "snelson@...sando.io" <snelson@...sando.io>,
        "drivers@...sando.io" <drivers@...sando.io>,
        "aelior@...vell.com" <aelior@...vell.com>,
        "GR-everest-linux-l2@...vell.com" <GR-everest-linux-l2@...vell.com>,
        "grygorii.strashko@...com" <grygorii.strashko@...com>,
        mlxsw <mlxsw@...lanox.com>, Ido Schimmel <idosch@...lanox.com>,
        Mark Zhang <markz@...lanox.com>,
        "jacob.e.keller@...el.com" <jacob.e.keller@...el.com>,
        Alex Vesker <valex@...lanox.com>,
        "linyunsheng@...wei.com" <linyunsheng@...wei.com>,
        "lihong.yang@...el.com" <lihong.yang@...el.com>,
        "vikas.gupta@...adcom.com" <vikas.gupta@...adcom.com>,
        "magnus.karlsson@...el.com" <magnus.karlsson@...el.com>
Subject: Re: [RFC] current devlink extension plan for NICs

On 3/31/2020 1:06 AM, Jakub Kicinski wrote:
> On Mon, 30 Mar 2020 07:48:39 +0000 Parav Pandit wrote:
>> On 3/27/2020 10:08 PM, Jakub Kicinski wrote:
>>> On Fri, 27 Mar 2020 08:47:36 +0100 Jiri Pirko wrote:  
>>>>> So the queues, interrupts, and other resources are also part 
>>>>> of the slice then?    
>>>>
>>>> Yep, that seems to make sense.
>>>>  
>>>>> How do slice parameters like rate apply to NVMe?    
>>>>
>>>> Not really.
>>>>  
>>>>> Are ports always ethernet? and slices also cover endpoints with
>>>>> transport stack offloaded to the NIC?    
>>>>
>>>> devlink_port now can be either "ethernet" or "infiniband". Perhaps,
>>>> there can be port type "nve" which would contain only some of the
>>>> config options and would not have a representor "netdev/ibdev" linked.
>>>> I don't know.  
>>>
>>> I honestly find it hard to understand what that slice abstraction is,
>>> and which things belong to slices and which to PCI ports (or why we even
>>> have them).
>>>   
>> In an alternative, devlink port can be overloaded/retrofit to do all
>> things that slice desires to do.
> 
> I wouldn't say retrofitted, in my mind port has always been a port of 
> a device.
> 
But here a networking device is being created on the host system, and it
has a connection to an eswitch port.

> Jiri explained to me that to Mellanox port is port of a eswitch, not
> port of a device. While to me (/Netronome) it was any way to send or
> receive data to/from the device.
> 
> Now I understand why to you nvme doesn't fit the port abstraction.
> 
ok. Great.

>> For that matter representor netdev can be overloaded/extended to do what
>> slice desire to do (instead of devlink port).
> 
> Right, in my mental model representor _is_ a port of the eswitch, so
> repr would not make sense to me.
>
Right. So the eswitch devlink port flavours (pcipf, pcivf) are also not
the right object to use, as they represent the eswitch side.

So either we create a new devlink port flavour which faces the host
and run the state machine for those devlink ports, or we create a more
refined object, the slice, and anchor things there.

>> Can you please explain why you think devlink port should be overloaded
>> instead of netdev or any other kernel object?
>> Do you have an example of such overloaded functionality of a kernel object?
>> Like why macvlan and vlan drivers are not combined to in single driver
>> object? Why teaming and bonding driver are combined in single driver
>> object?...
> 
> I think it's not overloading, but the fact that we started with
> different definitions. We (me and you) tried adding the PCIe ports
> around the same time, I guess we should have dug into the details
> right away.
>
Yes. :-)

>> User should be able to create, configure, deploy, delete a 'portion of
>> the device' with/without eswitch.
> 
> Right, to me ports are of the device, not eswitch.
> 
True. We are aligned here.

>> We shouldn't be starting with restrictive/narrow view of devlink port.
>>
>> Internally with Jiri and others, we also explored the possibility to
>> have 'mgmtvf', 'mgmtpf',  'mgmtsf' port flavours by overloading port to
>> do all things as that of slice.
>> It wasn't elegant enough. Why not create right object?
> 
> We just need clear definitions of what goes where. 
Yes.
The proposal is straightforward here.
That is,
(a) if a user wants to control/monitor params of the PF/VF/SF which
face the particular function (PF/VF/SF), such as mac, irq, num_qs,
state machine etc., those are anchored at the slice (portion of the
device) level.
I detailed the whole plumbing in the extended RFC content in the
thread yesterday.

(b) if a user wants to control/monitor params which face the eswitch,
they are handled either through the representor netdev or through the
eswitch side devlink port.
For example, the eswitch PCI VF's internal flow table should be exposed
via dpipe linked to the eswitch devlink port.
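
To make the (a)/(b) split concrete, here is a minimal sketch in C. It is
not kernel code: the enum, the table and the param names (including
"flow_table") are hypothetical and only mirror the examples given above.

```c
#include <string.h>

/* Hypothetical anchor points, following the (a)/(b) split in this mail. */
enum demo_anchor {
	DEMO_ANCHOR_SLICE,        /* (a) function facing: portion of the device */
	DEMO_ANCHOR_ESWITCH_PORT, /* (b) eswitch facing: devlink port or representor */
	DEMO_ANCHOR_UNKNOWN,      /* not covered by this sketch */
};

struct demo_param {
	const char *name;
	enum demo_anchor anchor;
};

/* Illustrative names only; these are not real devlink parameter names. */
static const struct demo_param demo_params[] = {
	{ "mac",        DEMO_ANCHOR_SLICE },
	{ "irq",        DEMO_ANCHOR_SLICE },
	{ "num_qs",     DEMO_ANCHOR_SLICE },
	{ "state",      DEMO_ANCHOR_SLICE },
	{ "flow_table", DEMO_ANCHOR_ESWITCH_PORT },
};

/* Look up where a param would be anchored under the proposal. */
static enum demo_anchor demo_param_anchor(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(demo_params) / sizeof(demo_params[0]); i++)
		if (strcmp(demo_params[i].name, name) == 0)
			return demo_params[i].anchor;
	return DEMO_ANCHOR_UNKNOWN;
}
```

For instance, demo_param_anchor("mac") yields DEMO_ANCHOR_SLICE, while
demo_param_anchor("flow_table") yields DEMO_ANCHOR_ESWITCH_PORT.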


> We already have
> params etc. hanging off the ports, including irq/sriov stuff. But in
> slice model those don't belong there :S
> 
I looked at the DaveM net-next tree today.
The only driver that uses devlink port params is bnxt, and even this
driver registers an empty array of port parameters.
The sriov/irq stuff currently hangs off the devlink device level, for
its own device.
Can you please provide a link to code that uses devlink port params?

> In fact very little belongs to the port in that model. So why have
> PCI ports in the first place?
>
For a few reasons.
1. PCI ports establish the relationship between an eswitch port and
its representor netdevice.
Relying on the plain netdev name doesn't work in certain PCI topologies
where the netdev name exceeds 15 characters.
2. Health reporters can be at the port level.
3. In future, at the eswitch PCI port, I will be adding dpipe support
for the internal flow tables done by the driver.
4. There was inconsistency among vendor drivers in using/abusing
phys_port_name of the eswitch ports. This is consolidated via devlink
port in core, which provides a consistent view across all vendor drivers.

So PCI eswitch side ports are useful regardless of slice.
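
The 15 character limit in reason 1 comes from the fixed size netdev name
buffer (IFNAMSIZ is 16 on Linux, including the terminating NUL). A small
sketch of the problem follows; the "<parent>_<port_name>" naming scheme
and the helper name are assumptions made up for illustration, not what
any driver or udev rule actually does.

```c
#include <stdio.h>

#define DEMO_IFNAMSIZ 16 /* mirrors Linux IFNAMSIZ: 15 characters plus NUL */

/*
 * Returns 1 if the hypothetical name "<parent>_<port_name>" fits in a
 * netdev name buffer, 0 if it would be truncated.
 */
static int demo_repr_name_fits(const char *parent, const char *port_name)
{
	char buf[DEMO_IFNAMSIZ];
	/* snprintf returns the length the full string would have needed. */
	int n = snprintf(buf, sizeof(buf), "%s_%s", parent, port_name);

	return n >= 0 && n < (int)sizeof(buf);
}
```

With a short parent such as "eth0" a suffix like "pf0vf1" fits, but a
deeper PCI topology name such as "enp94s0f0" plus "pf0vf12" needs 17
characters and does not.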

>> Additionally devlink port object doesn't go through the same state
>> machine as that what slice has to go through.
>> So its weird that some devlink port has state machine and some doesn't.
> 
> You mean for VFs? I think you can add the states to the API.
> 
As we agreed above, eswitch side objects (devlink port and
representor netdev) should not be used for a 'portion of the device'.

So we certainly need to create either
(a) new devlink ports with host facing flavour(s), and run the state
machine for them
or
(b) a new devlink slice object that represents the 'portion of the device'.

We can add the state machine to the port. However, it suffers from the
issue that certain flavours such as physical, dsa, eswitch ports etc.
don't have a notion of state machine, attachment to driver etc.

This is where I find that we are overloading the port beyond its
current definition. And the extensions don't seem likely to become
applicable to those ports in the future.

A 'portion of the device' as an individual object that can optionally
be linked to an eswitch port made more sense (like how a devlink port
optionally links to a representor).
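
As a rough sketch of option (b): the state lives only in the slice, the
port object carries no state at all, and the link to an eswitch port is
optional. All type and function names are hypothetical, and the two
states are placeholders since the actual state machine is not spelled
out in this mail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flavours; not the real devlink flavour enum. */
enum demo_port_flavour {
	DEMO_FLAVOUR_PHYSICAL,
	DEMO_FLAVOUR_DSA,
	DEMO_FLAVOUR_PCI_PF,
	DEMO_FLAVOUR_PCI_VF,
};

struct demo_port {
	enum demo_port_flavour flavour;
	/* deliberately no state field: ports have no state machine here */
};

/* Placeholder states; the real state machine may have more. */
enum demo_slice_state {
	DEMO_SLICE_INACTIVE,
	DEMO_SLICE_ACTIVE,
};

struct demo_slice {
	enum demo_slice_state state;
	struct demo_port *eswitch_port; /* optional; NULL without eswitch */
};

/* Move an inactive slice to active; reject any other transition. */
static bool demo_slice_activate(struct demo_slice *slice)
{
	if (slice->state != DEMO_SLICE_INACTIVE)
		return false;
	slice->state = DEMO_SLICE_ACTIVE;
	return true;
}

static bool demo_slice_has_eswitch_port(const struct demo_slice *slice)
{
	return slice->eswitch_port != NULL;
}
```

The point of the layout is that physical, dsa and eswitch ports never
grow fields that do not apply to them.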

>>> With devices like NFP and Mellanox CX3 which have one PCI PF maybe it
>>> would have made sense to have a slice that covers multiple ports, but
>>> it seems the proposal is to have port to slice mapping be 1:1. And rate
>>> in those devices should still be per port not per slice.
>>>   
>> Slice can have multiple ports. slice object doesn't restrict it. User
>> can always split the port for a device, if device support it.
> 
> Okay, so slices are not 1:1 with ports, then?  Is it any:any?
> 
A slice can attach to one or more eswitch ports, if the slice wants to
support eswitch offloads etc.

A slice without eswitch can have zero eswitch ports linked to it.
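
In data structure terms this is a zero-or-more relation from slice to
eswitch ports. A minimal sketch, assuming ports are referred to by an
index; the struct, the helpers and the fixed bound are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_PORTS 4 /* arbitrary bound for this sketch */

struct demo_slice_links {
	unsigned int port_index[DEMO_MAX_PORTS]; /* linked eswitch port indices */
	size_t num_ports;                        /* 0 means no eswitch ports */
};

/* Link one more eswitch port to the slice; fails when the table is full. */
static bool demo_slice_link_port(struct demo_slice_links *slice,
				 unsigned int index)
{
	if (slice->num_ports == DEMO_MAX_PORTS)
		return false;
	slice->port_index[slice->num_ports++] = index;
	return true;
}

/* Eswitch offloads only make sense with at least one linked port. */
static bool demo_slice_supports_eswitch_offloads(const struct demo_slice_links *slice)
{
	return slice->num_ports > 0;
}
```

A slice that never calls demo_slice_link_port() is still a valid slice;
it simply has no eswitch side.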