[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <997dbf25-a3e1-168c-c756-b33e79e7c51e@mellanox.com>
Date: Sat, 21 Mar 2020 09:07:30 +0000
From: Parav Pandit <parav@...lanox.com>
To: Jakub Kicinski <kuba@...nel.org>, Jiri Pirko <jiri@...nulli.us>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"davem@...emloft.net" <davem@...emloft.net>,
Yuval Avnery <yuvalav@...lanox.com>,
"jgg@...pe.ca" <jgg@...pe.ca>,
Saeed Mahameed <saeedm@...lanox.com>,
"leon@...nel.org" <leon@...nel.org>,
"andrew.gospodarek@...adcom.com" <andrew.gospodarek@...adcom.com>,
"michael.chan@...adcom.com" <michael.chan@...adcom.com>,
Moshe Shemesh <moshe@...lanox.com>,
Aya Levin <ayal@...lanox.com>,
Eran Ben Elisha <eranbe@...lanox.com>,
Vlad Buslov <vladbu@...lanox.com>,
Yevgeny Kliteynik <kliteyn@...lanox.com>,
"dchickles@...vell.com" <dchickles@...vell.com>,
"sburla@...vell.com" <sburla@...vell.com>,
"fmanlunas@...vell.com" <fmanlunas@...vell.com>,
Tariq Toukan <tariqt@...lanox.com>,
"oss-drivers@...ronome.com" <oss-drivers@...ronome.com>,
"snelson@...sando.io" <snelson@...sando.io>,
"drivers@...sando.io" <drivers@...sando.io>,
"aelior@...vell.com" <aelior@...vell.com>,
"GR-everest-linux-l2@...vell.com" <GR-everest-linux-l2@...vell.com>,
"grygorii.strashko@...com" <grygorii.strashko@...com>,
mlxsw <mlxsw@...lanox.com>, Ido Schimmel <idosch@...lanox.com>,
Mark Zhang <markz@...lanox.com>,
"jacob.e.keller@...el.com" <jacob.e.keller@...el.com>,
Alex Vesker <valex@...lanox.com>,
"linyunsheng@...wei.com" <linyunsheng@...wei.com>,
"lihong.yang@...el.com" <lihong.yang@...el.com>,
"vikas.gupta@...adcom.com" <vikas.gupta@...adcom.com>,
"magnus.karlsson@...el.com" <magnus.karlsson@...el.com>
Subject: Re: [RFC] current devlink extension plan for NICs
On 3/21/2020 2:55 AM, Jakub Kicinski wrote:
> On Fri, 20 Mar 2020 08:35:55 +0100 Jiri Pirko wrote:
>> Fri, Mar 20, 2020 at 04:32:53AM CET, kuba@...nel.org wrote:
>>> On Thu, 19 Mar 2020 20:27:19 +0100 Jiri Pirko wrote:
>> I'm not sure I follow. There is a number of PFs created "on probe".
>> Those are fixed and driver knows where to put them.
>> The comment was about possible "dynamic PF" created by user in the same
>> was he creates SF, by devlink cmdline.
>
> How does the driver differentiate between a dynamic and static PF,
> and why are they different in the first place? :S
>
I don't see a need of PF driver to differentiate them as static/dynamic.
In either case pci probe(), remove() are triggered regardless.
They are different because a smartnic administrator wants to create them
dynamically on user request and provide to the system where this
smartnic is connected.
> Also, once the PFs are created user may want to use them together
>>>> $ devlink port show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 0
>>>> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 type eth netdev enp6s0pf0vf1 slice 1
>>>> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 type eth netdev enp6s0pf1vf0 slice 2
>>>> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 type eth netdev enp6s0pf1vf1 slice 3
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
>>>> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr 10:22:33:44:55:77 state active
>>>> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
>>>> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
>>>> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2
>>>>
>>>> In these 2 outputs you can see the relationships. Attributes "slice" and "port"
>>>> indicate the slice-port pairs.
>>>>
>>>> Also, there is a fixed "state" attribute with value "active". This is by
>>>> default as the VFs are always created active. In future, it is planned
>>>> to implement manual VF creation and activation, similar to what is below
>>>> described for SFs.
>>>>
>>>> Note that for non-ethernet slice (the last one) does not have any
>>>> related port port. It can be for example NVE or IB. But since
>>>> the "hw_addr" attribute is also omitted, it isn't IB.
>>>>
>>>>
>>>> Now set a different MAC address for VF1 on PF0:
>>>> $ devlink slice set pci/0000:06:00.0/3 hw_addr aa:bb:cc:dd:ee:ff
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66 state active
>>>> pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 port 3 hw_addr aa:bb:cc:dd:ee:ff state active
>>>> pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 port 4 hw_addr 10:22:33:44:55:88 state active
>>>> pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 port 5 hw_addr 10:22:33:44:55:99 state active
>>>> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2
>>>
>>> What are slices?
>>
>> Slice is basically a piece of ASIC. pf/vf/sf. They serve for
>> configuration of the "other side of the wire". Like the mac. Hypervizor
>> admin can use the slite to set the mac address of a VF which is in the
>> virtual machine. Basically this should be a replacement of "ip vf"
>> command.
>
> I lost my mail archive but didn't we already have a long thread with
> Parav about this?
>
I have more verbose RFC that talks about motivation of slice and similar
examples which I will post along with actual patches.
But I think Jiri captured thie jist already here, i.e. slice drives the
device life cycle (with/without eswitch).
slice device activation model is more than just port unplug/plug event.
for which we shouldn't overload the devlink port which doesn't undergo
state transition explained for slice.
>>>> ==================================================================
>>>> || ||
>>>> || SF (subfunction) user cmdline API draft ||
>>>> || ||
>>>> ==================================================================
>>>>
>>>> Note that some of the "devlink port" attributes may be forgotten or misordered.
>>>>
>>>> Note that some of the "devlink slice" attributes in show commands
>>>> are omitted on purpose.
>>>>
>>>> $ devlink port show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>>>>
>>>> There is one VF on the NIC.
>>>>
>>>>
>>>> Now create subfunction of SF0 on PF1, index of the slice is going to be 100
>>>> and hw_address aa:bb:cc:aa:bb:cc.
>>>>
>>>> $ devlink slice add pci/0000.06.00.0/100 flavour pcisf pfnum 1 sfnum 10 hw_addr aa:bb:cc:aa:bb:cc
>>>
>>> Why is the SF number specified by the user rather than allocated?
>>
>> Because it is snown in representor netdevice name. And you need to have
>> it predetermined: enp6s0pf1sf10
>
> I'd think you need to know what was assigned, not necessarily pick
> upfront.. I feel like we had this conversation before as well.
>
Assigning by user ensures that netdev and/or rdma device of a slice (sub
function), its representor, health reporters, devlink instance of the
slice gets predictable identity.
>>>> The devlink kernel code calls down to device driver (devlink op) and asks
>>>> it to create a slice with particular attributes. Driver then instatiates
>>>> the slice and port in the same way it is done for VF:
>>>>
>>>> $ devlink port show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0 slice 2
>>>> pci/0000:06:00.0/3: flavour pcisf pfnum 1 sfnum 10 type eth netdev enp6s0pf1sf10 slice 100
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>>>> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state inactive
>>>>
>>>> Note that the SF slice is created but not active. That means the
>>>> entities are created on devlink side, the e-switch port representor
>>>> is created, but the SF device itself it not yet out there (same host
>>>> or different, depends on where the parent PF is - in this case the same host).
>>>> User might use e-switch port representor enp6s0pf1sf10 to do settings,
>>>> putting it into bridge, adding TC rules, etc.
>>>> It's like the cable is unplugged on the other side.
>>>
>>> If it's just "cable unplugged" can't we just use the fact the
>>> representor is down to indicate no traffic can flow?
>>
>> It is not "cable unplugged". This "state inactive/action" is admin
>> state. You as a eswitch admin say, "I'm done with configuring a slice
>> (MAC) and a representor (bridge, TC, etc) for this particular SF and
>> I want the HOST to instantiate the SF instance (with the configured
>> MAC).
>
> I'm not opposed, I just don't understand the need. If ASIC will not
> RX or TX any traffic from/to this new entity until repr is brought up
> there should be no problem.
>
You may still want device to be active and unplug the cable or plug the
cable later on.
Additionally, once a slice is inactivated, portion of the PCI BAR space
should be inactivated too used by this slice, so that a device/fw can
drop such access by an untrusted software that may happen even after
inactivating the slice.
>>>> Now we activate (deploy) the SF:
>>>> $ devlink slice set pci/0000:06:00.0/100 state active
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 port 2 hw_addr 10:22:33:44:55:66
>>>> pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 port 3 hw_addr aa:bb:cc:aa:bb:cc state active
>>>>
>>>> Upon the activation, the device driver asks the device to instantiate
>>>> the actual SF device on particular PF. Does not matter if that is
>>>> on the same host or not.
>>>>
>>>> On the other side, the PF driver instance gets the event from device
>>>> that particular SF was activated. It's the cue to put the device on bus
>>>> probe it and instantiate netdev and devlink instances for it.
>>>
>>> Seems backwards. It's the PF that wants the new function, why can't it
>>> just create it and either get an error from the other side or never get
>>> link up?
>>
>> We discussed that many times internally. I think it makes sense that
>> the SF is created by the same entity that manages the related eswitch
>> SF-representor. In other words, the "devlink slice" and related "devlink
>> port" object are under the same devlink instance.
>>
>> If the PF in host manages nested switch, it can create the SF inside and
>> manages the neste eswitch SF-representor as you describe.
>>
>> It is a matter of "nested eswitch manager on/off" configuration.
>>
>> I think this is clean model and it is known who has what
>> responsibilities.
>
> I see so you want the creation to be controlled by the same entity that
> controls the eswitch..
>
> To me the creation should be on the side that actually needs/will use
> the new port. And if it's not eswitch manager then eswitch manager
> needs to ack it.
>
There are few reasons to create them on eswitch manager system as below.
1. Creation and deletion on one system and synchronizing it with eswitch
system requires multiple back-n-forth calls between two systems.
2. When this happens, system where its created, doesn't know when is the
right time to provision to a VM or to a application.
udev/systemd/Network Manager and others such software might already
start initializing it doing DHCP but its switch side is not yet ready.
So it is desired to make sure that once device is fully
ready/configured, its activated.
3. Additionally it doesn't follow mirror sequence during deletion when
created on host.
4. eswitch administrator simply doesn't have direct access to the system
where this device is used. So it just cannot be created there by eswitch
administrator.
> The precedence is probably not a strong argument, but that'd be the
> same way VFs work.. I don't think you can change how VFs work, right?
>
At least not with the existing available mlx5 devices. But this model
enables future VF devices to operate in same way as SF.
>>>> ==================================================================
>>>> || ||
>>>> || VF manual creation and activation user cmdline API draft ||
>>>> || ||
>>>> ==================================================================
>>>>
>>>> To enter manual mode, the user has to turn off VF dummies creation:
>>>> $ devlink dev set pci/0000:06:00.0 vf_dummies disabled
>>>> $ devlink dev show
>>>> pci/0000:06:00.0: vf_dummies disabled
>>>>
>>>> It is "enabled" by default in order not to break existing users.
>>>>
>>>> By setting the "vf_dummies" attribute to "disabled", the driver
>>>> removes all dummy VFs. Only physical ports are present:
>>>>
>>>> $ devlink port show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>>>>
>>>> Then the user is able to create them in a similar way as SFs:
>>>>
>>>> $ devlink slice add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8 hw_addr aa:bb:cc:aa:bb:cc
>>>>
>>>> The devlink kernel code calls down to device driver (devlink op) and asks
>>>> it to create a slice with particular attributes. Driver then instatiates
>>>> the slice and port:
>>>>
>>>> $ devlink port show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>>>> pci/0000:06:00.0/2: flavour pcivf pfnum 1 vfnum 8 type eth netdev enp6s0pf1vf0 slice 99
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 port 1 state active
>>>> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state inactive
>>>>
>>>> Now we activate (deploy) the VF:
>>>> $ devlink slice set pci/0000:06:00.0/99 state active
>>>>
>>>> $ devlink slice show
>>>> pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 port 2 hw_addr aa:bb:cc:aa:bb:cc state active
Powered by blists - more mailing lists