Message-ID: <7d2b9a8e-07b3-093c-95d1-ab165b56c903@mellanox.com>
Date: Mon, 30 Mar 2020 09:07:47 +0000
From: Parav Pandit <parav@...lanox.com>
To: Jakub Kicinski <kuba@...nel.org>,
Saeed Mahameed <saeedm@...lanox.com>
CC: "sridhar.samudrala@...el.com" <sridhar.samudrala@...el.com>,
Aya Levin <ayal@...lanox.com>,
"andrew.gospodarek@...adcom.com" <andrew.gospodarek@...adcom.com>,
"sburla@...vell.com" <sburla@...vell.com>,
"jiri@...nulli.us" <jiri@...nulli.us>,
Tariq Toukan <tariqt@...lanox.com>,
"davem@...emloft.net" <davem@...emloft.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Vlad Buslov <vladbu@...lanox.com>,
"lihong.yang@...el.com" <lihong.yang@...el.com>,
Ido Schimmel <idosch@...lanox.com>,
"jgg@...pe.ca" <jgg@...pe.ca>,
"fmanlunas@...vell.com" <fmanlunas@...vell.com>,
"oss-drivers@...ronome.com" <oss-drivers@...ronome.com>,
"leon@...nel.org" <leon@...nel.org>,
"grygorii.strashko@...com" <grygorii.strashko@...com>,
"michael.chan@...adcom.com" <michael.chan@...adcom.com>,
Alex Vesker <valex@...lanox.com>,
"snelson@...sando.io" <snelson@...sando.io>,
"linyunsheng@...wei.com" <linyunsheng@...wei.com>,
"magnus.karlsson@...el.com" <magnus.karlsson@...el.com>,
"dchickles@...vell.com" <dchickles@...vell.com>,
"jacob.e.keller@...el.com" <jacob.e.keller@...el.com>,
Moshe Shemesh <moshe@...lanox.com>,
Mark Zhang <markz@...lanox.com>,
"aelior@...vell.com" <aelior@...vell.com>,
Yuval Avnery <yuvalav@...lanox.com>,
"drivers@...sando.io" <drivers@...sando.io>,
mlxsw <mlxsw@...lanox.com>,
"GR-everest-linux-l2@...vell.com" <GR-everest-linux-l2@...vell.com>,
Yevgeny Kliteynik <kliteyn@...lanox.com>,
"vikas.gupta@...adcom.com" <vikas.gupta@...adcom.com>,
Eran Ben Elisha <eranbe@...lanox.com>
Subject: Re: [RFC] current devlink extension plan for NICs
On 3/28/2020 2:12 AM, Jakub Kicinski wrote:
> On Fri, 27 Mar 2020 19:45:53 +0000 Saeed Mahameed wrote:
>> from what i understand, a real slice is a full isolated HW pipeline
>> with its own HW resources and HW based isolation, a slice rings/hw
>> resources can never be shared between different slices, just like a vf,
>> but without the pcie virtual function back-end..
>> We need a clear-cut definition of what a Sub-function slice is.. this
>> RFC doesn't seem to address that clearly.
>
> Definitely. I'd say we need a clear definition of (a) what a
> sub-functions is, and (b) what a slice is.
>
Below is the extended content for Jiri's RFC that addresses Saeed's point
on a clear definition of slice, sub-function and their plumbing.
A few things, such as diagrams and examples, were already defined in
Jiri's RFC, but everything is put together below for completeness.
This RFC extension covers overall requirements, design, alternatives
considered and rationale for current design choices to support
sub-functions using slices.
Overview:
---------
A user wants to use multiple netdev and rdma devices of a PCI device
without using PCI SR-IOV. This is done by creating slices of a
PCI device using devlink tool.
Requirements:
-------------
1. Able to create multiple slices of a device
2. Able to provision such a slice for in-kernel use
3. Able to have persistent names for the netdev and rdma device of a
slice
4. Reuse the current power management (suspend/resume) kernel infra for
slice devices; do not reinvent locking etc. in each driver or devlink
5. Able to control, steer and offload network traffic of slices using
the existing devlink eswitch switchdev support
6. Do not abruptly change the existing PCI PF/VF naming scheme by
introducing slices
7. Support a slice of a device in an iommu-enabled kernel
8. Dynamically create/delete a slice
9. Ability to configure a slice before deploying it
10. Get/set slice attributes, like PCI VF slice attributes
11. Reuse/extend virtbus for carrying SF slice devices
12. Hot-plug a slice device in the host system from the NIC
(eswitch system) side without running any agent in the
host system
13. Have a unified interface for slice management regardless of
deploying on the eswitch system or on the host system
14. The user must be able to create a portion of the device from the
eswitch system and attach it to an untrusted host system. This host
system is not accessible to the eswitch system for the purposes of
device life cycle and initial configuration.
Slice:
------
A slice represents a portion of the device. A slice is a generic
object that represents either a PCI PF, a PCI VF, or a PCI
sub-function (SF), described below.
Sub-function (SF):
------------------
- A sub-function is a portion of the PCI device that supports multiple
classes of devices such as netdev, rdma and more.
- An SF netdev has its own dedicated queues (txq, rxq).
- An SF rdma device has its own QP1, GID table and rdma resources.
SF rdma resources have their own resource namespace.
- An SF supports eswitch representation and full offload support
when it is running with eswitch support.
- The user must configure the eswitch to send/receive packets for an SF.
- An SF shares PCI-level resources with other SFs and/or with its
parent PCI function.
For example, an SF shares IRQ vectors with other SFs and its
PCI function.
In the future it may have a dedicated IRQ vector per SF.
An SF has a dedicated window in PCI BAR space that is not shared
with other SFs or the PF. This ensures that when an SF is assigned to
an application, only that application can access device resources.
Overall design:
---------------
A new flavour of slice is created that represents a portion of the
device. It is equivalent to a PCI VF slice, but the new slice exists
without PCI SR-IOV. This can scale to possibly more than the number
of SR-IOV VFs. A new slice flavour 'pcisf' is introduced, which is
explained later in this document.
The devlink subsystem is extended to create, delete and deploy a slice
using a devlink instance.
The slice life cycle is driven by the devlink commands explained below.
(a) The life cycle consists of 3 main commands, i.e. add, delete and
state change.
(b) Add/delete commands create or delete the slice of a device
respectively.
(c) A slice undergoes one or more configuration steps before it is
activated. This may include:
(a) slice hardware address configuration
(b) representor netdevice configuration
(c) network policy and steering configuration through the representor
(d) Once a slice is fully configured, the user activates it.
Slice activation triggers device enumeration and binding to the
driver.
(e) Each slice goes through a state transition during its life cycle.
(f) Each slice's admin state is controlled by the user. The slice
operational state is updated based on driver attach/detach events on
the host system, as follows:
admin state description
----------- ------------
1. inactive A slice enters this state when it is newly created.
The user typically does most of the configuration in
this state before activating the slice.
2. active State when the slice has just been activated by the user.
operational description
state
----------- ------------
1. attached State when the slice device is bound to the host
driver. When the slice device is unbound from the
host driver, the slice device exits this state and
enters the detaching state.
2. detaching State when the host is notified to deactivate the
slice device and the slice device may be undergoing
detachment from the host driver. When the slice device
is fully detached from the host driver, the slice
exits this state and enters the detached state.
3. detached State the slice enters once the driver is fully
unbound.
slice state machine:
--------------------
slice state set inactive
----<------------------<---
| or slice delete |
| |
__________ ____|_______ ____|_______
| | slice add | |slice state | |
| |-------->---| |------>-----| |
| invalid | | inactive | set active | active |
| | slice del | | | |
|__________|--<---------|____________| |____________|
slice device operational state machine:
---------------------------------------
__________ ____________ ___________
| | slice state | |driver bus | |
| invalid |-------->-----| detached |------>-----| attached |
| | set active | | probe() | |
|__________| |____________| |___________|
| |
^ slice set
| set inactive
successful detach |
or pf reset |
____|_______ |
| | driver bus |
-----------| detaching |---<-------------
| | | remove()
^ |____________|
| timeout |
--<---------------
More on the slice states:
(a) The primary reason to run the slice life cycle this way is that a
user must be able to create and configure a slice on the eswitch
system before the host system discovers the slice device. Hence, a
slice device will not be accessible from the host system until it
is explicitly activated by the user. Typically this is done after
all necessary slice attributes are configured.
(b) A user wants to create and configure the slice on the eswitch
system while the host system is in a powered-down state.
(c) The slice interface for sub-functions should be uniform regardless
of whether slice devices are deployed on the eswitch system or the
host system.
(d) If the host system software where the slice is deployed is
compromised and does not detach the slice, the slice remains in
detaching state until the PF driver is reset on the host. Such a
slice won't be usable until it is detached gracefully by the host
software. Forcefully changing its state and reusing it can lead to
unexpected behavior and access of slice resources. Hence, when
opstate = detaching and state = inactive, such a slice cannot be
activated.
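The admin/operational state rules above can be modeled with a short
sketch (Python; illustrative only, not kernel code - the class and
method names here are made up for this example):

```python
# Sketch of the slice admin/operational state rules described above.
class Slice:
    def __init__(self):
        self.state = "inactive"    # admin state, user controlled
        self.opstate = "detached"  # operational state, driver driven

    def set_state(self, new_state):
        if new_state == "active":
            # Rule (d): a slice stuck in 'detaching' (e.g. compromised
            # host that never detached) must not be reactivated.
            if self.opstate == "detaching":
                raise ValueError("cannot activate: slice still detaching")
            self.state = "active"
        elif new_state == "inactive":
            self.state = "inactive"
            if self.opstate == "attached":
                self.opstate = "detaching"  # host notified to detach
        else:
            raise ValueError("unknown admin state")

    def driver_probe(self):
        # Driver bus probe() moves detached -> attached for active slices.
        if self.state == "active" and self.opstate == "detached":
            self.opstate = "attached"

    def driver_remove_done(self):
        # Successful detach (or PF reset) moves detaching -> detached.
        if self.opstate == "detaching":
            self.opstate = "detached"

s = Slice()
s.set_state("active")
s.driver_probe()
assert (s.state, s.opstate) == ("active", "attached")
s.set_state("inactive")
assert s.opstate == "detaching"
s.driver_remove_done()
assert s.opstate == "detached"
```

Note how rule (d) falls out naturally: a slice whose opstate is still
'detaching' cannot be re-activated.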
SF sysfs, virtbus and devlink plumbing:
---------------------------------------
(a) Given that a sub-function device may have PCI BAR resource(s),
power management, IRQ configuration, persistent naming etc., a clear
sysfs view like that of an existing PCI device is desired.
Therefore, an SF resides on a new bus called virtbus.
This virtbus holds one or more SFs of a parent PCI device.
Whenever the user activates a slice, the corresponding sub-function
device is created on the virtbus and attached to the driver.
(b) Each SF has a user-defined unique number associated with it,
called 'sfnum'. This sfnum is provided at SF slice creation
time. Multiple uses of the sfnum are explained in detail below.
(c) An active SF slice has a unique 'struct device' anchored on the
virtbus. An SF is identified using a unique name on the virtbus.
(d) An SF's device (sysfs) name is created using an id assigned by the
virtbus core (via IDA), such as:
/sys/bus/virtbus/devices/mlx5_sf.100
/sys/bus/virtbus/devices/mlx5_sf.2
(e) The scope of an sfnum is within the devlink instance that supports
the SF life cycle and SF devlink port life cycle.
This sfnum is populated in /sys/bus/virtbus/devices/100/sfnum
(f) The persistent name of the netdevice and RDMA device of a virtbus
SF device is prepared using the parent device of the SF and the
sfnum of the SF. Such as,
/sys/bus/virtbus/devices/mlx5_sf.100 ->
../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/100
Netdev name for such an SF = enp6s0f0s1,
where <en><parent_dev naming scheme><s for sf><sfnum_in_decimal>.
This generates a unique netdev name because the parent is involved
in the device naming.
The RDMA device persistent name is prepared similarly, such as
'rocep6s0f0s1'.
(g) The SF devlink instance name is prepared using the SF's parent
bus/device and sfnum, such as pci/0000:06:00.0%sf1.
This scheme ensures that the SF's devlink device name remains
predictable using the sfnum regardless of the dynamic virtbus
device name/index.
(h) Each virtbus SF device driver id is defined by the virtbus core.
This driver id is used to bind a virtbus SF device to the SF
driver that has a matching driver id provided at driver
registration time.
This is also visible via modpost tools.
(i) A new devlink eswitch port flavour named 'pcisf' is introduced for
PCI SF devices, similar to the existing flavour 'pcivf' for PCI VFs.
(j) The devlink eswitch port for a PCI SF is added/deleted whenever the
SF slice is added/deleted.
(k) SF representor netdev phys_port_name=pf0sf1
Format is: pf<pf_number>sf<user_assigned_sfnum>
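As an illustration of the naming rules in (f), (g) and (k) above, here
is a small Python sketch (the string formats are taken from this RFC;
the helper function names are made up for this example):

```python
def sf_netdev_name(parent_netdev, sfnum):
    # (f): <en><parent_dev naming scheme><s for sf><sfnum_in_decimal>
    return f"{parent_netdev}s{sfnum}"

def sf_rdma_name(parent_netdev, sfnum):
    # (f): same scheme with the RDMA 'roce' prefix replacing 'en'
    return "roce" + parent_netdev[len("en"):] + f"s{sfnum}"

def sf_devlink_name(parent_dev, sfnum):
    # (g): SF devlink instance name is parent bus/device plus %sf<sfnum>
    return f"{parent_dev}%sf{sfnum}"

def sf_rep_phys_port_name(pfnum, sfnum):
    # (k): representor phys_port_name pf<pf_number>sf<user_assigned_sfnum>
    return f"pf{pfnum}sf{sfnum}"

assert sf_netdev_name("enp6s0f0", 1) == "enp6s0f0s1"
assert sf_rdma_name("enp6s0f0", 1) == "rocep6s0f0s1"
assert sf_devlink_name("pci/0000:06:00.0", 1) == "pci/0000:06:00.0%sf1"
assert sf_rep_phys_port_name(0, 1) == "pf0sf1"
```

Because the parent device participates in every name, the results stay
unique and predictable across reboots regardless of the dynamic virtbus
device index.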
Examples:
---------
- Add a slice of flavour SF with sf identifier 4 on eswitch system:
$ devlink slice add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 4 index 100
- Show a newly added slice on eswitch system:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "inactive",
            "opstate": "detached"
        }
    }
}
- Show eswitch port configuration for the SF on the eswitch system:
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0 flavour physical port 0
pci/0000:06:00.0/8000: type eth netdev ens2f0sf4 flavour pcisf pfnum 0
sfnum 4
Here the eswitch port index 8000 is assigned by the vendor driver.
- Activate the SF slice to trigger creation of a virtbus SF device on
the host system:
$ devlink slice set pci/0000:06:00.0/100 state active
- Query SF slice state on eswitch system:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "active",
            "opstate": "detached"
        }
    }
}
- Query SF slice state on the eswitch system once the driver is loaded
on the SF:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "active",
            "opstate": "attached"
        }
    }
}
- Show devlink instances on host system:
$ devlink dev show
pci/0000:06:00.0
pci/0000:06:00.0%sf4
- Show devlink ports on host system:
$ devlink port show
pci/0000:06:00.0/0: flavour physical type eth netdev enp6s0f0
pci/0000:06:00.0%sf4/0: flavour virtual type eth netdev enp6s0f0s4
- Deactivate a slice that may currently be active:
$ devlink slice set pci/0000:06:00.0/100 state inactive
- Query SF slice state on the eswitch system once deactivation is
triggered.
Output while it is in detaching state:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "inactive",
            "opstate": "detaching"
        }
    }
}
Output once detached:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "inactive",
            "opstate": "detached"
        }
    }
}
- Delete the recently deactivated slice:
$ devlink slice del pci/0000:06:00.0/100
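The -jp output shown in the examples above is machine-readable JSON, so
a management script can poll it to decide when an SF is usable. A
hedged sketch (Python; the keys mirror the example output in this RFC,
and the helper below is illustrative, not part of the proposal):

```python
import json

# Example 'devlink slice show -jp' output, as shown above.
output = """
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "active",
            "opstate": "attached"
        }
    }
}
"""

def slice_ready(show_json, slice_name):
    # An SF is usable once the user activated it (state=active) and the
    # host driver finished probing it (opstate=attached).
    info = json.loads(show_json)["slice"][slice_name]
    return info["state"] == "active" and info["opstate"] == "attached"

assert slice_ready(output, "pci/0000:06:00.0/100")
```

The same check would return false while the slice is still in
detaching or detached opstate.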
FAQs:
-----
1. What are the differences between a PCI VF slice and a PCI SF slice?
Ans:
(a) The PCI SF slice life cycle is driven by the user at the
individual slice level.
PCI VF slice(s) are created and destroyed by the vendor driver
such as mlx5_core. Typically, this is done when SR-IOV is
enabled/disabled.
(b) PCI VF slice state cannot be changed by the user, as this is
currently not supported by the PCI core and vendor devices. They
are always created in active state.
PCI SF slice state is controlled by the user.
2. What are the similarities between a PCI VF and a PCI SF slice?
Ans:
(a) Both slices have similar config attributes.
(b) Both slices have an eswitch devlink port and a representor
netdevice.
3. What about handling a VF slice similar to an SF slice?
Ans:
Yes, this is desired. When a VF vendor device supports this mode,
there will be at least one difference between SF and VF
slice handling, i.e. SR-IOV enablement will continue using sysfs on
the host system. Once SR-IOV is enabled, VF slice commands should
function in a similar way to SF slice commands.
4. What are the similarities between an SF slice and a PF slice?
Ans:
(a) Both slices have similar config attributes.
(b) Both slices have an eswitch devlink port and a representor
netdevice.
5. Can a slice be used to have a dynamic PF as a slice?
Ans:
Yes. Whenever a vendor device supports a dynamic PF, the same life
cycle, APIs and attributes can be used for the PF slice.
6. Can a slice exist without an eswitch?
Ans: Yes
7. Why not overload devlink port instead of creating a new slice
object?
Ans:
In one implementation a devlink port could be overloaded/retrofitted
to achieve what the slice object wants to achieve.
The same reasoning could be applied to overload and retrofit a
netdevice to achieve what a slice object wants to achieve.
However, it is more natural to create a new object (slice) that
represents a device, for the rationale below.
(a) Even though a devlink slice has a devlink port attached to it,
it is a narrow model to always have such an association. It limits
devlink to be used only with an eswitch.
(b) A slice undergoes state transitions from
create->config->activate->inactivate->delete.
It would be odd to have a few port flavours follow state
transitions while a few don't.
8. Why is a bus needed?
Ans:
(a) To get unique, persistent and deterministic names for the netdev
and rdma dev of a slice/SF.
(b) The device life cycle operates using the current driver model,
similar to the pci bus. No need to invent a new life cycle scheme.
(c) To follow a uniform device config model for VFs and SFs, where
the user must be able to configure the attributes before binding a
VF/SF to a driver.
(d) In the future, if needed, a virtbus SF device can be bound to some
other driver, similar to how a PCI PF/VF device can be bound to
the mlx5_core or vfio_pci driver.
(e) To reuse the kernel's existing power management framework to
suspend/resume SF devices.
(f) When using SFs in a smartnic-based system where the SF eswitch
port and the SF slice are located on two different systems, the
user desires to configure the eswitch before activating the SF
slice device.
A bus allows following the existing driver model for the above needs:
create->configure_multiple_attributes->deploy.
9. Why not further extend (or abuse) the mdev bus?
Ans: Because
(a) mdev and vfio are coupled together from an iommu perspective.
(b) mdev shouldn't be further abused in its current state,
as suggested by Greg KH and Jason Gunthorpe.
(c) One bus for the "sw_api" purpose, hence the new virtbus.
(d) A few don't like the weird guid/uuid of mdev for the sub-function
purpose.
(e) If an SF needs to be mapped to a VM using mdev, a new mdev slice
flavour should be created for life-cycling via devlink, or the
life cycle may continue via mdev sysfs; but the remaining slice
handling would be done via devlink, similar to how PCI VF slices
are managed.
10. Is it OK to reuse the virtbus used for matching services?
Ans: Yes.
Greg KH guided in [1] that it is OK for virtbus to hold devices with
different attributes, such as matching services vs. SF devices, where
an SF device will have
(a) PCI BAR resource info
(b) IRQ info
(c) sfnum
Greg KH also guided in [1] that it is OK to anchor the netdev and
rdma dev of an SF device to the SF virtbus device, while the rdma and
netdev of a matching-service virtbus device are anchored at the
parent PCI device.
[1] https://www.spinics.net/lists/linux-rdma/msg87124.html
11. Why are the platform and mfd frameworks not used for creating
slices?
Ans:
(a) As the platform documentation clearly describes, platform devices
are for a specific platform. They are autonomous devices, unlike
user-created slices.
(b) Slices are dynamic in nature, where each individual slice is
created/destroyed independently.
(c) MFD (multi-function) devices are for a device that comprises
more than one non-unique yet varying hardware functionality.
Here, in contrast, each slice is of the same type as the parent
device, with fewer capabilities than the parent device. Given
that mfd devices are built on top of platform devices, they
inherit the limitations of platform devices.
(d) In a few other threads Greg KH said not to (ab)use platform
devices for such a purpose.
(e) Given that, for networking-related slices, a slice is linked to
an eswitch which is managed by devlink, having the life cycle
through devlink confines locking/synchronization to a single
subsystem.
12. A given slice supports multiple classes of devices, such as RDMA,
netdev and vdpa devices. How can I disable/enable such attributes
of a slice?
Ans:
Slice attributes should be extended in vendor-neutral and also
vendor-specific ways to enable/disable such attributes depending
on the attribute type. This should be handled in future patches.
virtbus limitations:
--------------------
1. The virtbus bus will not support IOMMU the way the pci bus does.
Hence, all protocol devices (such as netdev, rdmadev) must use their
parent's DMA device.
The RDMA subsystem will use ib_device->dma_device with the existing
ib_dma_* wrappers.
netdev will use mlx5_core_dev->pci_dev for dma purposes.
This is suggested/hinted by Christoph and Jason.
Why is it this way?
(a) Because currently only rdma and netdev intend to use it.
(b) In the future, if more use cases arise, a virtbus device can share
the DMAR and group of the same parent PCI device for in-kernel use
cases.
2. Dedicated or shared IRQ vector(s) assignment per sub-function and
its exposure in sysfs will not be supported in the initial version.
It will be supported in a future series.
3. PCI BAR resource information as resource files in sysfs will not be
supported in the initial version.
It will be supported in a future series.
Post initial series:
--------------------
The following plumbing will be done after the initial series.
1. Max number of SF slices resource at the devlink level
2. Max IRQ vectors resource config at the devlink level
3. SF life cycle for kernel vDPA support, still under discussion
4. systemd/udev user space patches for persistent naming of
netdev and rdma devices
Out-of-scope:
-------------
1. Creating mdev/vhost backend/vdpa devices using devlink slice APIs.
To support vhost backend 'struct device' creation, the SF slice
create/set API can be extended to enable/disable specific
capabilities of the slice, such as enabling/disabling the kernel
vDPA feature or the RDMA device for an SF slice.
2. A similar extension is applicable to a VF slice.
3. Multi-host support is orthogonal to this and can be extended
in the future.
Example software/system view:
-----------------------------
_______
| admin |
| user |----------
|_______| |
| |
____|____ __|______ _____________
| | | | | |
| devlink | | ovs | | user |
| tool | |_________| | application |
|_________| | |_____________|
| | | |
-----------|-------------|-------------------|-------|-----------
| | +----------+ +----------+
| | | netdev | | rdma dev |
| | +----------+ +----------+
(slice cmds, | ^ ^
add/del/set) | | |
| | +-------------|
_____|___ | | ____|________
| | | | | |
| devlink | +------------+ | | mlx5_core/ib|
| kernel | | rep netdev | | | drivers |
|_________| +------------+ | |_____________|
| | | ^
(slice cmds) | | (probe/remove)
_____|____ | | ____|_____
| | | +--------------+ | |
| mlx5_core|--------- | virtbus dev |---| virtbus |
| driver | +--------------+ | driver |
|__________| |__________|
| ^
(sf add/del, events) |
| (device add/del)
_____|____ ____|_____
| | | |
| PCI NIC |---- admin activate/deactive | mlx5_core|
|__________| deactive events ---->| driver |
|__________|