Message-ID: <AM0PR05MB4866BAF932333581BA7E0C5FD1C00@AM0PR05MB4866.eurprd05.prod.outlook.com>
Date: Wed, 8 Apr 2020 06:10:19 +0000
From: Parav Pandit <parav@...lanox.com>
To: Parav Pandit <parav@...lanox.com>,
Jakub Kicinski <kuba@...nel.org>,
Saeed Mahameed <saeedm@...lanox.com>
CC: "sridhar.samudrala@...el.com" <sridhar.samudrala@...el.com>,
Aya Levin <ayal@...lanox.com>,
"andrew.gospodarek@...adcom.com" <andrew.gospodarek@...adcom.com>,
"sburla@...vell.com" <sburla@...vell.com>,
"jiri@...nulli.us" <jiri@...nulli.us>,
Tariq Toukan <tariqt@...lanox.com>,
"davem@...emloft.net" <davem@...emloft.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Vlad Buslov <vladbu@...lanox.com>,
"lihong.yang@...el.com" <lihong.yang@...el.com>,
Ido Schimmel <idosch@...lanox.com>,
"jgg@...pe.ca" <jgg@...pe.ca>,
"fmanlunas@...vell.com" <fmanlunas@...vell.com>,
"oss-drivers@...ronome.com" <oss-drivers@...ronome.com>,
"leon@...nel.org" <leon@...nel.org>,
"grygorii.strashko@...com" <grygorii.strashko@...com>,
"michael.chan@...adcom.com" <michael.chan@...adcom.com>,
Alex Vesker <valex@...lanox.com>,
"snelson@...sando.io" <snelson@...sando.io>,
"linyunsheng@...wei.com" <linyunsheng@...wei.com>,
"magnus.karlsson@...el.com" <magnus.karlsson@...el.com>,
"dchickles@...vell.com" <dchickles@...vell.com>,
"jacob.e.keller@...el.com" <jacob.e.keller@...el.com>,
Moshe Shemesh <moshe@...lanox.com>,
Mark Zhang <markz@...lanox.com>,
"aelior@...vell.com" <aelior@...vell.com>,
Yuval Avnery <yuvalav@...lanox.com>,
"drivers@...sando.io" <drivers@...sando.io>,
mlxsw <mlxsw@...lanox.com>,
"GR-everest-linux-l2@...vell.com" <GR-everest-linux-l2@...vell.com>,
Yevgeny Kliteynik <kliteyn@...lanox.com>,
"vikas.gupta@...adcom.com" <vikas.gupta@...adcom.com>,
Eran Ben Elisha <eranbe@...lanox.com>
Subject: RE: [RFC] current devlink extension plan for NICs
Hi Saeed,
> From: netdev-owner@...r.kernel.org <netdev-owner@...r.kernel.org> On
> Behalf Of Parav Pandit
> On 3/28/2020 2:12 AM, Jakub Kicinski wrote:
> > On Fri, 27 Mar 2020 19:45:53 +0000 Saeed Mahameed wrote:
>
> >> from what i understand, a real slice is a full isolated HW pipeline
> >> with its own HW resources and HW based isolation, a slice rings/hw
> >> resources can never be shared between different slices, just like a
> >> vf, but without the pcie virtual function back-end..
>
> >> We need a clear-cut definition of what a Sub-function slice is.. this
> >> RFC doesn't seem to address that clearly.
> >
Did you get a chance to review the content below?
Can you please confirm whether the description below covers what you were looking for?
> > Definitely. I'd say we need a clear definition of (a) what a
> > sub-functions is, and (b) what a slice is.
> >
>
> Below is the extended content for Jiri's RFC that addresses Saeed's
> point about a clear definition of slice, sub-function and their
> plumbing. A few things were already defined by Jiri's RFC, such as
> diagrams and examples, but it is all put together below for
> completeness.
>
> This RFC extension covers overall requirements, design, alternatives
> considered and rationale for current design choices to support sub-functions
> using slices.
>
> Overview:
> ---------
> A user wants to use multiple netdev and rdma devices of a PCI device
> without using PCI SR-IOV. This is done by creating slices of a PCI
> device using the devlink tool.
>
> Requirements:
> -------------
> 1. Able to make multiple slices of a device
> 2. Able to provision such slices for in-kernel use
> 3. Able to have persistent names for the netdev and rdma device of a
>    slice
> 4. Reuse current power management (suspend/resume) kernel infra for
>    slice devices; not reinvent locking etc. in each driver or devlink
> 5. Able to control, steer and offload network traffic of slices using
>    existing devlink eswitch switchdev support
> 6. Not suddenly change the existing pci pf/vf naming scheme by
>    introducing slices
> 7. Support slices of a device in an iommu-enabled kernel
> 8. Dynamically create/delete a slice
> 9. Ability to configure a slice before deploying it
> 10. Get/set slice attributes like PCI VF slice attributes
> 11. Reuse/extend virtbus for carrying SF slice devices
> 12. Hot-plug a slice device in the host system from the NIC
>     (eswitch system) side without running any agent in the
>     host system
> 13. Have a unified interface for slice management regardless of
>     whether it is deployed on the eswitch system or on the host system
> 14. User must be able to create a portion of the device from the
>     eswitch system and attach it to the untrusted host system. This
>     host system is not accessible to the eswitch system for the
>     purpose of device life-cycle and initial configuration.
>
> Slice:
> ------
> A slice represents a portion of the device. A slice is a generic
> object that represents either a PCI PF, a PCI VF, or a PCI
> sub-function (SF), described below.
>
> Sub-function (SF):
> ------------------
> - A sub-function is a portion of the PCI device which supports
>   multiple classes of devices such as netdev, rdma and more.
> - An SF netdev has its own dedicated queues (txq, rxq).
> - An SF rdma device has its own QP1, GID table and rdma resources.
>   An SF's rdma resources have their own resource namespace.
> - An SF supports eswitch representation and full offload support
>   when it is running with eswitch support.
> - User must configure the eswitch to send/receive packets for an SF.
> - An SF shares PCI-level resources with other SFs and/or with its
>   parent PCI function.
>   For example, an SF shares IRQ vectors with other SFs and its
>   PCI function.
>   In future it may have a dedicated IRQ vector per SF.
>   An SF has a dedicated window in PCI BAR space that is not shared
>   with other SFs or the PF. This ensures that when an SF is assigned
>   to an application, only that application can access device
>   resources.
>
> Overall design:
> ---------------
> A new flavour of slice is created that represents a portion of the
> device. It is equivalent to a PCI VF slice, but the new slice exists
> without PCI SR-IOV. This can possibly scale to more than the number of
> SR-IOV VFs. A new slice flavour 'pcisf' is introduced, which is
> explained later in this document.
>
> The devlink subsystem is extended to create, delete and deploy a slice
> using a devlink instance.
>
> The slice life cycle is managed using the devlink commands explained
> below.
>
> (a) Life cycle handling consists of 3 main commands, i.e. add, delete
>     and state change.
> (b) Add/delete commands create or delete the slice of a device,
>     respectively.
> (c) A slice undergoes one or more configurations before it is
>     activated. This may include:
>     (a) slice hardware address configuration
>     (b) representor netdevice configuration
>     (c) network policy and steering configuration through the
>         representor
> (d) Once a slice is fully configured, the user activates it.
>     Slice activation triggers device enumeration and binding to the
>     driver.
> (e) Each slice goes through a state transition during its life cycle.
> (f) Each slice's admin state is controlled by the user. The slice
>     operational state is updated based on driver attach/detach events
>     on the host system.
>
> The slice states are as follows:
> admin state description
> ----------- ------------
> 1. inactive Slice enters this state when it is newly created.
> User typically does most of the configuration in
> this state before activating the slice.
>
> 2. active State when the slice has just been activated by the user.
>
> operational description
> state
> ----------- ------------
> 1. attached State when slice device is bound to the host
> driver. When the slice device is unbound from the
> host driver, slice device exits this state and
> enters detaching state.
>
> 2. detaching State when host is notified to deactivate the
> slice device and slice device may be undergoing
> detachment from host driver. When slice device is
> fully detached from the host driver, slice exits
> this state and enters detached state.
>
> 3. detached State entered once the driver is fully unbound
> from the slice device.
>
> slice state machine:
> --------------------
> slice state set inactive
> ----<------------------<---
> | or slice delete |
> | |
> __________ ____|_______ ____|_______
> | | slice add | |slice state | |
> | |-------->---| |------>-----| |
> | invalid | | inactive | set active | active |
> | | slice del | | | |
> |__________|--<---------|____________| |____________|
>
> slice device operational state machine:
> ---------------------------------------
> __________ ____________ ___________
> | | slice state | |driver bus | |
> | invalid |-------->-----| detached |------>-----| attached |
> | | set active | | probe() | |
> |__________| |____________| |___________|
> | |
> ^ slice set
> | set inactive
> successful detach |
> or pf reset |
> ____|_______ |
> | | driver bus |
> -----------| detaching |---<-------------
> | | | remove()
> ^ |____________|
> | timeout |
> --<---------------
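>
> As an illustrative C sketch only, the admin and operational states
> from the tables and diagrams above can be restated as enums. These
> symbol names are hypothetical and not part of any existing kernel
> uAPI; they only mirror the state tables above.
>
>     enum devlink_slice_state {
>         DEVLINK_SLICE_STATE_INACTIVE,    /* newly created; user configures
>                                           * the slice in this state
>                                           */
>         DEVLINK_SLICE_STATE_ACTIVE,      /* activated by the user */
>     };
>
>     enum devlink_slice_opstate {
>         DEVLINK_SLICE_OPSTATE_DETACHED,  /* no host driver bound */
>         DEVLINK_SLICE_OPSTATE_ATTACHED,  /* host driver bound */
>         DEVLINK_SLICE_OPSTATE_DETACHING, /* host notified; unbind from
>                                           * the host driver in progress
>                                           */
>     };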
>
> More on the slice states:
> (a) The primary reason to run the slice life cycle this way is that a
>     user must be able to create and configure a slice on the eswitch
>     system before the host system discovers the slice device. Hence, a
>     slice device will not be accessible from the host system until it
>     is explicitly activated by the user. Typically this is done after
>     all necessary slice attributes are configured.
> (b) A user wants to create and configure the slice on the eswitch
>     system while the host system is in powered-down state.
> (c) The slice interface for sub-functions should be uniform regardless
>     of whether slice devices are deployed on the eswitch system or on
>     the host system.
> (d) If the host system software where a slice is deployed is
>     compromised and doesn't detach the slice, the slice remains in
>     detaching state until the PF driver is reset on the host. Such a
>     slice won't be usable until it is detached gracefully by the host
>     software. Forcefully changing its state and reusing it can lead to
>     unexpected behavior and access of slice resources. Hence, when
>     opstate = detaching and state = inactive, such a slice cannot be
>     activated.
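>
> As a rough sketch of how the lifecycle commands above could map onto a
> vendor driver, the following hypothetical callback structure mirrors
> the add/delete/state-set commands. None of these symbols exist in the
> kernel today; this only illustrates the proposed flow, reusing the
> illustrative enum from the previous sketch.
>
>     struct devlink_slice_ops {
>         /* 'devlink slice add ... flavour pcisf pfnum P sfnum N index I' */
>         int (*slice_add)(struct devlink *devlink, u16 flavour,
>                          u16 pfnum, u32 sfnum, u32 index,
>                          struct netlink_ext_ack *extack);
>         /* 'devlink slice del ...' */
>         int (*slice_del)(struct devlink *devlink, u32 index,
>                          struct netlink_ext_ack *extack);
>         /* 'devlink slice set ... state active/inactive'; activation
>          * creates the virtbus SF device on the host, deactivation
>          * starts the detaching flow described above
>          */
>         int (*slice_state_set)(struct devlink *devlink, u32 index,
>                                enum devlink_slice_state state,
>                                struct netlink_ext_ack *extack);
>     };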
>
> SF sysfs, virtbus and devlink plumbing:
> ---------------------------------------
> (a) Given that a sub-function device may have PCI BAR resource(s),
>     power management, IRQ configuration, persistent naming etc., a
>     clear sysfs view like that of an existing PCI device is desired.
>
>     Therefore, an SF resides on a new bus called virtbus.
>     This virtbus holds one or more SFs of a parent PCI device.
>
>     Whenever the user activates a slice, the corresponding
>     sub-function device is created on the virtbus and attached to the
>     driver.
>
> (b) Each SF has a user-defined unique number associated with it,
>     called 'sfnum'. This sfnum is provided at SF slice creation
>     time. The multiple uses of the sfnum are explained in detail
>     below.
> (c) An active SF slice has a unique 'struct device' anchored on
>     virtbus. An SF is identified using a unique name on the virtbus.
> (d) An SF's device (sysfs) name is created using an ida assigned by
>     the virtbus core, such as:
>     /sys/bus/virtbus/devices/mlx5_sf.100
>     /sys/bus/virtbus/devices/mlx5_sf.2
> (e) The scope of an sfnum is within the devlink instance that supports
>     the SF lifecycle and SF devlink port lifecycle.
>     This sfnum is populated in /sys/bus/virtbus/devices/mlx5_sf.100/sfnum
> (f) The persistent names of the netdevice and RDMA device of a virtbus
>     SF device are prepared using the parent device of the SF and the
>     sfnum of the SF. For example:
>     /sys/bus/virtbus/devices/mlx5_sf.100 ->
>     ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/100
>     Netdev name for such an SF = enp6s0f0s1
>     where the format is
>     <en><parent_dev naming scheme><s for sf><sfnum_in_decimal>.
>     This generates a unique netdev name because the parent is involved
>     in the device naming.
>     The RDMA device persistent name will be formed similarly, such as
>     'rocep6s0f0s1'.
> (g) The SF devlink instance name is prepared using the SF's parent
>     bus/device and sfnum, such as pci/0000:06:00.0%sf1.
>     This scheme ensures that the SF's devlink device name remains
>     predictable using the sfnum regardless of the dynamic virtbus
>     device name/index.
> (h) Each virtbus SF device driver id is defined by the virtbus core.
>     This driver id is used to bind a virtbus SF device to the SF
>     driver that has a matching driver id provided at driver
>     registration time (see the sketch after this list).
>     This is also visible via modpost tools.
> (i) A new devlink eswitch port flavour named 'pcisf' is introduced for
>     PCI SF devices, similar to the existing flavour 'pcivf' for PCI
>     VFs.
> (j) A devlink eswitch port for a PCI SF is added/deleted whenever the
>     SF slice is added/deleted.
> (k) SF representor netdev phys_port_name=pf0sf1
> Format is: pf<pf_number>sf<user_assigned_sfnum>
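>
> To make the driver-id binding in (h) concrete, below is a hypothetical
> C sketch of an SF driver registering on virtbus with a matching id
> table. The virtbus structure and symbol names are illustrative
> assumptions; only the driver-model pattern (id table match,
> probe/remove) is taken from the description above.
>
>     static const struct virtbus_dev_id mlx5_sf_id_table[] = {
>         { .id = VIRTBUS_DEVID_MLX5_SF }, /* driver id from virtbus core */
>         { },
>     };
>
>     static int mlx5_sf_probe(struct virtbus_device *vdev)
>     {
>         /* instantiate the netdev/rdma protocol devices of this SF;
>          * the slice opstate moves to 'attached' on success
>          */
>         return 0;
>     }
>
>     static void mlx5_sf_remove(struct virtbus_device *vdev)
>     {
>         /* tear down the protocol devices; the eswitch side then moves
>          * the slice from 'detaching' to 'detached'
>          */
>     }
>
>     static struct virtbus_driver mlx5_sf_driver = {
>         .probe    = mlx5_sf_probe,
>         .remove   = mlx5_sf_remove,
>         .id_table = mlx5_sf_id_table,
>     };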
>
> Examples:
> ---------
> - Add a slice of flavour SF with sf identifier 4 on eswitch system:
>
> $ devlink slice add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 4 index 100
>
> - Show a newly added slice on eswitch system:
> $ devlink slice show pci/0000:06:00.0/100 -jp
> {
>     "slice": {
>         "pci/0000:06:00.0/100": {
>             "flavour": "pcisf",
>             "pfnum": 0,
>             "sfnum": 4,
>             "state": "inactive",
>             "opstate": "detached"
>         }
>     }
> }
>
> - Show eswitch port configuration for the SF on eswitch system:
> $ devlink port show
> pci/0000:06:00.0/65535: type eth netdev ens2f0 flavour physical port 0
> pci/0000:06:00.0/8000: type eth netdev ens2f0sf1 flavour pcisf pfnum 0 sfnum 4
>
> Here eswitch port index 8000 is assigned by the vendor driver.
>
> - Activate the SF slice to trigger creation of a virtbus SF device in
>   the host system:
> $ devlink slice set pci/0000:06:00.0/100 state active
>
> - Query SF slice state on eswitch system:
> $ devlink slice show pci/0000:06:00.0/100 -jp
> {
>     "slice": {
>         "pci/0000:06:00.0/100": {
>             "flavour": "pcisf",
>             "pfnum": 0,
>             "sfnum": 4,
>             "state": "active",
>             "opstate": "detached"
>         }
>     }
> }
>
> - Query SF slice state on eswitch system once the driver is loaded on
>   the SF:
> $ devlink slice show pci/0000:06:00.0/100 -jp
> {
>     "slice": {
>         "pci/0000:06:00.0/100": {
>             "flavour": "pcisf",
>             "pfnum": 0,
>             "sfnum": 4,
>             "state": "active",
>             "opstate": "attached"
>         }
>     }
> }
>
> - Show devlink instances on host system:
> $ devlink dev show
> pci/0000:06:00.0
> pci/0000:06:00.0%sf4
>
> - Show devlink ports on host system:
> $ devlink port show
> pci/0000:06:00.0/0: flavour physical type eth netdev enp6s0f0
> pci/0000:06:00.0%sf4/0: flavour virtual type eth netdev enp6s0f0s4
>
> - Mark a slice inactive when the slice is in active state:
> $ devlink slice set pci/0000:06:00.0/100 state inactive
>
> - Query SF slice state on eswitch system once deactivation is
>   triggered.
>   Output when it is in detaching state:
>
> $ devlink slice show pci/0000:06:00.0/100 -jp
> {
>     "slice": {
>         "pci/0000:06:00.0/100": {
>             "flavour": "pcisf",
>             "pfnum": 0,
>             "sfnum": 4,
>             "state": "inactive",
>             "opstate": "detaching"
>         }
>     }
> }
>
> Output once detached:
> $ devlink slice show pci/0000:06:00.0/100 -jp
> {
>     "slice": {
>         "pci/0000:06:00.0/100": {
>             "flavour": "pcisf",
>             "pfnum": 0,
>             "sfnum": 4,
>             "state": "inactive",
>             "opstate": "detached"
>         }
>     }
> }
>
> - Delete the slice which was just deactivated:
> $ devlink slice del pci/0000:06:00.0/100
>
> FAQs:
> ----
> 1. What are the differences between PCI VF slice and PCI SF slice?
> Ans:
> (a) PCI SF slice lifecycle is driven by the user at the individual
>     slice level.
>     PCI VF slice(s) are created and destroyed by the vendor driver,
>     such as mlx5_core. Typically, this is done when SR-IOV is
>     enabled/disabled.
> (b) The state of PCI VF slices cannot be changed by the user, as this
>     is currently not supported by the PCI core and vendor devices.
>     They are always created in active state.
>     PCI SF slice state is controlled by the user.
>
> 2. What are the similarities between PCI VF and PCI SF slice?
> (a) Both slices have similar config attributes.
> (b) Both slices have eswitch devlink port and representor netdevice.
>
> 3. What about handling VF slice similar to SF slice?
> Ans:
> Yes, this is desired. When a vendor VF device supports this mode,
> there will be at least one difference between SF and VF slice
> handling, i.e. SR-IOV enablement will continue to use sysfs on the
> host system. Once SR-IOV is enabled, VF slice commands should
> function in a similar way to SF slice commands.
>
> 4. What are the similarities between SF slice and PF slice?
> Ans:
> (a) Both slices have similar config attributes.
> (b) Both slices have eswitch devlink port and representor netdevice.
>
> 5. Can a slice be used to have a dynamic PF as a slice?
> Ans:
> Yes. Whenever a vendor device supports dynamic PFs, the same
> lifecycle, APIs and attributes can be used for a PF slice.
>
> 6. Can a slice exist without an eswitch?
> Ans: Yes
>
> 7. Why not overload devlink port instead of creating new object slice?
> Ans:
> In one implementation, a devlink port could be overloaded/retrofitted
> to achieve what the slice object wants to achieve.
> The same reasoning could be applied to overload and retrofit a
> netdevice to achieve what a slice object wants to achieve.
> However, it is more natural to create a new object (slice) that
> represents a device, for the rationale below.
> (a) Even though a devlink slice has a devlink port attached to it,
>     it is a narrow model to always require such an association. It
>     limits devlink to be used only with an eswitch.
> (b) A slice undergoes state transitions:
>     create->config->activate->inactivate->delete.
>     It is weird to have a few port flavours follow state transitions
>     while others don't.
>
> 8. Why a bus is needed?
> Ans:
> (a) To get unique, persistent and deterministic names for the netdev
>     and rdma dev of a slice/SF.
> (b) Device lifecycle operates using the existing driver model, similar
>     to the pci bus. There is no need to invent a new lifecycle scheme.
> (c) To follow a uniform device config model for VFs and SFs, where the
>     user must be able to configure the attributes before binding a
>     VF/SF to a driver.
> (d) In future, if needed, a virtbus SF device can be bound to some
>     other driver, similar to how a PCI PF/VF device can be bound to
>     the mlx5_core or vfio_pci driver.
> (e) Reuse the kernel's existing power management framework to
>     suspend/resume SF devices.
> (f) When using SFs in a smartnic-based system where the SF eswitch
>     port and the SF slice are located on two different systems, the
>     user desires to configure the eswitch before activating the SF
>     slice device.
>     A bus allows the above needs to follow the existing driver model:
>     create->configure_multiple_attributes->deploy.
>
> 9. Why not further extend (or abuse) mdev bus?
> Ans: Because
> (a) mdev and vfio are coupled together from an iommu perspective.
> (b) mdev shouldn't be further abused in its current state, as
>     suggested by Greg KH and Jason Gunthorpe.
> (c) One bus for the "sw_api" purpose, hence the new virtbus.
> (d) Some don't like the weird guid/uuid of mdev for sub-function
>     purposes.
> (e) If an SF needs to be mapped to a VM using mdev, a new mdev slice
>     flavour should be created for lifecycling via devlink, or the
>     life cycle may continue via mdev sysfs; but the remaining slice
>     handling would be done via devlink, similar to how PCI VF slices
>     are managed.
>
> 10. Is it OK to reuse virtbus, which is also used for matching
>     services?
> Ans: Yes.
> Greg KH guided in [1] that it is OK for virtbus to hold devices with
> different attributes, such as matching services vs SF devices, where
> an SF device will have:
> (a) PCI BAR resource info
> (b) IRQ info
> (c) sfnum
>
> Greg KH also guided in [1] that it is OK to anchor the netdev and rdma
> dev of an SF device to the SF virtbus device, while the rdma and
> netdev of a matching-service virtbus device are anchored at the parent
> PCI device.
>
> [1] https://www.spinics.net/lists/linux-rdma/msg87124.html
>
> 11. Why are the platform and MFD frameworks not used for creating
>     slices?
> Ans:
> (a) As the platform documentation clearly describes, platform devices
>     are for a specific platform. They are autonomous devices, unlike
>     user-created slices.
> (b) Slices are dynamic in nature, where each individual slice is
>     created/destroyed independently.
> (c) MFD (multi-function) devices are for a device that comprises more
>     than one non-unique yet varying hardware functionality. Here, in
>     contrast, each slice is of the same type as the parent device,
>     with fewer capabilities than the parent device. Given that MFD
>     devices are built on top of platform devices, they inherit
>     similar limitations.
> (d) In a few other threads Greg KH said not to (ab)use platform
>     devices for such purposes.
> (e) Given that networking-related slices are linked to the eswitch,
>     which is managed by devlink, having the lifecycle go through
>     devlink confines locking/synchronization to a single subsystem.
>
> 12. A given slice supports multiple classes of devices, such as RDMA,
>     netdev and vdpa devices. How can I disable/enable such attributes
>     of a slice?
> Ans:
> Slice attributes should be extended in vendor-neutral and also
> vendor-specific ways to enable/disable such attributes, depending on
> the attribute type. This should be handled in future patches.
>
> virtbus limitations:
> --------------------
> 1. The virtbus bus will not support the IOMMU the way the pci bus
>    does. Hence, all protocol devices (such as netdev, rdmadev) must
>    use their parent's DMA device; see the sketch after this list.
>    The RDMA subsystem will use ib_device->dma_device with the existing
>    ib_dma_ wrappers.
>    netdev will use mlx5_core_dev->pci_dev for DMA purposes.
>    This was suggested/hinted by Christoph and Jason.
>
>    Why is it this way?
>    (a) Because currently only rdma and netdev intend to use it.
>    (b) In future, if more use cases arise, a virtbus device can share
>        the DMAR and group of the same parent PCI device for in-kernel
>        use cases.
>
> 2. Dedicated or shared IRQ vector assignment per sub-function and its
>    exposure in sysfs will not be supported in the initial version.
>    It will be supported in a future series.
>
> 3. PCI BAR resource information as resource files in sysfs will not be
>    supported in the initial version.
>    It will be supported in a future series.
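>
> A minimal C sketch of the DMA rule in limitation 1, assuming a
> hypothetical virtbus_device layout with an embedded 'struct device';
> only the "DMA through the parent PCI device" rule comes from the text
> above.
>
>     static dma_addr_t sf_map_rx_buffer(struct virtbus_device *vdev,
>                                        void *buf, size_t len)
>     {
>         /* the SF has no IOMMU context of its own, so map against
>          * the parent PCI function that owns the SF
>          */
>         struct device *dma_dev = vdev->dev.parent;
>
>         return dma_map_single(dma_dev, buf, len, DMA_FROM_DEVICE);
>     }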
>
> Post initial series:
> --------------------
> The following plumbing will be done after the initial series.
>
> 1. Max number of SF slices resource at devlink level
> 2. Max IRQ vectors resource config at devlink level
> 3. SF lifecycle for kernel vDPA support, still under discussion
> 4. systemd/udev user space patches for persistent naming of netdev
>    and rdma devices
>
> Out-of-scope:
> -------------
> 1. Creating mdev/vhost backend/vdpa devices using devlink slice APIs.
>    To support vhost backend 'struct device' creation, the SF slice
>    create/set API can be extended to enable/disable specific
>    capabilities of the slice, such as enabling/disabling the kernel
>    vDPA feature or the RDMA device for an SF slice.
> 2. A similar extension is applicable for a VF slice.
> 3. Multi-host support is orthogonal to this and can be extended in
>    future.
>
> Example software/system view:
> -----------------------------
> _______
> | admin |
> | user |----------
> |_______| |
> | |
> ____|____ __|______ _____________
> | | | | | |
> | devlink | | ovs | | user |
> | tool | |_________| | application |
> |_________| | |_____________|
> | | | |
> -----------|-------------|-------------------|-------|-----------
> | | +----------+ +----------+
> | | | netdev | | rdma dev |
> | | +----------+ +----------+
> (slice cmds, | ^ ^
> add/del/set) | | |
> | | +-------------|
> _____|___ | | ____|________
> | | | | | |
> | devlink | +------------+ | | mlx5_core/ib|
> | kernel | | rep netdev | | | drivers |
> |_________| +------------+ | |_____________|
> | | | ^
> (slice cmds) | | (probe/remove)
> _____|____ | | ____|_____
> | | | +--------------+ | |
> | mlx5_core|--------- | virtbus dev |---| virtbus |
> | driver | +--------------+ | driver |
> |__________| |__________|
> | ^
> (sf add/del, events) |
> | (device add/del)
> _____|____ ____|_____
> | | | |
> | PCI NIC |---- admin activate/deactivate | mlx5_core|
> |__________| deactivate events ---->| driver |
> |__________|
>