Message-ID: <7d2b9a8e-07b3-093c-95d1-ab165b56c903@mellanox.com>
Date:   Mon, 30 Mar 2020 09:07:47 +0000
From:   Parav Pandit <parav@...lanox.com>
To:     Jakub Kicinski <kuba@...nel.org>,
        Saeed Mahameed <saeedm@...lanox.com>
CC:     "sridhar.samudrala@...el.com" <sridhar.samudrala@...el.com>,
        Aya Levin <ayal@...lanox.com>,
        "andrew.gospodarek@...adcom.com" <andrew.gospodarek@...adcom.com>,
        "sburla@...vell.com" <sburla@...vell.com>,
        "jiri@...nulli.us" <jiri@...nulli.us>,
        Tariq Toukan <tariqt@...lanox.com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Vlad Buslov <vladbu@...lanox.com>,
        "lihong.yang@...el.com" <lihong.yang@...el.com>,
        Ido Schimmel <idosch@...lanox.com>,
        "jgg@...pe.ca" <jgg@...pe.ca>,
        "fmanlunas@...vell.com" <fmanlunas@...vell.com>,
        "oss-drivers@...ronome.com" <oss-drivers@...ronome.com>,
        "leon@...nel.org" <leon@...nel.org>,
        "grygorii.strashko@...com" <grygorii.strashko@...com>,
        "michael.chan@...adcom.com" <michael.chan@...adcom.com>,
        Alex Vesker <valex@...lanox.com>,
        "snelson@...sando.io" <snelson@...sando.io>,
        "linyunsheng@...wei.com" <linyunsheng@...wei.com>,
        "magnus.karlsson@...el.com" <magnus.karlsson@...el.com>,
        "dchickles@...vell.com" <dchickles@...vell.com>,
        "jacob.e.keller@...el.com" <jacob.e.keller@...el.com>,
        Moshe Shemesh <moshe@...lanox.com>,
        Mark Zhang <markz@...lanox.com>,
        "aelior@...vell.com" <aelior@...vell.com>,
        Yuval Avnery <yuvalav@...lanox.com>,
        "drivers@...sando.io" <drivers@...sando.io>,
        mlxsw <mlxsw@...lanox.com>,
        "GR-everest-linux-l2@...vell.com" <GR-everest-linux-l2@...vell.com>,
        Yevgeny Kliteynik <kliteyn@...lanox.com>,
        "vikas.gupta@...adcom.com" <vikas.gupta@...adcom.com>,
        Eran Ben Elisha <eranbe@...lanox.com>
Subject: Re: [RFC] current devlink extension plan for NICs

On 3/28/2020 2:12 AM, Jakub Kicinski wrote:
> On Fri, 27 Mar 2020 19:45:53 +0000 Saeed Mahameed wrote:

>> from what i understand, a real slice is a full isolated HW pipeline
>> with its own HW resources and HW based isolation, a slice rings/hw
>> resources can never be shared between different slices, just like a vf,
>> but without the pcie virtual function back-end..

>> We need a clear-cut definition of what a Sub-function slice is.. this
>> RFC doesn't seem to address that clearly.
> 
> Definitely. I'd say we need a clear definition of (a) what a
> sub-functions is, and (b) what a slice is.
> 

Below is the extended content of Jiri's RFC, which addresses Saeed's point
about a clear definition of slice, sub-function and their plumbing.
A few things, such as diagrams and examples, are already defined by Jiri's
RFC, but I am putting it all together below for completeness.

This RFC extension covers overall requirements, design, alternatives
considered and rationale for current design choices to support
sub-functions using slices.

Overview:
---------
A user wants to use multiple netdev and rdma devices of a PCI device
without using PCI SR-IOV. This is done by creating slices of a
PCI device using devlink tool.

Requirements:
-------------
1. Able to make multiple slices of a device
2. Able to provision such a slice for in-kernel use
3. Able to have persistent names for the netdev and rdma device of a slice
4. Reuse current power management (suspend/resume) kernel infra for
   slice devices; not reinvent locking etc in each driver or devlink
5. Able to control, steer, offload network traffic of slices using
   existing devlink eswitch switchdev support
6. Do not abruptly change the existing PCI PF/VF naming scheme by
   introducing slices
7. Support slice of a device in an iommu enabled kernel
8. Dynamically create/delete a slice
9. Ability to configure a slice before deploying it
10. Get/Set slice attributes like PCI VF slice attributes
11. Reuse/extend virtbus for carrying SF slice devices
12. Hot-plug a slice device in host system from the NIC
    (eswitch system) side without running any agent in the
    host system
13. Have unified interface for slice management regardless of
    deploying on eswitch system or on host system.
14. User must be able to create a portion of the device from eswitch
    system and attach to the untrusted host system. This host system
    is not accessible to eswitch system for purpose of device
    life-cycle and initial configuration.

Slice:
------
A slice represents a portion of the device. A slice is a generic
object that represents either a PCI VF or PCI PF or
PCI sub function (SF) described below.

Sub-function (SF):
------------------
- A sub-function is a portion of the PCI device which supports multiple
  classes of devices such as netdev, rdma and more.
- An SF netdev has its own dedicated queues(txq, rxq).
- An SF rdma device has its own QP1, GID table and rdma resources.
  An SF's rdma resources have their own resource namespace.
- An SF supports eswitch representation and full offload support
  when it is running with eswitch support.
- User must configure eswitch to send/receive packets for an SF.
- An SF shares PCI level resources with other SFs and/or with its
  parent PCI function.
  For example, an SF shares IRQ vectors with other SFs and its
  PCI function.
  In future it may have dedicated IRQ vector per SF.
  An SF has a dedicated window in PCI BAR space that is not shared
  with other SFs or the PF. This ensures that when an SF is assigned to
  an application, only that application can access device resources.

Overall design:
---------------
A new flavour of slice is created that represents a portion of the
device. It is equivalent to a PCI VF slice, but new slice exists
without PCI SR-IOV. This can scale to possibly more than number
of SR-IOV VFs. A new slice flavour 'pcisf' is introduced which is
explained later in this document.

devlink subsystem is extended to create, delete and deploy a slice
using a devlink instance.

Slice life cycle is done using devlink commands explained below.

(a) Lifecycle commands consist of 3 main commands, i.e. add, delete and
    state change.
(b) Add/delete commands create or delete the slice of a device
    respectively.
(c) A slice undergoes one or more configuration steps before it is
    activated. This may include,
     (a) slice hardware address configuration
     (b) representor netdevice configuration
     (c) Network policy, steering configuration through representor
(d) Once a slice is fully configured, user activates it.
    Slice activation triggers device enumeration and binding to the
    driver.
(e) Each slice goes through a state transition during its life cycle.
(f) Each slice's admin state is controlled by the user. Slice
    operational state is updated based on driver attach/detach tasks on
    the host system.

    such as,
    admin state            description
    -----------            ------------
    1. inactive    Slice enters this state when it is newly created.
                   User typically does most of the configuration in
                   this state before activating the slice.

    2. active      State when slice is just activated by user.

    operational            description
       state
    -----------            ------------
    1. attached    State when slice device is bound to the host
                   driver. When the slice device is unbound from the
                   host driver, slice device exits this state and
                   enters detaching state.

    2. detaching   State when host is notified to deactivate the
                   slice device and slice device may be undergoing
                   detachment from host driver. When slice device is
                   fully detached from the host driver, slice exits
                   this state and enters detached state.

    3. detached    State when the slice device is fully unbound from the
                   host driver.

slice state machine:
--------------------
                               slice state set inactive
                              ----<------------------<---
                              | or  slice delete        |
                              |                         |
  __________              ____|_______              ____|_______
 |          | slice add  |            |slice state |            |
 |          |-------->---|            |------>-----|            |
 | invalid  |            |  inactive  | set active |   active   |
 |          | slice del  |            |            |            |
 |__________|--<---------|____________|            |____________|
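
The admin state diagram above can be written down as a small transition
table. This is only an illustrative sketch in Python, not proposed kernel or
devlink code; the command strings and the helper name next_admin_state() are
made up for this example. Note that "slice delete" on an active slice is
encoded as drawn in the diagram, i.e. it first takes the slice back to
inactive.

```python
from enum import Enum


class AdminState(Enum):
    INVALID = "invalid"    # slice does not exist
    INACTIVE = "inactive"  # created, typically being configured
    ACTIVE = "active"      # activated by the user


# (current state, command) -> next state, read off the diagram.
# The command strings are labels for this sketch only.
ADMIN_TRANSITIONS = {
    (AdminState.INVALID, "add"): AdminState.INACTIVE,
    (AdminState.INACTIVE, "del"): AdminState.INVALID,
    (AdminState.INACTIVE, "set active"): AdminState.ACTIVE,
    (AdminState.ACTIVE, "set inactive"): AdminState.INACTIVE,
    (AdminState.ACTIVE, "del"): AdminState.INACTIVE,
}


def next_admin_state(state, command):
    """Return the admin state reached by applying command in state."""
    try:
        return ADMIN_TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(
            f"{command!r} is not valid in state {state.value!r}"
        ) from None
```

Any (state, command) pair that has no arrow in the diagram is rejected
instead of being silently ignored.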

slice device operational state machine:
---------------------------------------
  __________                ____________              ___________
 |          | slice state  |            |driver bus  |           |
 | invalid  |-------->-----|  detached  |------>-----| attached  |
 |          | set active   |            | probe()    |           |
 |__________|              |____________|            |___________|
                                 |                        |
                                 ^                    slice state
                                 |                    set inactive
                            successful detach             |
                              or pf reset                 |
                             ____|_______                 |
                            |            | driver bus     |
                 -----------| detaching  |---<-------------
                 |          |            | remove()
                 ^          |____________|
                 |   timeout      |
                 --<---------------
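
The operational state machine above can be replayed the same way. Again this
is only a sketch under assumptions: the event names (set_active, probe,
set_inactive, detach_done, pf_reset, timeout) are labels invented for this
example and are not part of any existing devlink or virtbus API.

```python
# (current opstate, event) -> next opstate, read off the diagram.
OP_TRANSITIONS = {
    ("invalid", "set_active"): "detached",     # slice state set active
    ("detached", "probe"): "attached",         # driver bus probe()
    ("attached", "set_inactive"): "detaching", # driver bus remove() starts
    ("detaching", "detach_done"): "detached",  # successful detach
    ("detaching", "pf_reset"): "detached",     # PF reset on the host
    ("detaching", "timeout"): "detaching",     # host did not detach yet
}


def replay(events, opstate="invalid"):
    """Apply a sequence of events and return the final opstate."""
    for event in events:
        key = (opstate, event)
        if key not in OP_TRANSITIONS:
            raise ValueError(f"event {event!r} not valid in {opstate!r}")
        opstate = OP_TRANSITIONS[key]
    return opstate
```

For example, replay(["set_active", "probe", "set_inactive", "timeout"])
leaves the slice in detaching, which is the case discussed in point (d)
below.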

More on the slice states:
(a) Primary reason to run slice life cycle this way is: a user must
    be able to create and configure a slice on eswitch system, before
    host system discovers the slice device. Hence, a slice
    device will not be accessible from the host system until it is
    explicitly activated by the user. Typically this is done after
    all necessary slice attributes are configured.
(b) A user wants to create and configure the slice on eswitch system
    when host system is in powered-down state.
(c) A slice interface for sub-functions should be uniform regardless
    of whether slice devices are deployed in eswitch system or host
    system.
(d) If the host system software where a slice is deployed is compromised
    and doesn't detach the slice, such slice remains in detaching
    state until PF driver is reset on the host. Such slice won't be
    usable until it is detached gracefully by host software.
    Forcefully changing its state and reusing it can lead to
    unexpected behavior and access of slice resources. Hence when
    opstate = detaching and state = inactive, such slice cannot be
    activated.
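
The rule in (d) amounts to a simple guard on the activate request. A minimal
sketch, assuming the state and opstate strings shown in this RFC; the
function name can_activate() is hypothetical and only illustrates the check:

```python
def can_activate(state: str, opstate: str) -> bool:
    """Return True if a 'state active' request should be accepted."""
    if state != "inactive":
        # Per the admin state diagram, only an inactive slice is activated.
        return False
    # A slice still detaching from a (possibly compromised) host must not
    # be reused until the detach completes or the PF is reset.
    return opstate != "detaching"
```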

SF sysfs, virtbus and devlink plumbing:
---------------------------------------
(a) Given that a sub-function device may have PCI BAR resource(s), power
    management, IRQ configuration, persistent naming etc, a clear sysfs
    view like existing PCI device is desired.

    Therefore, an SF resides on a new bus called virtbus.
    This virtbus holds one or more SFs of a parent PCI device.

    Whenever user activates a slice, corresponding sub-function device
    is created on the virtbus and attached to the driver.

(b) Each SF has user defined unique number associated with it,
    called 'sfnum'. This sfnum is provided during SF slice creation
    time. Multiple uses of the sfnum are explained in detail below.
(c) An active SF slice has a unique 'struct device' anchored on
    virtbus. An SF is identified using unique name on the virtbus.
(d) An SF's device (sysfs) name is created using ida assigned by the
    virtbus core. such as,
    /sys/bus/virtbus/devices/mlx5_sf.100
    /sys/bus/virtbus/devices/mlx5_sf.2
(e) Scope of a sfnum is within the devlink instance that supports SF
    lifecycle and SF devlink port lifecycle.
    This sfnum is populated in /sys/bus/virtbus/devices/100/sfnum
(f) Persistent name of Netdevice and RDMA device of a virtbus SF
    device is prepared using parent device of SF and sfnum of an SF.
    Such as,
    /sys/bus/virtbus/devices/mlx5_sf.100 ->
      ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/100
    Netdev name for such SF = enp6s0f0s1
    Where <en><parent_dev naming scheme> <s for sf> <sfnum_in_decimal>
    This generates unique netdev name because parent is involved in the
    device naming.
    RDMA device persistent name will be done similarly such as
    'rocep6s0f0s1'.
(g) SF devlink instance name is prepared using SF's parent bus/device
    and sfnum. Such as, pci/0000:06:00.0%sf1.
    This scheme ensures that SF's devlink device name remains
    predictable using sfnum regardless of dynamic virtbus device
    name/index.
(h) Each virtbus SF device driver id is defined by virtbus core.
    This driver id is used to bind virtbus SF device with the SF
    driver which has matching driver-id provided during driver
    registration time.
    This is further visible via modpost tools too.
(i) A new devlink eswitch port flavour named 'pcisf' is introduced for
    PCI SF devices similar to existing flavour 'pcivf' for PCI VFs.
(j) devlink eswitch port for PCI SF is added/deleted whenever SF slice
    is added/deleted.
(k) SF representor netdev phys_port_name=pf0sf1
    Format is: pf<pf_number>sf<user_assigned_sfnum>
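
The naming rules in (f), (g) and (k) can be summarized with a few string
helpers. This is a sketch of the naming scheme only; the function names are
hypothetical, and the actual names would be produced by the kernel and by
udev/systemd rules, not by code like this.

```python
def sf_netdev_name(parent_netdev: str, sfnum: int) -> str:
    """<en><parent_dev naming scheme><s for sf><sfnum_in_decimal>."""
    return f"{parent_netdev}s{sfnum}"


def sf_rdma_name(parent_rdma: str, sfnum: int) -> str:
    """RDMA persistent name, built like the netdev name."""
    return f"{parent_rdma}s{sfnum}"


def sf_devlink_name(parent_handle: str, sfnum: int) -> str:
    """SF devlink instance name: parent bus/device plus sfnum."""
    return f"{parent_handle}%sf{sfnum}"


def sf_phys_port_name(pfnum: int, sfnum: int) -> str:
    """Representor phys_port_name: pf<pf_number>sf<user_assigned_sfnum>."""
    return f"pf{pfnum}sf{sfnum}"
```

Because every name is derived from the parent device and the user supplied
sfnum, none of them depends on the dynamic virtbus device index.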

Examples:
---------
- Add a slice of flavour SF with sf identifier 4 on eswitch system:

$ devlink slice add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 4 index 100

- Show a newly added slice on eswitch system:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "inactive",
            "opstate": "detached"
        }
    }
}

- Show eswitch port configuration for SF on eswitch system:
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0 flavour physical port 0
pci/0000:06:00.0/8000: type eth netdev ens2f0sf1 flavour pcisf pfnum 0 sfnum 4
Here eswitch port index 8000 is assigned by the vendor driver.

- Activate SF slice to trigger creating a virtbus SF device in host system:
$ devlink slice set pci/0000:06:00.0/100 state active

- Query SF slice state on eswitch system:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "active",
            "opstate": "detached"
        }
    }
}

- Query SF slice state on eswitch system once driver is loaded on SF:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "active",
            "opstate": "attached"
        }
    }
}

- Show devlink instances on host system:
$ devlink dev show
pci/0000:06:00.0
pci/0000:06:00.0%sf4

- Show devlink ports on host system:
$ devlink port show
pci/0000:06:00.0/0: flavour physical type eth netdev enp6s0f0
pci/0000:06:00.0%sf4/0: flavour virtual type eth netdev enp6s0f0s10

- Mark a slice inactive, when slice may be in active state:
$ devlink slice set pci/0000:06:00.0/100 state inactive

- Query SF slice state on eswitch system once inactivation is triggered:
Output when it is in detaching state:

$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "inactive",
            "opstate": "detaching"
        }
    }
}

Output once detached:
$ devlink slice show pci/0000:06:00.0/100 -jp
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "inactive",
            "opstate": "detached"
        }
    }
}

- Delete the recently inactivated slice:
$ devlink slice del pci/0000:06:00.0/100
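
For scripting the sequence in these examples, the command lines and the JSON
output can be handled as in the sketch below. Note the assumptions: the
'devlink slice' object is what this RFC proposes and is not available in
existing devlink tools, the helper names are made up, and the parser assumes
that -jp emits strict JSON with the "state"/"opstate" keys used in this
mail. The sketch only builds argument lists and parses a sample string; it
does not execute anything.

```python
import json


def slice_add_cmd(dev, flavour, pfnum, sfnum, index):
    """argv for the proposed 'devlink slice add' command."""
    return ["devlink", "slice", "add", dev, "flavour", flavour,
            "pfnum", str(pfnum), "sfnum", str(sfnum), "index", str(index)]


def slice_set_state_cmd(handle, state):
    """argv for the proposed 'devlink slice set ... state' command."""
    return ["devlink", "slice", "set", handle, "state", state]


def slice_states(show_output, handle):
    """Return (state, opstate) from 'devlink slice show <handle> -jp'."""
    info = json.loads(show_output)["slice"][handle]
    return info["state"], info["opstate"]


# Sample output modelled on the examples in this mail.
SAMPLE = """
{
    "slice": {
        "pci/0000:06:00.0/100": {
            "flavour": "pcisf",
            "pfnum": 0,
            "sfnum": 4,
            "state": "active",
            "opstate": "attached"
        }
    }
}
"""

if __name__ == "__main__":
    print(" ".join(slice_add_cmd("pci/0000:06:00.0", "pcisf", 0, 4, 100)))
    print(slice_states(SAMPLE, "pci/0000:06:00.0/100"))
```

An orchestration agent on the eswitch system could poll slice_states() after
the set command and treat opstate "attached" as the point where the host
driver has bound to the slice device.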

FAQs:
-----
1. What are the differences between PCI VF slice and PCI SF slice?
Ans:
(a) PCI SF slice lifecycle is driven by the user at individual slice
    level.
    PCI VF slice(s) are created and destroyed by the vendor driver
    such as mlx5_core. Typically, this is done when SR-IOV is
    enabled/disabled.
(b) PCI VF slice state cannot be changed by the user as this is currently
    not supported by PCI core and vendor devices. They are always
    created in active state.
    PCI SF slice state is controlled by the user.

2. What are the similarities between PCI VF and PCI SF slice?
Ans:
(a) Both slices have similar config attributes.
(b) Both slices have eswitch devlink port and representor netdevice.

3. What about handling VF slice similar to SF slice?
Ans:
   Yes, this is desired. When a VF vendor device supports this mode,
   there will be at least one difference between SF and VF
   slice handling, i.e. SRIOV enablement will continue using sysfs on
   host system. Once SRIOV is enabled, VF slice commands should
   function in a similar way as SF slice.

4. What are the similarities between SF slice and PF slice?
Ans:
(a) Both slices have similar config attributes.
(b) Both slices have eswitch devlink port and representor netdevice.

5. Can slice be used to have dynamic PF as slice?
Ans:
   Yes. Whenever a vendor device supports dynamic PF, the same lifecycle,
   APIs and attributes can be used for PF slice.

6. Can slice exist without an eswitch?
Ans: Yes

7. Why not overload devlink port instead of creating new object slice?
Ans:
   In one implementation, a devlink port can be overloaded/retrofitted
   to achieve what a slice object wants to achieve.
   Same reasoning can be applied to overload and retrofit netdevice to
   achieve what a slice object wants to achieve.
   However, it is more natural to create a new object (slice) that
   represents a device for below rationale.
   (a) Even though a devlink slice has devlink port attached to it,
       it is narrow model to always have such association. It limits
       the devlink to be used only with eswitch.
   (b) slice undergoes state transitions from
       create->config->activate->inactivate->delete.
       It is weird to have a few port flavours follow state transitions
       while others don't.

8. Why is a bus needed?
Ans:
(a) To get unique, persistent and deterministic names of netdev,
    rdma dev of slice/sf.
(b) device lifecycle operates using similar pci bus and current
    driver model. No need to invent new lifecycle scheme.
(c) To follow uniform device config model for VF and SFs, where
    user must be able to configure the attributes before binding a
    VF/SF to driver.
(d) In future, if needed a virtbus SF device can be bound to some
    other driver similar to how a PCI PF/VF device can be bound to
    mlx5_core or vfio_pci device.
(e) Reuse kernel's existing power management framework to
    suspend/resume SF devices.
(f) When using SFs in smartnic based system where SF eswitch port and
    SF slice are located on two different systems, the user wants to
    configure eswitch before activating the SF slice device.
A bus allows following the existing driver model for the above needs as,
create->configure_multiple_attributes->deploy.

9. Why not further extend (or abuse) mdev bus?
Ans: Because
(a) mdev and vfio are coupled together from iommu perspective
(b) mdev shouldn't be abused further in its current state,
    as suggested by Greg-KH and Jason Gunthorpe.
(c) One bus for "sw_api" purpose, hence the new virtbus.
(d) Some don't like the weird guid/uuid of mdev for sub-function purpose
(e) If needed to map a SF to VM using mdev, new mdev slice flavour
    should be created for lifecycling via devlink or maybe continue
    life cycle via mdev sysfs; but do remaining slice handling via
    devlink similar to how PCI VF slices are managed.

10. Is it ok to reuse virtbus used for matching service?
Ans: Yes.
Greg-KH guided in [1] that it's ok for virtbus to hold devices with
different attributes. Such as matching services vs SF devices where
SF device will have
(a) PCI BAR resource info
(b) IRQ info
(c) sfnum

Greg-KH also guided in [1] that it's ok to anchor netdev and rdma dev
of an SF device to the SF virtbus device while rdma and netdev of a matching
service virtbus device are anchored at the parent PCI device.

[1] https://www.spinics.net/lists/linux-rdma/msg87124.html

11. Why are the platform and mfd frameworks not used for creating slices?
Ans:
(a) As platform documentation clearly describes, platform devices
    are for a specific platform. They are autonomous devices, unlike
    user created slices.
(b) slices are dynamic in nature where each individual slice is
    created/destroyed independently.
(c) MFD (multi-function) devices are for a device that comprises
    more than one non-unique yet varying hardware functionality,
    while here each slice is of the same type as the parent device,
    with fewer capabilities than the parent device. Given that mfd
    devices are built on top of platform devices, they inherit similar
    limitations as platform devices.
(d) In few other threads Greg-KH said to not (ab)use platform
    devices for such purpose.
(e) Given that for networking related slices, a slice is linked to
    an eswitch which is managed by devlink, having the lifecycle go
    through devlink confines locking/synchronization to a single
    subsystem.

12. A given slice supports multiple classes of devices, such as RDMA,
   netdev, vdpa device. How can I disable/enable such attributes of
   a slice?
Ans:
   Slice attributes should be extended in vendor neutral and also
   vendor specific way to enable/disable such attributes depending
   on attribute type. This should be handled in future patches.

virtbus limitations:
--------------------
1. virtbus bus will not support IOMMU like how pci bus does.
   Hence, all protocol devices (such as netdev, rdmadev) must use their
   parent's DMA device.
   RDMA subsystem will use ib_device->dma_device with existing ib_dma_
   wrapper.
   netdev will use mlx5_core_dev->pci_dev for dma purposes.
   This is suggested/hinted by Christoph and Jason.

   Why is it this way?
   (a) Because currently only rdma and netdev intend to use it.
   (b) In future if more use cases arise, virtbus device can share DMAR
       and group of same parent PCI device for in-kernel usecase.

2. Dedicated or shared irq vector(s) assignment per sub function and
   its exposure in sysfs will not be supported in initial version.
   It will be supported in future series.

3. PCI BAR resource information as resource files in sysfs will not be
   supported in initial version.
   It will be supported in future series.

Post initial series:
--------------------
Following plumbing will be done post initial series.

1. Max number of SF slices resource at devlink level
2. Max IRQ vectors resource config at devlink level
3. sf lifecycle for kernel vDPA support, still under discussion
4. systemd/udev user space patches for persistent naming for
   netdev and rdma

Out-of-scope:
-------------
1. Creating mdev/vhost backend/vdpa devices using devlink slice APIs.
   To support vhost backend 'struct device' creation, SF slice
   create/set API can be extended to enable/disable specific
   capabilities of the slice. Such as enable/disable kernel vDPA
   feature or enable/disable RDMA device for a SF slice.
2. Similar extension is applicable for a VF slice.
3. Multi host support is orthogonal to this and it can be extended
   in future.

Example software/system view:
-----------------------------
       _______
      | admin |
      | user  |----------
      |_______|         |
          |             |
      ____|____       __|______            _____________
     |         |     |         |          |             |
     | devlink |     |   ovs   |          |    user     |
     | tool    |     |_________|          | application |
     |_________|         |                |_____________|
           |             |                   |       |
-----------|-------------|-------------------|-------|-----------
           |             |           +----------+   +----------+
           |             |           |  netdev  |   | rdma dev |
           |             |           +----------+   +----------+
      (slice cmds,       |              ^             ^
       add/del/set)      |              |             |
           |             |              +-------------|
      _____|___          |              |         ____|________
     |         |         |              |        |             |
     | devlink |   +------------+       |        | mlx5_core/ib|
     | kernel  |   | rep netdev |       |        | drivers     |
     |_________|   +------------+       |        |_____________|
           |             |              |             ^
     (slice cmds)        |              |        (probe/remove)
      _____|____         |              |         ____|_____
     |          |        |    +--------------+   |          |
     | mlx5_core|---------    | virtbus dev  |---|  virtbus |
     | driver   |             +--------------+   |  driver  |
     |__________|                                |__________|
           |                                          ^
      (sf add/del, events)                            |
           |                                   (device add/del)
      _____|____                                  ____|_____
     |          |                                |          |
     |  PCI NIC |---- admin activate/deactivate  | mlx5_core|
     |__________|           deactivate events -->| driver   |
                                                 |__________|

