netdev - RE: [net-next v4 00/15] Add mlx5 subfunction support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BY5PR12MB43221CE397D6310F2B04D9B4DCC60@BY5PR12MB4322.namprd12.prod.outlook.com>
Date:   Tue, 15 Dec 2020 05:48:25 +0000
From:   Parav Pandit <parav@...dia.com>
To:     Alexander Duyck <alexander.duyck@...il.com>,
        Saeed Mahameed <saeed@...nel.org>
CC:     "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Jason Gunthorpe <jgg@...dia.com>,
        Leon Romanovsky <leonro@...dia.com>,
        Netdev <netdev@...r.kernel.org>,
        "linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
        David Ahern <dsahern@...nel.org>,
        Jacob Keller <jacob.e.keller@...el.com>,
        Sridhar Samudrala <sridhar.samudrala@...el.com>,
        "Ertman, David M" <david.m.ertman@...el.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Kiran Patil <kiran.patil@...el.com>,
        Greg KH <gregkh@...uxfoundation.org>
Subject: RE: [net-next v4 00/15] Add mlx5 subfunction support


> From: Alexander Duyck <alexander.duyck@...il.com>
> Sent: Tuesday, December 15, 2020 7:24 AM
> 
> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@...nel.org>
> wrote:
> >
> > Hi Dave, Jakub, Jason,
> >
> 
> Just to clarify a few things for myself. You mention virtualization and SR-IOV
> in your patch description but you cannot support direct assignment with this
> correct? 
Correct. it cannot be directly assigned.

> The idea here is simply logical partitioning of an existing network
> interface, correct? 
No. Idea is to spawn multiple functions from a single PCI device.
These functions are not born in PCI device and in OS until they are created by user.
Jason and Saeed explained this in great detail few weeks back in v0 version of the patchset at [1], [2] and [3].
I better not repeat all of it here again. Please go through it.
If you may want to read precursor to it, RFC from Jiri at [4] is also explains this in great detail.

> So this isn't so much a solution for virtualization, but may
> work better for containers. I view this as an important distinction to make as
> the first thing that came to mind when I read this was mediated devices
> which is similar, but focused only on the virtualization case:
> https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-
> device.html
>
Managing subfunction using medicated device is already ruled out last year at [5] as it is the abuse of the mdev bus for this purpose + has severe limitations of managing the subfunction device.
We are not going back to it anymore.
It will be duplicating lot of the plumbing which exists in devlink, netlink, auxiliary bus and more.
 
> Rather than calling this a subfunction, would it make more sense to call it
> something such as a queue set? 
No, queue is just one way to send and receive data/packets.
Jason and Saeed explained and discussed  this piece to you and others during v0 few weeks back at [1], [2], [3].
Please take a look.

> So in terms of ways to go I would argue this is likely better. However one
> downside is that we are going to end up seeing each subfunction being
> different from driver to driver and vendor to vendor which I would argue
> was also one of the problems with SR-IOV as you end up with a bit of vendor
> lock-in as a result of this feature since each vendor will be providing a
> different interface.
>
Each and several vendors provided unified interface for managing VFs. i.e.
(a) enable/disable was via vendor neutral sysfs
(b) sriov capability exposed via standard pci capability and sysfs
(c) sriov vf config (mac, vlan, rss, tx rate, spoof check trust) are using vendor agnostic netlink
Even though the driver's internal implementation largely differs on how trust, spoof, mac, vlan rate etc are enforced.

So subfunction feature/attribute/functionality will be implemented differently internally in the driver matching vendor's device, for reasonably abstract concept of 'subfunction'.

> > A Subfunction supports eswitch representation through which it
> > supports tc offloads. User must configure eswitch to send/receive
> > packets from/to subfunction port.
> >
> > Subfunctions share PCI level resources such as PCI MSI-X IRQs with
> > their other subfunctions and/or with its parent PCI function.
> 
> This piece to the architecture for this has me somewhat concerned. If all your
> resources are shared and 
All resources are not shared.

> you are allowing devices to be created
> incrementally you either have to pre-partition the entire function which
> usually results in limited resources for your base setup, or free resources
> from existing interfaces and redistribute them as things change. I would be
> curious which approach you are taking here? So for example if you hit a
> certain threshold will you need to reset the port and rebalance the IRQs
> between the various functions?
No. Its works bit differently for mlx5 device.
When base function is started, it started as if it doesn't have any subfunctions.
When subfunction is instantiated, it spawns new resources in device (hw, fw, memory) depending on how much a function wants.

For example, PCI PF uses BAR 0, while subfunctions uses BAR 2.
For IRQs, subfunction instance shares the IRQ with its parent/hosting PCI PF.
In future, yes, a dedicated IRQs per SF is likely desired.
Sridhar also talked about limiting number of queues to a subfunction.
I believe there will be resources/attributes of the function to be controlled.
devlink already provides rich interface to achieve that using devlink resources [8].

[..]

> > $ ip link show
> > 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state
> DOWN mode DEFAULT group default qlen 1000
> >     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
> >     altname enp6s0f0np0
> > 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
> mode DEFAULT group default qlen 1000
> >     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff>
> 
> I assume that p0sf88 is supposed to be the newly created subfunction.
> However I thought the naming was supposed to be the same as what you are
> referring to in the devlink, or did I miss something?
>
I believe you are confused with the representor netdevice of subfuction with devices of subfunction. (netdev, rdma, vdpa etc).
I suggest that please refer to the diagram in patch_15 in [7] to see the stack, modules, objects.
Hope below description clarifies a bit.
There are two netdevices.
(a) representor netdevice, attached to the devlink port of the eswitch
(b) netdevice of the SF used by the end application (in your example, this is assigned to container).
 
Both netdevice follow obviously a different naming scheme.
Representor netdevice follows naming scheme well defined in kernel + systemd/udev v245 and higher.
It is based on phys_port_name sysfs attribute.
This is same for existing PF and SF representors exist for year+ now. Further used by subfunction.

For subfunction netdevice (p0s88), system/udev will be extended. I put example based on my few lines of udev rule that reads
phys_port_name and user supplied sfnum, so that user exactly knows which interface to assign to container.

> > After use inactivate the function:
> > $ devlink port function set ens2f0npf0sf88 state inactive
> >
> > Now delete the subfunction port:
> > $ devlink port del ens2f0npf0sf88
> 
> This seems wrong to me as it breaks the symmetry with the port add
> command and
Example of the representor device is only to make life easier for the user.
Devlink port del command works based on the devlink port index, just like existing devlink port commands (get,set,split,unsplit).
I explained this in a thread with Sridhar at [6].
In short devlink port del <bus/device_name/port_index command is just fine.
Port index is unique handle for the devlink instance that user refers to delete, get, set port and port function attributes post its creation.
I choose the representor netdev example because it is more intuitive to related to, but port index is equally fine and supported.

> assumes you have ownership of the interface in the host. I
> would much prefer to to see the same arguments that were passed to the
> add command being used to do the teardown as that would allow for the
> parent function to create the object, assign it to a container namespace, and
> not need to pull it back in order to destroy it.
Parent function will not have same netdevice name as that of representor netdevice, because both devices exist in single system for large part of the use cases.
So port delete command works on the port index.
Host doesn't need to pull it back to destroy it. It is destroyed via port del command.

[1] https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/
[2] https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
[3] https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/
[4] https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/
[5] https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/
[6] https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/
[7] https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u
[8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html