Message-ID: <20201215210319.GC552508@nvidia.com>
Date:   Tue, 15 Dec 2020 17:03:19 -0400
From:   Jason Gunthorpe <jgg@...dia.com>
To:     Alexander Duyck <alexander.duyck@...il.com>
CC:     Parav Pandit <parav@...dia.com>, Saeed Mahameed <saeed@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        "Leon Romanovsky" <leonro@...dia.com>,
        Netdev <netdev@...r.kernel.org>,
        "linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
        David Ahern <dsahern@...nel.org>,
        Jacob Keller <jacob.e.keller@...el.com>,
        "Sridhar Samudrala" <sridhar.samudrala@...el.com>,
        "Ertman, David M" <david.m.ertman@...el.com>,
        Dan Williams <dan.j.williams@...el.com>,
        "Kiran Patil" <kiran.patil@...el.com>,
        Greg KH <gregkh@...uxfoundation.org>
Subject: Re: [net-next v4 00/15] Add mlx5 subfunction support

On Tue, Dec 15, 2020 at 10:47:36AM -0800, Alexander Duyck wrote:

> > Jason and Saeed explained this in great detail a few weeks back in the v0 version of the patchset at [1], [2] and [3].
> > I'd better not repeat all of it here again. Please go through it.
> > If you want to read a precursor to it, the RFC from Jiri at [4] also explains this in great detail.
> 
> I think I have a pretty good idea of how the feature works. My concern
> is more the use of marketing speak versus actual functionality. The
> way this is being set up, it sounds like it is useful for virtualization
> and it is not, at least in its current state. It may be at some point
> in the future but I worry that it is really going to muddy the waters
> as we end up with yet another way to partition devices.

If we do a virtualization version then it will take a SF and instead
of loading a mlx5_core on the SF aux device, we will load some
vfio_mdev_mlx5 driver which will convert the SF aux device into a
/dev/vfio/*

This is essentially the same as how you'd take a PCI VF and replace
mlx5_core with vfio-pci to get /dev/vfio/*. It has to be a special
mdev driver because it sits on the SF aux device, not on the VF PCI
device.
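
To make that concrete, here is a rough sketch of the shape such a
driver could take. This is not the actual proposed code: the function
names, the "mlx5_core.sf" match string, and the driver name are all
illustrative assumptions, and the vfio registration work is elided to
a comment.

/*
 * Hypothetical sketch only - not the real vfio_mdev_mlx5. It shows an
 * auxiliary driver binding to the SF aux device in place of
 * mlx5_core, which is the whole trick described above.
 */
#include <linux/module.h>
#include <linux/auxiliary_bus.h>

static int vfio_mdev_mlx5_probe(struct auxiliary_device *adev,
				const struct auxiliary_device_id *id)
{
	/*
	 * Instead of creating netdev/RDMA devices the way mlx5_core
	 * does, take the SF's queues, doorbell pages and IRQs and
	 * register them with vfio so userspace gets a /dev/vfio/*.
	 */
	return 0;
}

static void vfio_mdev_mlx5_remove(struct auxiliary_device *adev)
{
	/* Tear down the vfio device created in probe. */
}

static const struct auxiliary_device_id vfio_mdev_mlx5_id_table[] = {
	{ .name = "mlx5_core.sf" },	/* assumed SF aux device name */
	{}
};
MODULE_DEVICE_TABLE(auxiliary, vfio_mdev_mlx5_id_table);

static struct auxiliary_driver vfio_mdev_mlx5_driver = {
	.name = "vfio_mdev_mlx5",
	.probe = vfio_mdev_mlx5_probe,
	.remove = vfio_mdev_mlx5_remove,
	.id_table = vfio_mdev_mlx5_id_table,
};
module_auxiliary_driver(vfio_mdev_mlx5_driver);

MODULE_LICENSE("GPL");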

The vfio_mdev_mlx5 driver will create what Intel calls a SIOV ADI
from the SF; in other words, the SF is already a superset of what a
SIOV ADI should be.

This matches the Linux driver model very nicely, and I don't think it
becomes more muddied as we go along. If anything, it is becoming
clearer and saner as things progress.

> I agree with you on that. My thought was more the fact that the two
> can be easily confused. If we are going to do this we need to define
> that, for networking devices, using the mdev interface would be
> deprecated and we would need to go through devlink instead. However
> before we do that we need to make sure we have this completely
> standardized.

mdev is for creating /dev/vfio/* interfaces in userspace. Using it for
anything else is a bad abuse of the driver model.

We had this debate endlessly already.

AFAIK, there is nothing to deprecate, there are no mdev_drivers in
drivers/net, and none should ever be added. The only mdev_driver that
should ever exist is the one in vfio_mdev.c

If someone is using an mdev_driver in drivers/net out of tree, then
they will need to convert it to an aux driver to go in-tree.

> Yeah, I recall that. However I feel like it is being oversold. It
> isn't "SR-IOV done right"; it seems more like "VMDq done better". The
> fact that interrupts are shared between the subfunctions is telling.

The interrupt sharing is a consequence of having an ADI-like model
without relying on IMS. Once IMS works, shared interrupts won't be
necessary; until then there is no choice but to share the MSI table
of the function.

> That is exactly how things work for Intel parts when they do VMDq as
> well. The queues are split up into pools and a block of queues belongs
> to a specific pool. From what I can tell the only difference is that
> there is isolation of the pool into specific pages in the BAR. Which
> is essentially a requirement for mediated devices so that they can be
> directly assigned.

No. As I said to Jakub, mlx5 SFs have very little to do with queues.
There is no 'queue' HW element that needs partitioning.

The SF is a hardware security boundary that wraps every operation a
mlx5 device can do. This is why it is an ADI. It is not a crappy ADI
that relies on hypervisor emulation; it is the real thing, just like
an SRIOV VF. You stick it in the VM and the guest can talk directly
to the HW. The HW provides the security.

I can't stress this enough: A mlx5 SF can run a *full RDMA
stack*. This means the driver can create all the RDMA HW objects and
resources under the SF. This is *not* just steering some ethernet
traffic to a few different ethernet queues like VMDq is.

The Intel analog to a SF is a *full virtual function* on one of the
Intel iWarp capable NICs, not VMDq.
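
As a concrete illustration of the difference, here is a hedged
userspace sketch (not part of this series; it only uses standard
libibverbs calls): an SF running the RDMA stack simply shows up as
one more RDMA device, and the guest can allocate real verbs objects
under it.

/*
 * Enumerate RDMA devices and create a verbs object on the first one.
 * Build with: gcc sf_verbs.c -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	int num;
	struct ibv_device **list = ibv_get_device_list(&num);

	if (!list)
		return 1;

	for (int i = 0; i < num; i++)
		printf("RDMA device: %s\n", ibv_get_device_name(list[i]));

	if (num > 0) {
		struct ibv_context *ctx = ibv_open_device(list[0]);
		/*
		 * A PD is a real RDMA HW object, not an ethernet
		 * queue - this is the part a queue-splitting scheme
		 * like VMDq cannot offer.
		 */
		struct ibv_pd *pd = ibv_alloc_pd(ctx);

		ibv_dealloc_pd(pd);
		ibv_close_device(ctx);
	}
	ibv_free_device_list(list);
	return 0;
}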

> Assuming at some point one of the flavours is a virtio-net style
> interface, you could eventually get to something similar to what
> seems to have been the goal of mdev, which was meant to address
> these two points.

mlx5 already supports VDPA virtio-net on PF/VF, and with this series
on SF too.

i.e. you can take a SF, bind the vdpa_mlx5 driver, and get a fully HW
accelerated "ADI" that does virtio-net. This can be assigned to a
guest and shows up as a PCI virtio-net netdev. With VT-d, guest packet
tx/rx on this netdev never uses the hypervisor CPU.
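
Mechanically, "bind the vdpa_mlx5 driver" is just the usual
driver-core unbind/bind dance on the auxiliary bus. A hedged sketch;
the device name "mlx5_core.sf.1" and both driver directory names
below are assumptions for illustration, not the actual sysfs names:

/* Rebind an SF aux device from the default driver to a vdpa one. */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	const char *sf = "mlx5_core.sf.1";	/* assumed SF device name */

	/* Detach the default mlx5 SF driver ... */
	write_str("/sys/bus/auxiliary/drivers/mlx5_core.sf/unbind", sf);
	/* ... and attach the hypothetical vdpa driver instead. */
	write_str("/sys/bus/auxiliary/drivers/mlx5_vdpa.sf/bind", sf);
	return 0;
}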

> The point is that we should probably define some sort of standard
> and/or expectations on what should happen when you spawn a new
> interface. Would it be acceptable for the PF and existing subfunctions
> to have to reset if you need to rebalance the IRQ distribution, or
> should they not be disrupted when you spawn a new interface?

It is best to think of the SF as an ADI, so if you change something in
the PF and that causes the driver attached to the ADI in a VM to
reset, is that OK? I'd say no.

Jason
