netdev - Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191109004426.GB31761@ziepe.ca>
Date:   Fri, 8 Nov 2019 20:44:26 -0400
From:   Jason Gunthorpe <jgg@...pe.ca>
To:     Jakub Kicinski <jakub.kicinski@...ronome.com>
Cc:     Parav Pandit <parav@...lanox.com>, Jiri Pirko <jiri@...nulli.us>,
        David M <david.m.ertman@...el.com>,
        "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
        "alex.williamson@...hat.com" <alex.williamson@...hat.com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Saeed Mahameed <saeedm@...lanox.com>,
        "kwankhede@...dia.com" <kwankhede@...dia.com>,
        "leon@...nel.org" <leon@...nel.org>,
        "cohuck@...hat.com" <cohuck@...hat.com>,
        Jiri Pirko <jiri@...lanox.com>,
        "linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
        Or Gerlitz <gerlitz.or@...il.com>
Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support

On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote:

> > IMHO, mdev has amdev_parent_ops structure clearly intended to link it
> > to vfio, so using a mdev for something not related to vfio seems like
> > a poor choice.
> 
> Yes, my suggestion to use mdev was entirely based on the premise that
> the purpose of this work is to get vfio working.. otherwise I'm unclear
> as to why we'd need a bus in the first place. If this is just for
> containers - we have macvlan offload for years now, with no need for a
> separate device.

This SF thing is a full fledged VF function, it is not at all like
macvlan. This is perhaps less important for the netdev part of the
world, but the difference is very big for the RDMA side, and should
enable VFIO too..
 
> On the RDMA/Intel front, would you mind explaining what the main
> motivation for the special buses is? I'm a little confurious.

Well, the issue is driver binding. For years we have had these
multi-function netdev drivers that have a single PCI device which must
bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb
drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I
expect when NVMe over TCP rolls out we will have drivers like cxgb4
binding to 6 subsytems in total!]

Today most of this is a big hack where the PCI device binds to the
netdev driver and then the other drivers in different subsystems
'discover' that an appropriate netdev is plugged in using various
unique, hacky and ugly means. For instance cxgb4 duplicates a chunk of
the device core, see cxgb4_register_uld() for example. Other drivers
try to use netdev notifiers, and various other wild things.

So, the general concept is to use the driver model to manage driver
binding. A multi-subsystem driver would have several parts:

- A pci_driver which binds to the pci_device (the core)
  It creates, on a bus, struct ??_device's for the other subsystems
  that this HW supports. ie if the chip supports netdev then a 
  ??_device that binds to the netdev driver is created, same for RDMA

- A ??_driver in netdev binds to the device and accesses the core API
- A ??_driver in RDMA binds to the device and accesses the core API
- A ??_driver in SCSI binds to the device and accesses the core API

Now the driver model directly handles all binding, autoloading,
discovery, etc, and 'netdev' is just another consumer of 'core'
functionality.

For something like mlx5 the 'core' is the stuff in
drivers/net/ethernet/mellanox/mlx5/core/*.c, give or take. It is
broadly generic stuff like send commands, create queues, manage HW
resources, etc.

There has been some lack of clarity on what the ?? should be. People
have proposed platform and MFD, and those seem to be no-goes. So, it
looks like ?? will be a mlx5_driver on a mlx5_bus, and Intel will use
an ice_driver on a ice_bus, ditto for cxgb4, if I understand Greg's
guidance.

Though I'm wondering if we should have a 'multi_subsystem_device' that
was really just about passing a 'void *core_handle' from the 'core'
(ie the bus) to the driver (ie RDMA, netdev, etc). 

It seems weakly defined, but also exactly what every driver doing this
needs.. It is basically what this series is abusing mdev to accomplish.

> My understanding is MFD was created to help with cases where single
> device has multiple pieces of common IP in it. 

MFD really seems to be good at splitting a device when the HW is
orthogonal at the register level. Ie you might have regs 100-200 for
ethernet and 200-300 for RDMA.

But this is not how modern HW works, the functional division is more
subtle and more software based. ie on most devices a netdev and rdma
queue are nearly the same, just a few settings make them function
differently.

So what is needed isn't a split of register set like MFD specializes
in, but a unique per-driver API between the 'core' and 'subsystem'
parts of the multi-subsystem device.

> Do modern RDMA cards really share IP across generations? 

What is a generation? Mellanox has had a stable RDMA driver across
many sillicon generations. Intel looks like their new driver will
support at least the last two or more sillicon generations..

RDMA drivers are monstrous complex things, there is a big incentive to
not respin them every time a new chip comes out.

> Is there a need to reload the drivers for the separate pieces (I
> wonder if the devlink reload doesn't belong to the device model :().

Yes, it is already done, but without driver model support the only way
to reload the rdma driver is to unload the entire module as there is
no 'unbind'

Jason