Message-ID: <BYAPR11MB30952BA538BB905331392A08D9119@BYAPR11MB3095.namprd11.prod.outlook.com>
Date: Fri, 16 Jul 2021 01:04:23 +0000
From: "Chen, Mike Ximing" <mike.ximing.chen@...el.com>
To: Greg KH <gregkh@...uxfoundation.org>,
"Williams, Dan J" <dan.j.williams@...el.com>
CC: Netdev <netdev@...r.kernel.org>,
David Miller <davem@...emloft.net>,
"Jakub Kicinski" <kuba@...nel.org>, Arnd Bergmann <arnd@...db.de>,
"Pierre-Louis Bossart" <pierre-louis.bossart@...ux.intel.com>,
"Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
KVM list <kvm@...r.kernel.org>,
"Raj, Ashok" <ashok.raj@...el.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v10 00/20] dlb: introduce DLB device driver
> -----Original Message-----
> From: Greg KH <gregkh@...uxfoundation.org>
> Sent: Friday, May 14, 2021 10:33 AM
> To: Williams, Dan J <dan.j.williams@...el.com>
> > Hi Greg,
> >
> > So, for the last few weeks Mike and company have patiently waded
> > through my questions and now I think we are at a point to work through
> > the upstream driver architecture options and tradeoffs. You were not
> > alone in struggling to understand what this device does because it is
> > unlike any other accelerator Linux has ever considered. It shards /
> > load balances a data stream for processing by CPU threads. This is
> > typically a network appliance function / protocol, but could also be
> > any other generic thread pool like the kernel's padata. It saves the
> > CPU cycles spent load balancing work items and marshaling them through
> > a thread pool pipeline. For example, in DPDK applications, DLB2 frees
> > up entire cores that would otherwise be consumed with scheduling and
> > work distribution. A separate proof-of-concept, using DLB2 to
> > accelerate the kernel's "padata" thread pool for a crypto workload,
> > demonstrated ~150% higher throughput with hardware employed to manage
> > work distribution and result ordering. Yes, you need a sufficiently
> > high touch / high throughput protocol before the software load
> > balancing overhead coordinating CPU threads starts to dominate the
> > performance, but there are some specific workloads willing to switch
> > to this regime.
> >
> > The primary consumer to date has been as a backend for the event
> > handling in the userspace networking stack, DPDK. DLB2 has an existing
> > polled-mode-userspace driver for that use case. So I said, "great,
> > just add more features to that userspace driver and you're done". In
> > fact there was DLB1 hardware that also had a polled-mode-userspace
> > driver. So, the next question is "what's changed in DLB2 where a
> > userspace driver is no longer suitable?". The new use case for DLB2 is
> > new hardware support for a host driver to carve up device resources
> > into smaller sets (vfio-mdevs) that can be assigned to guests (Intel
> > calls this new hardware capability SIOV: Scalable IO Virtualization).
> >
> > Hardware resource management is difficult to handle in userspace
> > especially when bare-metal hardware events need to coordinate with
> > guest-VM device instances. This includes a mailbox interface for the
> > guest VM to negotiate resources with the host driver. Another more
> > practical roadblock for a "DLB2 in userspace" proposal is the fact
> > that it implements what are in-effect software-defined-interrupts to
> > go beyond the scalability limits of PCI MSI-x (Intel calls this
> > Interrupt Message Store: IMS). So even if hardware resource management
> > was awkwardly plumbed into a userspace daemon there would still need
> > to be kernel enabling for device-specific extensions to
> > drivers/vfio/pci/vfio_pci_intrs.c for it to understand the IMS
> > interrupts of DLB2 in addition to PCI MSI-x.
> >
> > While that still might be solvable in userspace if you squint at it, I
> > don't think Linux end users are served by pushing all of hardware
> > resource management to userspace. VFIO is mostly built to pass entire
> > PCI devices to guests, or in coordination with a kernel driver to
> > describe a subset of the hardware to a virtual-device (vfio-mdev)
> > interface. The rub here is that to date kernel drivers using VFIO to
> > provision mdevs have some existing responsibilities to the core kernel
> > like a network driver or DMA offload driver. The DLB2 driver offers no
> > such service to the kernel for its primary role of accelerating a
> > userspace data-plane. I am assuming here that the padata
> > proof-of-concept is interesting, but not a compelling reason to ship a
> > driver compared to giving end users competent kernel-driven
> > hardware-resource assignment for deploying DLB2 virtual instances into
> > guest VMs.
> >
> > My "just continue in userspace" suggestion has no answer for the IMS
> > interrupt and reliable hardware resource management support
> > requirements. If you're with me so far we can go deeper into the
> > details, but in answer to your previous questions most of the TLAs
> > were from the land of "SIOV" where the VFIO community should be
> > brought in to review. The driver is mostly a configuration plane where
> > the fast path data-plane is entirely in userspace. That configuration
> > plane needs to manage hardware events and resourcing on behalf of
> > guest VMs running on a partitioned subset of the device. There are
> > worthwhile questions about whether some of the uapi can be refactored
> > to common modules like uacce, but I think we need to get to a first
> > order understanding on what DLB2 is and why the kernel has a role
> > before diving into the uapi discussion.
> >
> > Any clearer?
>
> A bit, yes, thanks.
>
> > So, in summary drivers/misc/ appears to be the first stop in the
> > review since a host driver needs to be established to start the VFIO
> > enabling campaign. With my community hat on, I think requiring
> > standalone host drivers is healthier for Linux than broaching the
> > subject of VFIO-only drivers. Even if, as in this case, the initial
> > host driver is mostly implementing a capability that could be achieved
> > with a userspace driver.
>
> Ok, then how about a much "smaller" kernel driver for all of this, and a whole lot of documentation to
> describe what is going on and what all of the TLAs are.
>
> thanks,
>
> greg k-h
Hi Greg,
tl;dr: We have been looking into various options to reduce the kernel driver size and ABI surface, such as moving more responsibility to user space, reusing existing kernel modules (uacce, for example), and converting functionality from ioctl to sysfs. The end result: 10 ioctls will be replaced by sysfs, and the remaining 20 ioctls will be replaced by configfs. Some concepts will move to device-special files rather than ioctls that produce file descriptors.
Details:
We investigated the possibility of using uacce (https://www.kernel.org/doc/html/latest/misc-devices/uacce.html) in our kernel driver. The uacce interface fits well with accelerators that process user data with known source and destination addresses. For a DLB (Dynamic Load Balancer), however, the destination port depends on the system load and is unknown to the application. While uacce exposes "queues" to the user, the dlb driver has to handle much more complicated resource management, covering credits, ports, queues and domains. We would have to add many more concepts and a lot more code to uacce, none of it useful for other accelerators, to make it work for DLB. This may also lead to a larger code size overall.
We also took another look at moving resource management functionality from kernel space to user space. Much of the kernel driver supports both the PF (Physical Function) on the host and VFs (Virtual Functions) in VMs. Since only the PF on the host has permission to set up resources and configure the DLB HW, all requests from VFs are forwarded to the PF via the VF-PF mailboxes, which are handled by the kernel driver. The driver also maintains various virtual-ID-to-physical-ID translations (for VFs, ports, queues, etc.), and provides the virtual-to-physical ID mapping info to the DLB HW so that an application in a VM can access the resources with virtual IDs only. Because of the VF/VDEV support, we have to keep the resource management, which is more than one half of the code size, in the driver.
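
As a rough illustration of the virtual-to-physical ID translation described above, here is a minimal sketch. It is not taken from the dlb driver: the structure, the function names, and the per-VF limit of 8 IDs are all hypothetical, and the real driver tracks far more state (credits, domains, mailbox requests).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VIRT_IDS 8      /* hypothetical per-VF limit, for illustration only */
#define ID_UNMAPPED  (-1)

/*
 * Hypothetical per-VF table: the index is the virtual ID the guest sees,
 * the value is the physical ID of the resource on the device.
 */
struct vf_id_map {
	int phys[MAX_VIRT_IDS];
};

/* Mark every virtual ID as unmapped. */
void vf_id_map_init(struct vf_id_map *m)
{
	assert(m != NULL);
	for (size_t i = 0; i < MAX_VIRT_IDS; i++)
		m->phys[i] = ID_UNMAPPED;
}

/*
 * Bind the lowest free virtual ID to phys_id.
 * Returns the virtual ID, or ID_UNMAPPED if the table is full.
 */
int vf_id_map_assign(struct vf_id_map *m, int phys_id)
{
	assert(m != NULL);
	for (int v = 0; v < MAX_VIRT_IDS; v++) {
		if (m->phys[v] == ID_UNMAPPED) {
			m->phys[v] = phys_id;
			return v;
		}
	}
	return ID_UNMAPPED;
}

/*
 * Translate a guest-visible virtual ID to a physical ID.
 * Returns ID_UNMAPPED for out-of-range or unassigned virtual IDs.
 */
int vf_id_map_to_phys(const struct vf_id_map *m, int virt_id)
{
	assert(m != NULL);
	if (virt_id < 0 || virt_id >= MAX_VIRT_IDS)
		return ID_UNMAPPED;
	return m->phys[virt_id];
}
```

In this sketch the guest only ever handles the small virtual IDs, and the host-side code owns the table, which is the reason the translation has to stay on the PF side.
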
To simplify the user interface, we explored ways to reduce or eliminate the ioctl interface, and found that we can use configfs for much of the DLB functionality. Our current plan is to replace all the ioctls in the driver with sysfs and configfs. We will use configfs for most of the setup and configuration for both the physical function and the virtual functions. This may not reduce the overall driver size greatly, but eliminating the ioctls will lessen much of the ABI maintenance burden. I hope this is in line with what you would like to see for the driver.
Thanks
Mike