Message-ID: <20180608113008.76cbf425@xeon-e3>
Date:   Fri, 8 Jun 2018 11:30:08 -0700
From:   Stephen Hemminger <stephen@...workplumber.org>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     Alexander Duyck <alexander.duyck@...il.com>,
        "Samudrala, Sridhar" <sridhar.samudrala@...el.com>,
        Jiri Pirko <jiri@...nulli.us>,
        KY Srinivasan <kys@...rosoft.com>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        David Miller <davem@...emloft.net>,
        Netdev <netdev@...r.kernel.org>,
        Stephen Hemminger <sthemmin@...rosoft.com>
Subject: Re: [PATCH net] failover: eliminate callback hell

On Thu, 7 Jun 2018 20:22:15 +0300
"Michael S. Tsirkin" <mst@...hat.com> wrote:

> On Thu, Jun 07, 2018 at 09:17:42AM -0700, Stephen Hemminger wrote:
> > On Thu, 7 Jun 2018 18:41:31 +0300
> > "Michael S. Tsirkin" <mst@...hat.com> wrote:
> >   
> > > > > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > > > > vfio-pci on the netdevs it sees?    
> > > > 
> > > > Alex, you are correct for Intel devices, but DPDK on Azure is not Intel based.
> > > > The DPDK support uses:
> > > >  * Mellanox MLX5, which uses the InfiniBand hooks to do DMA directly to
> > > >    userspace. This means the VF netdev device must exist and be visible.
> > > >  * A slow path using the kernel netvsc device, TAP and BPF to get exception
> > > >    path packets to userspace.
> > > >  * An autodiscovery mechanism to set all this up that relies on the
> > > >    2-device model and sysfs.
> > > 
> > > Could you describe what exactly it looks for? What will break if
> > > instead of MLX5 being a child of the PV, it's a child of the failover
> > > device?  
> > 
> > So in DPDK there is an internal four-device model:
> > 	1. failsafe is like failover in your model
> > 	2. TAP is used like netvsc in kernel
> > 	3. MLX5 is the VF
> > 	4. vdev_netvsc is a pseudo device whose only reason to exist
> > 	   is to glue everything together.
> > 
> > Digging deeper inside...
> > 
> > Vdev_netvsc does:
> >    * driver is started in a convoluted way based on device arguments
> >    * probe routine for driver runs
> >       - scans list of kernel interfaces in sysfs
> >       - matches those using VMBUS   
> 
> Could you tell a bit more about what this step entails?

Quick code high/low lights.


	/* From the driver probe routine: */
	ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, 1, name,
					kvargs, specified, &matched);

static int
vdev_netvsc_foreach_iface(int (*func)(const struct if_nameindex *iface,
				      const struct ether_addr *eth_addr,
				      va_list ap), int is_netvsc, ...)
{
	struct if_nameindex *iface = if_nameindex();
	/* ... */

	for (i = 0; iface[i].if_name; ++i) {
		/* ... */
		/* Skip interfaces whose netvsc status differs from is_netvsc. */
		is_netvsc_ret = vdev_netvsc_iface_is_netvsc(&iface[i]) ? 1 : 0;
		if (is_netvsc ^ is_netvsc_ret)
			continue;

		/* Read the interface MAC address. */
		strlcpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
		if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
			/* ... */
		}

		memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
		       RTE_DIM(eth_addr.addr_bytes));

		ret = func(&iface[i], &eth_addr, ap);  /* func is vdev_netvsc_netvsc_probe */
		/* ... */
	}
	/* ... */
}

static int
vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
			 const struct ether_addr *eth_addr,
			 va_list ap)
{
	/* ... */

	/* Routed NetVSC should not be probed. */
	if (vdev_netvsc_has_route(iface, AF_INET) ||
	    vdev_netvsc_has_route(iface, AF_INET6)) {
		if (!specified)
			return 0;
		DRV_LOG(WARNING, "probably using routed NetVSC interface \"%s\""
			" (index %u)", iface->if_name, iface->if_index);
	}
	/* Create interface context. */
	ctx = calloc(1, sizeof(*ctx));
	/* ... */
}
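The scan above can be reproduced outside DPDK with plain Linux calls. The following is a sketch under stated assumptions, not the DPDK implementation. It enumerates interfaces with if_nameindex(), classifies each one as netvsc or not, and reads its MAC address with the SIOCGIFHWADDR ioctl. It assumes that a netvsc interface can be recognized by reading /sys/class/net/<name>/device/class_id, which for a VMBus device holds the device class GUID, and comparing that value with the Hyper-V network class GUID. The helper names class_id_is_netvsc, iface_is_netvsc and scan_ifaces are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <net/if.h>
#include <net/if_arp.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hyper-V synthetic NIC class GUID, in the "{...}" form that sysfs is
 * assumed to print in a VMBus device's class_id attribute. */
#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"

/* Hypothetical helper: 1 if the file at path holds the netvsc class GUID. */
static int class_id_is_netvsc(const char *path)
{
	char buf[64] = "";
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return 0;	/* no VMBus device behind this interface */
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* drop the trailing newline */
	return strcmp(buf, NETVSC_CLASS_ID) == 0;
}

/* Hypothetical helper: classify an interface by its sysfs class_id. */
static int iface_is_netvsc(const char *name)
{
	char path[128];

	assert(name != NULL);
	snprintf(path, sizeof(path), "/sys/class/net/%s/device/class_id", name);
	return class_id_is_netvsc(path);
}

/* Hypothetical scan: call cb (if non-NULL) for each Ethernet interface
 * whose netvsc status matches want_netvsc. Returns the match count or -1. */
static int scan_ifaces(int want_netvsc,
		       void (*cb)(const char *name, const unsigned char *mac))
{
	struct if_nameindex *ifs = if_nameindex();
	int s = socket(AF_INET, SOCK_DGRAM, 0);
	int count = 0;

	if (!ifs || s < 0) {
		if (ifs)
			if_freenameindex(ifs);
		if (s >= 0)
			close(s);
		return -1;
	}
	for (unsigned int i = 0; ifs[i].if_name; ++i) {
		struct ifreq req;

		if (iface_is_netvsc(ifs[i].if_name) != !!want_netvsc)
			continue;
		memset(&req, 0, sizeof(req));
		strncpy(req.ifr_name, ifs[i].if_name, IFNAMSIZ - 1);
		if (ioctl(s, SIOCGIFHWADDR, &req) == -1)
			continue;	/* interface vanished or cannot be queried */
		if (req.ifr_hwaddr.sa_family != ARPHRD_ETHER)
			continue;	/* skip loopback and non-Ethernet links */
		if (cb)
			cb(ifs[i].if_name,
			   (const unsigned char *)req.ifr_hwaddr.sa_data);
		++count;
	}
	if_freenameindex(ifs);
	close(s);
	return count;
}
```

Calling scan_ifaces(1, cb) from main() would invoke cb once for each netvsc interface that has an Ethernet MAC. Passing 0 instead selects the non-netvsc interfaces, which is where a VF carrying the same MAC would appear.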


> 
> >       - skip netvsc devices that have an IPv4 (or IPv6) route
> >    * scan for PCI devices that have the same MAC address as the kernel
> >      netvsc devices discovered in the previous step
> >    * add these interfaces to the failsafe arguments
> > 
> > Then failsafe configures itself based on the device arguments.
> > 
> > The code works but is specific to the Azure hardware model, and exposes lots
> > of things to the application that it should not have to care about.
> > 
> > If you try to walk through this code in DPDK, you will see why I have developed
> > a dislike for high levels of indirection.
> > 
> > 
> > 	     
> 
> Thanks, that was helpful! I'll try to poke at it next week. Just from
> the description it seems the kernel is merely used to locate the MAC
> address through sysfs, and that, for this DPDK code to keep working, the
> hidden device must be hidden from it in sysfs. Is that a fair summary?
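If the kernel's role really reduces to exposing MAC addresses, the matching step (pairing each netvsc interface with the VF that carries the same MAC) is small. The following is a minimal sketch, not DPDK code. It assumes MACs are read as text from /sys/class/net/<if>/address, which prints them as colon-separated hex bytes. The names struct iface_rec, parse_mac and pair_by_mac are hypothetical, and the first-match policy is a simplification.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one kernel interface: name plus MAC. */
struct iface_rec {
	const char *name;
	unsigned char mac[6];
};

/* Parse "aa:bb:cc:dd:ee:ff" (the format of /sys/class/net/<if>/address).
 * An optional trailing newline is accepted. Returns 0 or -1 if malformed. */
static int parse_mac(const char *s, unsigned char mac[6])
{
	unsigned int b[6];
	int n = 0;

	assert(s != NULL && mac != NULL);
	if (sscanf(s, "%2x:%2x:%2x:%2x:%2x:%2x%n",
		   &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &n) != 6)
		return -1;
	if (s[n] != '\0' && s[n] != '\n')
		return -1;	/* trailing garbage after the sixth byte */
	for (int i = 0; i < 6; ++i)
		mac[i] = (unsigned char)b[i];
	return 0;
}

/* For each netvsc interface, find the first VF with the same MAC.
 * vf_for[i] receives the index into vfs[], or -1 if nothing matched.
 * Returns the number of netvsc interfaces that found a VF. */
static int pair_by_mac(const struct iface_rec *netvsc, int n_netvsc,
		       const struct iface_rec *vfs, int n_vfs, int *vf_for)
{
	int paired = 0;

	for (int i = 0; i < n_netvsc; ++i) {
		vf_for[i] = -1;
		for (int j = 0; j < n_vfs; ++j) {
			if (memcmp(netvsc[i].mac, vfs[j].mac, 6) == 0) {
				vf_for[i] = j;
				++paired;
				break;
			}
		}
	}
	return paired;
}
```

A nested linear scan is adequate here because a guest typically has only a handful of interfaces. The point of the sketch is that nothing beyond the MAC is consulted, which is why the pairing is fragile if a hidden device shares that MAC.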

What is the point of the 3-device model? What value does it have
to userspace? How would userspace use each of the three devices?
Going back to the 3-device model really doesn't make sense to me if
there is no visible benefit.

Some other considerations:
   * there is ongoing development to support RDMA failover as
     well in netvsc.

   * there is a new driver which implements the VMBUS protocol
     in userspace for DPDK. This gets rid of several layers and
     removes any special scanning code. The vmbus device is
     unbound from netvsc and bound to a UIO device. Then the
     userspace DPDK driver manages all the host signalling events,
     including VF discovery. It is really the 2-device model done
     entirely in userspace. The kernel device is still needed when
     the VF is Mellanox, because that is how the MLX DPDK driver
     rolls.

   * what about nested KVM on Hyper-V? Would it make sense to
     have a way to pass a subset of VF queues to the guest?
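The rebinding step in the userspace VMBUS bullet (moving the device from netvsc to a UIO driver) can be sketched with the generic driver-core sysfs files new_id, unbind and bind. This is a hedged sketch. It assumes the kernel drivers are named hv_netvsc and uio_hv_generic, that uio_hv_generic is already loaded, and that the VMBus device UUID is the last component of the /sys/class/net/<if>/device symlink. The helper names write_attr, vmbus_dev_uuid and rebind_to_uio are hypothetical, and the real operation must run as root.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hyper-V network device class GUID in the plain form that new_id is
 * assumed to accept (no braces). */
#define NETVSC_CLASS_GUID "f8615163-df3e-46c5-913f-f2d2f965ed0e"

/* Write one value to a sysfs attribute. Returns 0 or a negative errno. */
static int write_attr(const char *path, const char *val)
{
	FILE *f;
	int ret = 0;

	assert(path != NULL && val != NULL);
	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fprintf(f, "%s\n", val) < 0)
		ret = -EIO;
	/* stdio buffers the write, so a driver rejecting the value is
	 * reported when fclose() flushes it. */
	if (fclose(f) != 0 && ret == 0)
		ret = -errno;
	return ret;
}

/* Hypothetical helper: copy the VMBus device UUID behind ifname (the last
 * component of the /sys/class/net/<ifname>/device symlink) into out. */
static int vmbus_dev_uuid(const char *ifname, char *out, size_t len)
{
	char path[128], target[256];
	const char *base;
	ssize_t n;

	assert(ifname != NULL && out != NULL && len > 0);
	snprintf(path, sizeof(path), "/sys/class/net/%s/device", ifname);
	n = readlink(path, target, sizeof(target) - 1);
	if (n < 0)
		return -errno;
	target[n] = '\0';	/* readlink does not NUL-terminate */
	base = strrchr(target, '/');
	base = base ? base + 1 : target;
	if (strlen(base) >= len)
		return -ENAMETOOLONG;
	memcpy(out, base, strlen(base) + 1);
	return 0;
}

/* Hypothetical: move one VMBus NIC from hv_netvsc to uio_hv_generic.
 * drivers_dir is normally "/sys/bus/vmbus/drivers". Returns 0 or the
 * negative errno of the first step that failed. */
static int rebind_to_uio(const char *drivers_dir, const char *dev_uuid)
{
	char path[256];
	int ret;

	assert(drivers_dir != NULL && dev_uuid != NULL);
	/* 1. Let uio_hv_generic claim devices of the network class. */
	snprintf(path, sizeof(path), "%s/uio_hv_generic/new_id", drivers_dir);
	ret = write_attr(path, NETVSC_CLASS_GUID);
	if (ret)
		return ret;
	/* 2. Detach the device from the kernel netvsc driver. */
	snprintf(path, sizeof(path), "%s/hv_netvsc/unbind", drivers_dir);
	ret = write_attr(path, dev_uuid);
	if (ret)
		return ret;
	/* 3. Attach it to the UIO driver for the userspace PMD to map. */
	snprintf(path, sizeof(path), "%s/uio_hv_generic/bind", drivers_dir);
	return write_attr(path, dev_uuid);
}
```

Because new_id likely rejects a class GUID the driver already knows, a second call on the same system would probably fail at step 1 with -EEXIST. A real tool would tolerate that error and continue with the unbind and bind steps.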
