Message-ID: <DFDF335405C17848924A094BC35766CF0A953308@SHSMSX104.ccr.corp.intel.com>
Date: Tue, 7 Oct 2014 15:21:15 +0000
From: "Zhou, Danny" <danny.zhou@...el.com>
To: Willem de Bruijn <willemb@...gle.com>,
John Fastabend <john.fastabend@...il.com>
CC: Daniel Borkmann <dborkman@...hat.com>,
Florian Westphal <fw@...len.de>,
"gerlitz.or@...il.com" <gerlitz.or@...il.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Network Development <netdev@...r.kernel.org>,
"Ronciak, John" <john.ronciak@...el.com>,
Amir Vadai <amirv@...lanox.com>,
Eric Dumazet <eric.dumazet@...il.com>
Subject: RE: [net-next PATCH v1 1/3] net: sched: af_packet support for
direct ring access
> -----Original Message-----
> From: Willem de Bruijn [mailto:willemb@...gle.com]
> Sent: Tuesday, October 07, 2014 12:24 PM
> To: John Fastabend
> Cc: Daniel Borkmann; Florian Westphal; gerlitz.or@...il.com; Hannes Frederic Sowa; Network Development; Ronciak, John; Amir
> Vadai; Eric Dumazet; Zhou, Danny
> Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
>
> > Supporting some way to steer traffic to a queue
> > is the _only_ hardware requirement to support the interface,
>
> I would not impose this constraint. There may be legitimate use
> cases for taking over all queues of a device. For instance, when
> this is a secondary nic that does not carry any control traffic.
For a secondary NIC that carries only data plane traffic, you can use UIO or VFIO
to map the NIC's entire I/O space to user space. Then a user space poll-mode driver,
like the ones supported and open-sourced in DPDK, or the closed-source ones that support
Mellanox/Emulex NICs, can drive the NIC as the sole driver in user space.
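For illustration, a minimal sketch of that UIO path, assuming the NIC is bound to uio_pci_generic (or
DPDK's igb_uio) and shows up as /dev/uio0 with BAR0 exposed as UIO map 0; the BAR size here is just an
example and would normally be read from /sys/class/uio/uio0/maps/map0/size:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const size_t bar0_size = 512 * 1024;    /* example; read from sysfs */
        int fd = open("/dev/uio0", O_RDWR);

        /* UIO convention: mapping N is selected by mmap offset N * page_size */
        void *bar0 = mmap(NULL, bar0_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0 * getpagesize());

        /* the poll-mode driver then accesses NIC registers through bar0 */
        (void)bar0;
        return 0;
}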
>
> > Typically in an af_packet interface a packet_type handler is
> > registered and used to filter traffic to the socket and do other
> > things such as fan out traffic to multiple sockets. In this case the
> > networking stack is being bypassed so this code is not run. So the
> > hardware must push the correct traffic to the queues obtained from
> > the ndo callback ndo_split_queue_pairs().
>
> Why does the interface work at the level of queue_pairs instead of
> individual queues?
The user mode "slave" driver (I call it a slave driver because it is only responsible for packet I/O
on certain queue pairs) needs to take over at least one rx queue and one tx queue, for ingress and
egress traffic respectively, even though the flow director only applies to ingress traffic.
>
> > /* Get the layout of ring space offset, page_sz, cnt */
> > getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> > &info, &optlen);
> >
> > /* request some queues from the driver */
> > setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> > &qpairs_info, sizeof(qpairs_info));
> >
> > /* if we let the driver pick us queues learn which queues
> > * we were given
> > */
> > getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> > &qpairs_info, sizeof(qpairs_info));
>
> If ethtool -U is used to steer traffic to a specific descriptor queue,
> then the setsockopt can pass the exact id of that queue and there
> is no need for a getsockopt follow-up.
Very good point. The interface supports passing "-1" as the queue id (followed by the number of qpairs
needed) via setsockopt to af_packet and the NIC kernel driver, asking the driver to dynamically allocate
free and available qpairs for this socket; in that case getsockopt() is needed to return the queue pair
indexes that were actually assigned. Initially, we had an implementation that called getsockopt once and
had af_packet treat qpairs_info as an IN/OUT parameter, but that is semantically wrong, so we think the
implementation above is most suitable. But I agree with you: if setsockopt passes the exact id of a valid
queue pair, there is no need to call getsockopt. A sketch of the "-1" flow is below.
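To make the "-1" case concrete, here is a fragment in the same style as the snippet quoted above, assuming
fd is an AF_PACKET socket and guessing at the qpairs_info layout (the real struct is defined by the patch):

#include <sys/socket.h>         /* setsockopt/getsockopt, socklen_t */

/* hypothetical layout; the real struct comes from the proposed patch */
struct qpairs_info {
        unsigned int qpairs_start_from;         /* (unsigned)-1: driver picks */
        unsigned int qpairs_num;                /* number of queue pairs wanted */
};

struct qpairs_info qpairs_info = {
        .qpairs_start_from = (unsigned int)-1,  /* let the driver choose */
        .qpairs_num = 1,
};
socklen_t optlen = sizeof(qpairs_info);

/* ask af_packet/the NIC driver to allocate any free, available queue pair */
setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
           &qpairs_info, sizeof(qpairs_info));

/* learn which queue pair indexes were actually assigned */
getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
           &qpairs_info, &optlen);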
>
> > /* And mmap queue pairs to user space */
> > mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> > MAP_SHARED, fd, 0);
>
> How will packet data be mapped and how will userspace translate
> from paddr to vaddr? Is the goal to maintain long lived mappings
> and instruct drivers to allocate from this restricted range (to
> avoid per-packet system calls and vma operations)?
>
Once the qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
completely in user space, using paddrs (in the case of DPDK, the vaddrs of the huge pages DPDK uses are
translated to paddrs) to fill in the packet descriptors; a sketch of that translation follows below.
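For reference, a sketch (mine, not from the patch) of the usual vaddr-to-paddr translation through
/proc/self/pagemap that DPDK-style drivers rely on; it needs root and assumes the page is present and
pinned (e.g. a huge page):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static uint64_t virt_to_phys(const void *vaddr)
{
        long pgsz = sysconf(_SC_PAGESIZE);
        uint64_t entry = 0;
        int fd = open("/proc/self/pagemap", O_RDONLY);

        /* one 8-byte entry per virtual page */
        pread(fd, &entry, sizeof(entry),
              ((uintptr_t)vaddr / pgsz) * sizeof(entry));
        close(fd);

        if (!(entry & (1ULL << 63)))            /* bit 63: page present */
                return 0;
        /* bits 0-54: page frame number */
        return (entry & ((1ULL << 55) - 1)) * pgsz
               + (uintptr_t)vaddr % pgsz;
}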
As for the security concern raised in the previous discussion, the reason we think (BTW, correct me if I am
wrong) af_packet is most suitable is that only a user application with root permission is allowed to split
off queue pairs and mmap a small window of PCIe I/O space into user space, so the concern that the "device
can DMA from/to any arbitrary physical memory" is not that big. All user space device drivers based on the
UIO mechanism have the same issue; VFIO adds protection, but it depends on an IOMMU (such as Intel VT-d),
which is specific to particular silicon.
> For throughput-oriented workloads, the syscall overhead
> involved in kicking the nic (on tx, or for increasing the ring
> consumer index on rx) can be amortized. And the operation
> can perhaps piggy-back on interrupts or other events
> (as long as interrupts are not disabled for full userspace
> polling). Latency would be harder to satisfy while maintaining
> some kernel policy enforcement. An extreme solution
> uses an asynchronously busy polling kernel worker thread
> (at high cycle cost, so acceptable for few workloads).
>
> When keeping the kernel in the loop, it is possible to do
> some basic sanity checking and transparently translate between
> vaddr and paddr, even when exposing the hardware descriptors
> directly. Though at this point it may be just as cheap to expose
> an idealized virtualized descriptor format and copy fields between
> that and device descriptors.
>
> One assumption underlying exposing the hardware descriptors
> is that they are quite similar between devices. How true is this
> in the context of formats that span multiple descriptors?
>
The packet descriptor format varies across vendors. On Intel NICs, the 1G/10G/40G parts have totally
different formats, and even the same Intel 10G/40G NIC supports at least two different descriptor formats.
IMHO, the idea behind these patches is to sidestep the descriptor differences among devices: the kernel just
maps certain I/O space pages to user space, and the user space "slave" NIC driver handles them with a
different descriptor struct chosen by vendor/device ID (see the sketch below). But I am open to adding
support for a generic packet descriptor format description, per David M's suggestion.
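Purely for illustration, the kind of dispatch I have in mind on the slave driver side; the device IDs and
names are examples, not anything defined by the patch:

#include <stdint.h>
#include <stddef.h>

enum desc_family { DESC_FAMILY_IXGBE, DESC_FAMILY_I40E, DESC_FAMILY_UNKNOWN };

/* pick a descriptor layout from the PCI IDs read out of
 * /sys/bus/pci/devices/<bdf>/vendor and .../device */
static enum desc_family desc_family_of(uint16_t vendor, uint16_t device)
{
        static const struct {
                uint16_t device;
                enum desc_family family;
        } tbl[] = {
                { 0x10fb, DESC_FAMILY_IXGBE },  /* e.g. 82599 10G SFP+ */
                { 0x1572, DESC_FAMILY_I40E },   /* e.g. X710/XL710 family */
        };
        size_t i;

        if (vendor != 0x8086)                   /* Intel */
                return DESC_FAMILY_UNKNOWN;
        for (i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++)
                if (tbl[i].device == device)
                        return tbl[i].family;
        return DESC_FAMILY_UNKNOWN;
}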
> > + * int (*ndo_split_queue_pairs) (struct net_device *dev,
> > + * unsigned int qpairs_start_from,
> > + * unsigned int qpairs_num,
> > + * struct sock *sk)
> > + * Called to request a set of queues from the driver to be
> > + * handed to the callee for management. After this returns the
> > + * driver will not use the queues.
>
> Are these queues also taken out of ethtool management, or is
> this equivalent to removing them from the rss set with
> ethtool -X?
As the master driver, the NIC kernel driver still controls the flow director as the ethtool backend. Generally,
not all queues are initialized and used by the NIC kernel driver, which reports only the actually used rx/tx
queue counts to the stack. Before certain queues are split off, trying to use ethtool to direct traffic to those
unused queues results in an invalid argument error. Once certain stack-unaware queues have been allocated to a
user space slave driver, ethtool allows directing packets to them, as the NIC driver maintains a data structure
recording which queues are visible to and used by the kernel and which are used by user space. A sketch of
inserting such a steering rule programmatically is below.
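For completeness, a sketch of the programmatic equivalent of "ethtool -N/-U ... action <queue>", inserting a
flow director rule whose action is one of the split-off rx queues (interface name, port and queue index are
examples):

#include <string.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* steer TCP/IPv4 traffic with the given destination port to rx queue
 * 'queue' on interface 'ifname' */
static int steer_tcp_dport_to_queue(const char *ifname, uint16_t dport,
                                    uint32_t queue)
{
        struct ethtool_rxnfc nfc;
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&nfc, 0, sizeof(nfc));
        nfc.cmd = ETHTOOL_SRXCLSRLINS;          /* insert classification rule */
        nfc.fs.flow_type = TCP_V4_FLOW;
        nfc.fs.h_u.tcp_ip4_spec.pdst = htons(dport);
        nfc.fs.m_u.tcp_ip4_spec.pdst = 0xffff;  /* match the whole dst port */
        nfc.fs.ring_cookie = queue;             /* rx queue owned by user space */
        nfc.fs.location = RX_CLS_LOC_ANY;       /* let the driver pick a slot */

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&nfc;

        return ioctl(fd, SIOCETHTOOL, &ifr);
}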