Message-ID: <DFDF335405C17848924A094BC35766CF0A953308@SHSMSX104.ccr.corp.intel.com>
Date: Tue, 7 Oct 2014 15:21:15 +0000
From: "Zhou, Danny" <danny.zhou@...el.com>
To: Willem de Bruijn <willemb@...gle.com>,
John Fastabend <john.fastabend@...il.com>
CC: Daniel Borkmann <dborkman@...hat.com>,
Florian Westphal <fw@...len.de>,
"gerlitz.or@...il.com" <gerlitz.or@...il.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Network Development <netdev@...r.kernel.org>,
"Ronciak, John" <john.ronciak@...el.com>,
Amir Vadai <amirv@...lanox.com>,
Eric Dumazet <eric.dumazet@...il.com>
Subject: RE: [net-next PATCH v1 1/3] net: sched: af_packet support for
direct ring access
> -----Original Message-----
> From: Willem de Bruijn [mailto:willemb@...gle.com]
> Sent: Tuesday, October 07, 2014 12:24 PM
> To: John Fastabend
> Cc: Daniel Borkmann; Florian Westphal; gerlitz.or@...il.com; Hannes Frederic Sowa; Network Development; Ronciak, John; Amir
> Vadai; Eric Dumazet; Zhou, Danny
> Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
>
> > Supporting some way to steer traffic to a queue
> > is the _only_ hardware requirement to support the interface,
>
> I would not impose this constraint. There may be legitimate use
> cases for taking over all queues of a device. For instance, when
> this is a secondary nic that does not carry any control traffic.
For a secondary NIC that carries only data plane traffic, you can use UIO or VFIO
to map the NIC's entire I/O space to user space. Then a user space poll-mode driver,
like the ones supported and open-sourced in DPDK, or the closed-source ones that support
Mellanox/Emulex NICs, can drive the NIC as the sole driver in user space.
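For illustration, a minimal sketch of that UIO path, assuming the NIC is bound to uio_pci_generic (or
DPDK's igb_uio) and shows up as /dev/uio0 with BAR0 exposed as UIO map 0; the BAR size here is just an
example and would normally be read from /sys/class/uio/uio0/maps/map0/size:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const size_t bar0_size = 512 * 1024;    /* example; read from sysfs */
        int fd = open("/dev/uio0", O_RDWR);

        /* UIO convention: mapping N is selected by mmap offset N * page_size */
        void *bar0 = mmap(NULL, bar0_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0 * getpagesize());

        /* the poll-mode driver then accesses NIC registers through bar0 */
        (void)bar0;
        return 0;
}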
>
> > Typically in an af_packet interface a packet_type handler is
> > registered and used to filter traffic to the socket and do other
> > things such as fan out traffic to multiple sockets. In this case the
> > networking stack is being bypassed so this code is not run. So the
> > hardware must push the correct traffic to the queues obtained from
> > the ndo callback ndo_split_queue_pairs().
>
> Why does the interface work at the level of queue_pairs instead of
> individual queues?
The user mode "slave" driver (I call it a slave driver because it is only responsible for packet I/O
on certain queue pairs) needs to take over at least one rx queue and one tx queue, for ingress and
egress traffic respectively, even though the flow director only applies to ingress traffic.
>
> > /* Get the layout of ring space offset, page_sz, cnt */
> > getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> > &info, &optlen);
> >
> > /* request some queues from the driver */
> > setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> > &qpairs_info, sizeof(qpairs_info));
> >
> > /* if we let the driver pick us queues learn which queues
> > * we were given
> > */
> > getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> > &qpairs_info, sizeof(qpairs_info));
>
> If ethtool -U is used to steer traffic to a specific descriptor queue,
> then the setsockopt can pass the exact id of that queue and there
> is no need for a getsockopt follow-up.
Very good point. The interface supports passing "-1" as the queue id (followed by the number of qpairs
needed) via setsockopt to af_packet and the NIC kernel driver, asking the driver to dynamically allocate
free and available qpairs for this socket; in that case getsockopt() is needed to return the queue pair
indexes that were actually assigned. Initially, we had an implementation that called getsockopt once and
had af_packet treat qpairs_info as an IN/OUT parameter, but that is semantically wrong, so we think the
implementation above is most suitable. But I agree with you: if setsockopt passes the exact id of a valid
queue pair, there is no need to call getsockopt. A sketch of the "-1" flow is below.
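To make the "-1" case concrete, here is a fragment in the same style as the snippet quoted above, assuming
fd is an AF_PACKET socket and guessing at the qpairs_info layout (the real struct is defined by the patch):

#include <sys/socket.h>         /* setsockopt/getsockopt, socklen_t */

/* hypothetical layout; the real struct comes from the proposed patch */
struct qpairs_info {
        unsigned int qpairs_start_from;         /* (unsigned)-1: driver picks */
        unsigned int qpairs_num;                /* number of queue pairs wanted */
};

struct qpairs_info qpairs_info = {
        .qpairs_start_from = (unsigned int)-1,  /* let the driver choose */
        .qpairs_num = 1,
};
socklen_t optlen = sizeof(qpairs_info);

/* ask af_packet/the NIC driver to allocate any free, available queue pair */
setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
           &qpairs_info, sizeof(qpairs_info));

/* learn which queue pair indexes were actually assigned */
getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
           &qpairs_info, &optlen);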
>
> > /* And mmap queue pairs to user space */
> > mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> > MAP_SHARED, fd, 0);
>
> How will packet data be mapped and how will userspace translate
> from paddr to vaddr? Is the goal to maintain long lived mappings
> and instruct drivers to allocate from this restricted range (to
> avoid per-packet system calls and vma operations)?
>
Once the qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
completely in user space, using paddrs (in the case of DPDK, the vaddrs of the huge pages DPDK uses are
translated to paddrs) to fill in the packet descriptors; a sketch of that translation follows below.
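For reference, a sketch (mine, not from the patch) of the usual vaddr-to-paddr translation through
/proc/self/pagemap that DPDK-style drivers rely on; it needs root and assumes the page is present and
pinned (e.g. a huge page):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static uint64_t virt_to_phys(const void *vaddr)
{
        long pgsz = sysconf(_SC_PAGESIZE);
        uint64_t entry = 0;
        int fd = open("/proc/self/pagemap", O_RDONLY);

        /* one 8-byte entry per virtual page */
        pread(fd, &entry, sizeof(entry),
              ((uintptr_t)vaddr / pgsz) * sizeof(entry));
        close(fd);

        if (!(entry & (1ULL << 63)))            /* bit 63: page present */
                return 0;
        /* bits 0-54: page frame number */
        return (entry & ((1ULL << 55) - 1)) * pgsz
               + (uintptr_t)vaddr % pgsz;
}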
As for the security concern raised in the previous discussion, the reason we think (BTW, correct me if I am
wrong) af_packet is most suitable is that only a user application with root permission is allowed to split
off queue pairs and mmap a small window of PCIe I/O space into user space, so the concern that the "device
can DMA from/to any arbitrary physical memory" is not that big. All user space device drivers based on the
UIO mechanism have the same issue; VFIO adds protection, but it depends on an IOMMU (such as Intel VT-d),
which is specific to particular silicon.
> For throughput-oriented workloads, the syscall overhead
> involved in kicking the nic (on tx, or for increasing the ring
> consumer index on rx) can be amortized. And the operation
> can perhaps piggy-back on interrupts or other events
> (as long as interrupts are not disabled for full userspace
> polling). Latency would be harder to satisfy while maintaining
> some kernel policy enforcement. An extreme solution
> uses an asynchronously busy polling kernel worker thread
> (at high cycle cost, so acceptable for few workloads).
>
> When keeping the kernel in the loop, it is possible to do
> some basic sanity checking and transparently translate between
> vaddr and paddr, even when exposing the hardware descriptors
> directly. Though at this point it may be just as cheap to expose
> an idealized virtualized descriptor format and copy fields between
> that and device descriptors.
>
> One assumption underlying exposing the hardware descriptors
> is that they are quite similar between devices. How true is this
> in the context of formats that span multiple descriptors?
>
The packet descriptor format varies across vendors. On Intel NICs, the 1G/10G/40G parts have totally
different formats, and even the same Intel 10G/40G NIC supports at least two different descriptor formats.
IMHO, the idea behind these patches is to sidestep the descriptor differences among devices: the kernel just
maps certain I/O space pages to user space, and the user space "slave" NIC driver handles them with a
different descriptor struct chosen by vendor/device ID (see the sketch below). But I am open to adding
support for a generic packet descriptor format description, per David M's suggestion.
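Purely for illustration, the kind of dispatch I have in mind on the slave driver side; the device IDs and
names are examples, not anything defined by the patch:

#include <stdint.h>
#include <stddef.h>

enum desc_family { DESC_FAMILY_IXGBE, DESC_FAMILY_I40E, DESC_FAMILY_UNKNOWN };

/* pick a descriptor layout from the PCI IDs read out of
 * /sys/bus/pci/devices/<bdf>/vendor and .../device */
static enum desc_family desc_family_of(uint16_t vendor, uint16_t device)
{
        static const struct {
                uint16_t device;
                enum desc_family family;
        } tbl[] = {
                { 0x10fb, DESC_FAMILY_IXGBE },  /* e.g. 82599 10G SFP+ */
                { 0x1572, DESC_FAMILY_I40E },   /* e.g. X710/XL710 family */
        };
        size_t i;

        if (vendor != 0x8086)                   /* Intel */
                return DESC_FAMILY_UNKNOWN;
        for (i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++)
                if (tbl[i].device == device)
                        return tbl[i].family;
        return DESC_FAMILY_UNKNOWN;
}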
> > + * int (*ndo_split_queue_pairs) (struct net_device *dev,
> > + * unsigned int qpairs_start_from,
> > + * unsigned int qpairs_num,
> > + * struct sock *sk)
> > + * Called to request a set of queues from the driver to be
> > + * handed to the callee for management. After this returns the
> > + * driver will not use the queues.
>
> Are these queues also taken out of ethtool management, or is
> this equivalent to removing them from the rss set with
> ethtool -X?
As the master driver, the NIC kernel driver still controls the flow director as the ethtool backend. Generally,
not all queues are initialized and used by the NIC kernel driver, which reports only the actually used rx/tx
queue counts to the stack. Before certain queues are split off, trying to use ethtool to direct traffic to those
unused queues results in an invalid argument error. Once certain stack-unaware queues have been allocated to a
user space slave driver, ethtool allows directing packets to them, as the NIC driver maintains a data structure
recording which queues are visible to and used by the kernel and which are used by user space. A sketch of
inserting such a steering rule programmatically is below.
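For completeness, a sketch of the programmatic equivalent of "ethtool -N/-U ... action <queue>", inserting a
flow director rule whose action is one of the split-off rx queues (interface name, port and queue index are
examples):

#include <string.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* steer TCP/IPv4 traffic with the given destination port to rx queue
 * 'queue' on interface 'ifname' */
static int steer_tcp_dport_to_queue(const char *ifname, uint16_t dport,
                                    uint32_t queue)
{
        struct ethtool_rxnfc nfc;
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&nfc, 0, sizeof(nfc));
        nfc.cmd = ETHTOOL_SRXCLSRLINS;          /* insert classification rule */
        nfc.fs.flow_type = TCP_V4_FLOW;
        nfc.fs.h_u.tcp_ip4_spec.pdst = htons(dport);
        nfc.fs.m_u.tcp_ip4_spec.pdst = 0xffff;  /* match the whole dst port */
        nfc.fs.ring_cookie = queue;             /* rx queue owned by user space */
        nfc.fs.location = RX_CLS_LOC_ANY;       /* let the driver pick a slot */

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&nfc;

        return ioctl(fd, SIOCETHTOOL, &ifr);
}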