Message-ID: <CA+FuTSe=vo1-Xpk+318SNc-mCH_c0WQadXo3usiA_dRBNx_fEQ@mail.gmail.com>
Date: Tue, 7 Oct 2014 11:46:05 -0400
From: Willem de Bruijn <willemb@...gle.com>
To: "Zhou, Danny" <danny.zhou@...el.com>
Cc: John Fastabend <john.fastabend@...il.com>,
Daniel Borkmann <dborkman@...hat.com>,
Florian Westphal <fw@...len.de>,
"gerlitz.or@...il.com" <gerlitz.or@...il.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Network Development <netdev@...r.kernel.org>,
"Ronciak, John" <john.ronciak@...el.com>,
Amir Vadai <amirv@...lanox.com>,
Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct
ring access
>> > Typically in an af_packet interface a packet_type handler is
>> > registered and used to filter traffic to the socket and do other
>> > things such as fan out traffic to multiple sockets. In this case the
>> > networking stack is being bypassed so this code is not run. So the
>> > hardware must push the correct traffic to the queues obtained from
>> > the ndo callback ndo_split_queue_pairs().
>>
>> Why does the interface work at the level of queue_pairs instead of
>> individual queues?
>
> The user mode "slave" driver (I call it a slave driver because it is only responsible for packet I/O
> on certain queue pairs) needs to take over at least one rx queue and one tx queue, for ingress and
> egress traffic respectively, although the flow director only applies to ingress traffic.
That requirement of co-allocation is absent in existing packet
rings. Many applications only receive or transmit. For
receive-only, it would even be possible to map descriptor
rings read-only, if the kernel remains responsible for posting
buffers -- but I see below that that is not the case, so that's
not very relevant here.
Still, some workloads want asymmetric sets of rx and tx rings.
For instance, instead of using RSS, a process may want to
receive on as few rings as possible, load balance across
workers in software, but still give each worker thread its own
private transmit ring.
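For illustration only, an asymmetric interface could look something
like the sketch below. PACKET_RX_QUEUES_SPLIT, PACKET_TX_QUEUES_SPLIT
and the struct are hypothetical names; the patch only defines the
paired variant:

    /* Request one rx ring to receive on, and four tx rings, one per
     * worker thread; -1 asks the driver to pick free queues. */
    struct tpacket_queues_req rx = { .queue_start = -1, .queue_cnt = 1 };
    struct tpacket_queues_req tx = { .queue_start = -1, .queue_cnt = 4 };

    setsockopt(fd, SOL_PACKET, PACKET_RX_QUEUES_SPLIT, &rx, sizeof(rx));
    setsockopt(fd, SOL_PACKET, PACKET_TX_QUEUES_SPLIT, &tx, sizeof(tx));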
>>
>> > /* Get the layout of ring space offset, page_sz, cnt */
>> > getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>> > &info, &optlen);
>> >
>> > /* request some queues from the driver */
>> > setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> > &qpairs_info, sizeof(qpairs_info));
>> >
>> > /* if we let the driver pick us queues learn which queues
>> > * we were given
>> > */
>> > getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> > &qpairs_info, sizeof(qpairs_info));
>>
>> If ethtool -U is used to steer traffic to a specific descriptor queue,
>> then the setsockopt can pass the exact id of that queue and there
>> is no need for a getsockopt follow-up.
>
> Very good point. It supports passing "-1" as the queue id (followed by the number of qpairs needed) via
> setsockopt to af_packet and the NIC kernel driver, to ask the driver to dynamically allocate free and
> available qpairs for this socket, so getsockopt() is needed to return the actually assigned queue pair indexes.
> Initially, we had an implementation that called getsockopt once and af_packet treated qpairs_info
> as an IN/OUT parameter, but that is semantically wrong, so we think the above implementation is most suitable.
> But I agree with you: if setsockopt can pass the exact id of a valid queue pair, there is no need
> to call getsockopt.
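Right. With explicit steering the whole flow collapses to a single
setsockopt. As a rough sketch (the qpairs_info field names below are
assumed, since the patch excerpt does not show the struct layout):

    /* Steer a flow to queue 4 first, e.g. with flow director:
     *   ethtool -U eth0 flow-type udp4 dst-port 5000 action 4
     * then request exactly that queue pair. No getsockopt follow-up
     * is needed because the queue id is already known. */
    qpairs_info.qpairs_start = 4;   /* hypothetical field name */
    qpairs_info.num_qpairs   = 1;   /* hypothetical field name */
    setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
               &qpairs_info, sizeof(qpairs_info));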
One step further would be to move the entire configuration behind
the packet socket interface. It's perhaps out of scope of this patch,
but the difference between using `ethtool -U` and passing the same
expression through the packet socket is that in the latter case the
kernel can automatically roll back the configuration change when the
process dies.
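Concretely, that could look something like the sketch below. To be
clear, PACKET_FLOW_STEER and struct packet_flow_spec are invented
names for illustration; nothing like this exists in the patch:

    /* Install the steering rule through the socket itself, so its
     * lifetime is tied to the socket: when the process dies and the
     * socket is closed, the kernel removes the rule and returns the
     * queues, with no stale ethtool state left behind. */
    struct packet_flow_spec flow = {
            .proto    = IPPROTO_UDP,
            .dst_port = htons(5000),
            .queue    = 4,
    };
    setsockopt(fd, SOL_PACKET, PACKET_FLOW_STEER, &flow, sizeof(flow));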
>
>>
>> > /* And mmap queue pairs to user space */
>> > mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>> > MAP_SHARED, fd, 0);
>>
>> How will packet data be mapped and how will userspace translate
>> from paddr to vaddr? Is the goal to maintain long lived mappings
>> and instruct drivers to allocate from this restricted range (to
>> avoid per-packet system calls and vma operations)?
>>
>
> Once the qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
> completely in user space, using paddrs (in the case of DPDK, the vaddrs of the huge pages DPDK uses
> are translated to paddrs) to fill in the packet descriptors.
Ah, userspace is responsible for posting buffers, and translation
from vaddr to paddr is straightforward. Yes, that makes sense.
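For what it's worth, the translation userspace needs is just a
/proc/self/pagemap lookup, which is essentially what DPDK does for
its hugepage memory. A minimal sketch:

    /* Translate a virtual address to a physical address by reading
     * the 64-bit pagemap entry for its page: bit 63 means present,
     * bits 0-54 hold the page frame number. Reading PFNs may require
     * CAP_SYS_ADMIN on recent kernels. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    static uint64_t vaddr_to_paddr(const void *vaddr)
    {
            long psz = sysconf(_SC_PAGESIZE);
            uint64_t entry;
            int fd = open("/proc/self/pagemap", O_RDONLY);

            if (fd < 0)
                    return 0;
            if (pread(fd, &entry, sizeof(entry),
                      (uintptr_t)vaddr / psz * sizeof(entry)) != sizeof(entry))
                    entry = 0;
            close(fd);
            if (!(entry & (1ULL << 63)))
                    return 0;       /* page not present */
            return (entry & ((1ULL << 55) - 1)) * psz +
                   (uintptr_t)vaddr % psz;
    }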