Message-ID: <54340CEB.204@intel.com>
Date:	Tue, 07 Oct 2014 08:55:23 -0700
From:	John Fastabend <john.r.fastabend@...el.com>
To:	Willem de Bruijn <willemb@...gle.com>,
	"Zhou, Danny" <danny.zhou@...el.com>
CC:	John Fastabend <john.fastabend@...il.com>,
	Daniel Borkmann <dborkman@...hat.com>,
	Florian Westphal <fw@...len.de>,
	"gerlitz.or@...il.com" <gerlitz.or@...il.com>,
	Hannes Frederic Sowa <hannes@...essinduktion.org>,
	Network Development <netdev@...r.kernel.org>,
	"Ronciak, John" <john.ronciak@...el.com>,
	Amir Vadai <amirv@...lanox.com>,
	Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct
 ring access

On 10/07/2014 08:46 AM, Willem de Bruijn wrote:
>>>> Typically in an af_packet interface a packet_type handler is
>>>> registered and used to filter traffic to the socket and do other
>>>> things such as fan out traffic to multiple sockets. In this case the
>>>> networking stack is being bypassed so this code is not run. So the
>>>> hardware must push the correct traffic to the queues obtained from
>>>> the ndo callback ndo_split_queue_pairs().
>>>
>>> Why does the interface work at the level of queue_pairs instead of
>>> individual queues?
>>
>> The user mode "slave" driver (I call it a slave driver because it is only responsible for packet I/O
>> on certain queue pairs) needs to take over at least one rx queue and one tx queue for ingress and
>> egress traffic respectively, although the flow director only applies to ingress traffic.
> 
> That requirement of co-allocation is absent in existing packet
> rings. Many applications only receive or transmit. For
> receive-only, it would even be possible to map descriptor
> rings read-only, if the kernel remains responsible for posting
> buffers -- but I see below that that is not the case, so that's
> not very relevant here.
> 
> Still, some workloads want asymmetric sets of rx and tx rings.
> For instance, instead of using RSS, a process may want to
> receive on as few rings as possible, load balance across
> workers in software, but still give each worker thread its own
> private transmit ring.
> 

We can build this into the interface by having the setsockopt
provide both the number of tx rings and number of rx rings. It
might not be immediately available in any drivers because at
least ixgbe is pretty dependent on tx/rx pairing.

I would have to look through the other drivers to see how
much work it would be to support this on them. If I can't find
a good candidate we might leave it out until we can fix up the
drivers.
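
To make the idea concrete, here is a minimal sketch of what a request
carrying separate rx and tx counts could look like, together with the
check a pairing-dependent driver would need. The struct, its field
names and qsplit_check() are hypothetical and not part of the posted
patch; the paired_only flag stands in for whatever a driver like ixgbe
would report.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request layout: not part of the posted patch. */
struct qsplit_req {
    unsigned int num_rx;    /* rx rings requested */
    unsigned int num_tx;    /* tx rings requested */
};

/*
 * Return 0 if the request can be honoured, -EINVAL for a missing or
 * empty request, and -EOPNOTSUPP when the driver only supports paired
 * rings but the rx and tx counts differ.
 */
int qsplit_check(const struct qsplit_req *req, bool paired_only)
{
    if (req == NULL || (req->num_rx == 0 && req->num_tx == 0))
        return -EINVAL;
    if (paired_only && req->num_rx != req->num_tx)
        return -EOPNOTSUPP;
    return 0;
}
```

With something like this, a receive-on-one-ring, transmit-on-four
request would pass on a driver without the pairing restriction and
fail cleanly with -EOPNOTSUPP on one that has it.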

> 
>>>
>>>>         /* Get the layout of ring space offset, page_sz, cnt */
>>>>         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>>>>                    &info, &optlen);
>>>>
>>>>         /* request some queues from the driver */
>>>>         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>>>>                    &qpairs_info, sizeof(qpairs_info));
>>>>
>>>>         /* if we let the driver pick us queues learn which queues
>>>>          * we were given
>>>>          */
>>>>         optlen = sizeof(qpairs_info);
>>>>         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>>>>                    &qpairs_info, &optlen);
>>>
>>> If ethtool -U is used to steer traffic to a specific descriptor queue,
>>> then the setsockopt can pass the exact id of that queue and there
>>> is no need for a getsockopt follow-up.
>>
>> Very good point. The interface supports passing "-1" as the queue id (followed by the number of
>> qpairs needed) via setsockopt to af_packet and the NIC kernel driver, asking the driver to
>> dynamically allocate free qpairs for this socket, so getsockopt() is needed to return the
>> queue pair indexes actually assigned. Initially, we had an implementation that called getsockopt
>> once, with af_packet treating qpairs_info as an IN/OUT parameter, but that is semantically wrong,
>> so we think the above implementation is the most suitable. But I agree with you: if setsockopt
>> passes an exact id with a valid queue pair index, there is no need to call getsockopt.
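
The two cases discussed here could be sketched roughly as follows. This
is only an illustration under assumptions: the option value is a
placeholder (the option exists only in this proposed series), and the
struct layout, QPAIRS_ANY and both helper names are hypothetical.

```c
#include <sys/socket.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263
#endif

/* Placeholder value: the option is only proposed in this series. */
#ifndef PACKET_RXTX_QPAIRS_SPLIT
#define PACKET_RXTX_QPAIRS_SPLIT 0x7f01
#endif

#define QPAIRS_ANY (-1)     /* hypothetical name for the "-1" request */

/* Hypothetical layout; field names are illustrative only. */
struct qpairs_info {
    int start;              /* first queue pair index, or QPAIRS_ANY */
    unsigned int num;       /* number of queue pairs wanted */
};

/* A follow-up getsockopt() is only needed when the driver chooses. */
int qpairs_need_query(const struct qpairs_info *qi)
{
    return qi->start == QPAIRS_ANY;
}

/* Returns 0 on success, -1 with errno set on failure. */
int qpairs_request(int fd, struct qpairs_info *qi)
{
    socklen_t len = sizeof(*qi);
    int need_query = qpairs_need_query(qi);

    if (setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT, qi, len) < 0)
        return -1;
    if (!need_query)
        return 0;   /* caller named the queue, e.g. one set up by ethtool -U */

    /* Driver picked the queues: learn which ones we were given. */
    return getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT, qi, &len);
}
```

Keeping the decision in one helper means the exact-id path costs a
single system call, while the "-1" path keeps setsockopt input-only
and uses getsockopt purely for output.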
> 
> One step further would be to move the entire configuration behind
> the packet socket interface. It's perhaps out of scope of this patch,
> but the difference between using `ethtool -U` and passing the same
> expression through the packet socket is that in the latter case the
> kernel can automatically rollback the configuration change when the
> process dies.
> 

Hmm, that might be interesting. I think this is a follow-on path to
investigate after the initial support.

>>
>>>
>>>>         /* And mmap queue pairs to user space */
>>>>         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>>>>              MAP_SHARED, fd, 0);
>>>
>>> How will packet data be mapped and how will userspace translate
>>> from paddr to vaddr? Is the goal to maintain long lived mappings
>>> and instruct drivers to allocate from this restricted range (to
>>> avoid per-packet system calls and vma operations)?
>>>
>>
>> Once the qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
>> completely in user space by using paddr (in the case of DPDK, the vaddr of the huge pages DPDK uses
>> is translated to paddr) to fill in the packet descriptors.
> 
> Ah, userspace is responsible for posting buffers and translation
> from vaddr to paddr is straightforward. Yes that makes sense.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
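
For reference, one common way for user space to do the vaddr to paddr
translation on Linux is to read /proc/self/pagemap. The following is a
minimal sketch of that technique, not necessarily what the slave driver
in this series does; the function names are made up for illustration.
It assumes the buffer is pinned (huge pages or mlock), and that the
process is privileged enough to see frame numbers, since recent kernels
report a zero PFN to unprivileged readers.

```c
#define _XOPEN_SOURCE 700   /* for pread() */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PM_PRESENT  (1ULL << 63)            /* page is present in RAM */
#define PM_PFN_MASK ((1ULL << 55) - 1)      /* bits 0-54: page frame number */

/* Decode one pagemap entry; returns 0 and fills *paddr, or -1. */
int pagemap_decode(uint64_t entry, uintptr_t vaddr, uint64_t page_sz,
                   uint64_t *paddr)
{
    uint64_t pfn = entry & PM_PFN_MASK;

    if (!(entry & PM_PRESENT) || pfn == 0)
        return -1;  /* not resident, or PFN hidden from this process */
    *paddr = pfn * page_sz + (vaddr % page_sz);
    return 0;
}

/* Look up the physical address backing vaddr in this process. */
int virt_to_phys(const void *vaddr, uint64_t *paddr)
{
    uint64_t page_sz = (uint64_t)sysconf(_SC_PAGESIZE);
    uint64_t entry;
    /* One 64-bit entry per virtual page, indexed by page number. */
    off_t off = (off_t)((uintptr_t)vaddr / page_sz * sizeof(entry));
    int fd = open("/proc/self/pagemap", O_RDONLY);
    int ret = -1;

    if (fd < 0)
        return -1;
    if (pread(fd, &entry, sizeof(entry), off) == (ssize_t)sizeof(entry))
        ret = pagemap_decode(entry, (uintptr_t)vaddr, page_sz, paddr);
    close(fd);
    return ret;
}
```

The resulting address is what would go into a packet descriptor. For
huge pages the lookup only has to be done once per page at setup time,
so there is no per-packet system call.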
