Message-ID: <56A69A79.4080701@gmail.com>
Date:	Mon, 25 Jan 2016 13:58:17 -0800
From:	John Fastabend <john.fastabend@...il.com>
To:	Tom Herbert <tom@...bertland.com>
CC:	Jesper Dangaard Brouer <brouer@...hat.com>,
	"Michael S. Tsirkin" <mst@...hat.com>,
	David Miller <davem@...emloft.net>,
	Eric Dumazet <eric.dumazet@...il.com>,
	Or Gerlitz <gerlitz.or@...il.com>,
	Eric Dumazet <edumazet@...gle.com>,
	Linux Kernel Network Developers <netdev@...r.kernel.org>,
	Alexander Duyck <alexander.duyck@...il.com>,
	Alexei Starovoitov <alexei.starovoitov@...il.com>,
	Daniel Borkmann <borkmann@...earbox.net>,
	Marek Majkowski <marek@...udflare.com>,
	Hannes Frederic Sowa <hannes@...essinduktion.org>,
	Florian Westphal <fw@...len.de>,
	Paolo Abeni <pabeni@...hat.com>,
	John Fastabend <john.r.fastabend@...el.com>,
	Amir Vadai <amirva@...il.com>,
	Daniel Borkmann <daniel@...earbox.net>,
	Vladislav Yasevich <vyasevich@...il.com>
Subject: Re: Bypass at packet-page level (Was: Optimizing instruction-cache,
 more packets at each stage)

On 16-01-25 01:32 PM, Tom Herbert wrote:
> On Mon, Jan 25, 2016 at 9:50 AM, John Fastabend
> <john.fastabend@...il.com> wrote:
>> On 16-01-25 09:09 AM, Tom Herbert wrote:
>>> On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer
>>> <brouer@...hat.com> wrote:
>>>>
>>>> After reading John's reply about perfect filters, I want to re-state
>>>> my idea for this very early RX stage, and describe a packet-page
>>>> level bypass use-case that John indirectly mentions.
>>>>
>>>>
>>>> There are two ideas getting mixed up here: (1) bundling from the
>>>> RX-ring, and (2) allowing the "packet-page" to be picked up directly.
>>>>
>>>> Bundling (1) is something that seems natural, and which helps us
>>>> amortize the cost between layers (and utilizes the icache better).
>>>> Let's keep that in another thread.
>>>>
>>>> This (2) direct forwarding of "packet-pages" is a fairly extreme idea,
>>>> BUT it has the potential of becoming a new integration point for
>>>> "selective" bypass-solutions, and of bringing RAW/af_packet (RX) up to
>>>> speed with the bypass-solutions.
>>>>
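To make (2) concrete, here is a hypothetical sketch of what such an
integration point could look like. Nothing below exists in the kernel
today; all of the names are invented purely for illustration:

  /* Hypothetical hook, called from a driver's RX poll loop before
   * any skb is built.  Returning true means the consumer took
   * ownership of the raw DMA page; false falls through to the
   * normal stack. */
  typedef bool (*rx_page_hook_t)(struct net_device *dev,
                                 struct page *page,
                                 unsigned int offset,
                                 unsigned int len);

  /* Hypothetical registration, e.g. one hook per RX queue. */
  int netdev_register_rx_page_hook(struct net_device *dev,
                                   rx_page_hook_t hook);

A bypass consumer would then see pages straight off the RX-ring,
while everything it declines still goes through the regular path.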
>>>>
>>>> Today, the bypass-solutions grab and control the entire NIC HW.  In
>>>> many cases this is not very practical if you also want to use the NIC
>>>> for something else.
>>>>
>>>> Solutions for bypassing only part of the traffic are starting to
>>>> show up, both a netmap[1] and a DPDK[2] based approach.
>>>>
>>>> [1] https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/
>>>> [2] http://rhelblog.redhat.com/2015/10/02/getting-the-best-of-both-worlds-with-queue-splitting-bifurcated-driver/
>>>>
>>>> Both approaches install a HW filter in the NIC and redirect packets
>>>> to a separate RX HW queue (via ethtool ntuple + flow-type).  DPDK
>>>> needs a PCI SR-IOV setup and then runs its own poll-mode driver on
>>>> top.  Netmap patches the original ixgbe driver and, since
>>>> CloudFlare/Gilberto's changes[3], supports a single RX queue mode.
>>>>
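For reference, the same kind of ntuple rule can also be installed
programmatically through the ETHTOOL_SRXCLSRLINS ioctl.  A minimal
userspace sketch (device name, port, and queue number are only
examples; error handling is trimmed, and driver support varies, e.g.
RX_CLS_LOC_ANY requires a driver that can pick the rule slot itself):

  #include <string.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <sys/socket.h>
  #include <net/if.h>
  #include <arpa/inet.h>
  #include <linux/ethtool.h>
  #include <linux/sockios.h>

  int main(void)
  {
          struct ethtool_rxnfc nfc;
          struct ifreq ifr;
          int fd = socket(AF_INET, SOCK_DGRAM, 0);

          memset(&nfc, 0, sizeof(nfc));
          nfc.cmd = ETHTOOL_SRXCLSRLINS;
          nfc.fs.flow_type = TCP_V4_FLOW;
          nfc.fs.h_u.tcp_ip4_spec.pdst = htons(9000);   /* match dst port */
          nfc.fs.m_u.tcp_ip4_spec.pdst = htons(0xffff); /* ... exactly    */
          nfc.fs.ring_cookie = 4;                       /* target RX queue */
          nfc.fs.location = RX_CLS_LOC_ANY;

          memset(&ifr, 0, sizeof(ifr));
          strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
          ifr.ifr_data = (void *)&nfc;

          if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
                  return 1;
          close(fd);
          return 0;
  }

This is the same thing `ethtool -N eth0 flow-type tcp4 dst-port 9000
action 4` does from the command line.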
>>
>> FWIW I wrote a version of the patch talked about in the queue-splitting
>> article that didn't require SR-IOV, and we also talked about it at the
>> last netconf in Ottawa. The problem is that without SR-IOV, if you map
>> a queue directly into userspace so you can run the poll-mode drivers,
>> there is nothing protecting the DMA engine, so userspace can put
>> arbitrary addresses in there. There is something called Process Address
>> Space ID (PASID), also part of the PCI-SIG spec, that could help here,
>> but I don't know of any hardware that supports it. The other option is
>> to use system calls and validate the descriptors in the kernel, but
>> this incurs some overhead; we had it at 15% or so when I did the
>> numbers last year. However, I'm told there is some interesting work
>> going on around syscall overhead that may help.
>>
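To make the validation option concrete: the kernel pins a region of
user memory at setup time, and the submit syscall then only has to
bounds-check each user descriptor against that region before writing
the real descriptor to the HW ring.  A hypothetical kernel-side
sketch (user_desc and umem_region are invented names, not an
existing API):

  struct user_desc {
          __u64 offset;           /* offset into the registered region */
          __u32 len;
  };

  struct umem_region {
          dma_addr_t base;        /* DMA address of the pinned region */
          size_t size;
  };

  /* Reject any descriptor that would let the DMA engine touch memory
   * outside the pinned region; written to avoid offset+len overflow. */
  static int validate_desc(const struct umem_region *r,
                           const struct user_desc *d,
                           dma_addr_t *dma)
  {
          if (d->len == 0 || d->len > r->size ||
              d->offset > r->size - d->len)
                  return -EINVAL;
          *dma = r->base + d->offset;
          return 0;
  }

The per-descriptor cost is just this check plus the syscall itself,
which is the kind of overhead the 15% figure above refers to.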
>> One thing to note is that SR-IOV limits the number of these types of
>> interfaces you can support to the max number of VFs, whereas the queue
>> mechanism, although slower due to the system call, would be limited
>> only by the max number of queues. Also, busy polling will help here
>> if you are worried about pps.
>>
> I think you're understating that a bit :-) We know that busy polling
> helps with both pps and latency. IIRC, busy polling in the kernel
> reduced latency by 2/3. Any latency or pps comparison between an
> interrupt-driven kernel stack and a userspace stack doing polling
> would be invalid. If this work is all about latency (like burning
> cores is not an issue), maybe busy polling should be assumed for
> all test cases?

Probably. If you're going to try and report pps numbers and chart them,
we might as well play the game and use the best configuration we can.

Although I did want to make busy polling per queue, or maybe create
L3/L4 netdevs like macvlan and put those in busy polling. It's a bit
overkill to put the entire device in busy polling mode when we have
only a couple of sockets doing it. net-next is opening soon, right ;)
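Note the per-socket opt-in already exists: SO_BUSY_POLL sets a
busy-poll budget in microseconds on one socket, instead of flipping
the system-wide net.core.busy_read/busy_poll sysctls.  Roughly (the
50 usec budget is just an example; unprivileged users can only lower
the value, raising it above the sysctl default needs CAP_NET_ADMIN):

  #include <stdio.h>
  #include <sys/socket.h>

  int enable_busy_poll(int fd)
  {
          int usecs = 50;  /* busy-poll up to 50us per blocking read */

          if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                         &usecs, sizeof(usecs)) < 0) {
                  perror("setsockopt(SO_BUSY_POLL)");
                  return -1;
          }
          return 0;
  }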

> 
>> Jesper, at least for your (2) case, what are we missing with the
>> bifurcated/queue-splitting work? Are you really after systems
>> without SR-IOV support, or are you trying to get this on the order
>> of queues instead of VFs?
>>
>>> Jesper, thanks for providing more specifics.
>>>
>>> One comment: If you intend to change core code paths or APIs for this,
>>> then I think that we should require up front that the associated HW
>>> support is protocol agnostic (i.e., HW filters must be programmable
>>> and generic). We don't want a promising feature like this to be
>>> undermined by protocol ossification.
>>
>> At the moment we use ethtool ntuple filters, which basically means
>> adding a new set of enums and structures every time we need a new
>> protocol. So it's painful: you need your vendor to support you, and
>> you need a new kernel.
>>
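For anyone who hasn't hit this: the ntuple uapi hard-codes one spec
structure plus one flow-type value per protocol in linux/ethtool.h,
along the lines of (excerpted):

  struct ethtool_tcpip4_spec {
          __be32  ip4src;
          __be32  ip4dst;
          __be16  psrc;
          __be16  pdst;
          __u8    tos;
  };

  /* ... plus TCP_V4_FLOW, UDP_V4_FLOW, SCTP_V4_FLOW, and so on.  A
   * new protocol means a new struct, a new flow-type value, driver
   * support, and a new kernel. */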
>> The flow API was shot down (it would have gotten you to the point where
>> the user could specify the protocols for the driver to implement, e.g.
>> via put_parse_graph), and the only new proposals I've seen are BPF
>> translations in drivers and 'tc'. I plan to take another shot at this
>> in net-next.
>>
>>>
>>> Thanks,
>>> Tom
>>>
>>>> [3] https://github.com/luigirizzo/netmap/pull/87
>>>>
>>
