Message-ID: <58A8D88B.3000907@gmail.com>
Date: Sat, 18 Feb 2017 15:28:11 -0800
From: John Fastabend <john.fastabend@...il.com>
To: Alexander Duyck <alexander.duyck@...il.com>,
Eric Dumazet <eric.dumazet@...il.com>
Cc: Jesper Dangaard Brouer <brouer@...hat.com>,
Netdev <netdev@...r.kernel.org>,
Tom Herbert <tom@...bertland.com>,
Alexei Starovoitov <ast@...nel.org>,
John Fastabend <john.r.fastabend@...el.com>,
Daniel Borkmann <daniel@...earbox.net>,
David Miller <davem@...emloft.net>
Subject: Re: Questions on XDP
On 17-02-18 10:18 AM, Alexander Duyck wrote:
> On Sat, Feb 18, 2017 at 9:41 AM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>> On Sat, 2017-02-18 at 17:34 +0100, Jesper Dangaard Brouer wrote:
>>> On Thu, 16 Feb 2017 14:36:41 -0800
>>> John Fastabend <john.fastabend@...il.com> wrote:
>>>
>>>> On 17-02-16 12:41 PM, Alexander Duyck wrote:
>>>>> So I'm in the process of enabling XDP for the Intel NICs, and I have
>>>>> a few questions, so I thought I would put them out here to get
>>>>> everything sorted before I paint myself into a corner.
>>>>>
>>>>> So my first question is: why does the documentation mention 1 frame
>>>>> per page for XDP?
>>>
>>> Yes, XDP defines up front a memory model where there is only one
>>> packet per page[1]; please respect that!
>>>
>>> This is currently used/needed for fast, direct recycling of pages
>>> inside the driver for XDP_DROP and XDP_TX, _without_ performing any
>>> atomic refcnt operations on the page. E.g. see mlx4_en_rx_recycle().
Alex, does your pagecnt_bias trick resolve this? It seems to me that the
recycling is working just fine in the ixgbe patches (at least I never see
the allocator being triggered with simple XDP programs). The biggest win
for me right now is avoiding the DMA mapping operations.
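
To sketch the trick for the list (rough illustrative code, not the actual
ixgbe patches): the driver charges the page's refcount once with a large
bias at allocation time, then tracks its own references in a plain local
counter, so the RX hot path never touches the atomic:

#include <linux/kernel.h>
#include <linux/mm.h>

struct rx_buffer {
        struct page *page;
        unsigned short pagecnt_bias;    /* driver-local refs, no atomics */
};

/* Call with pagecnt_bias = 1 right after allocation (refcount is 1),
 * and again whenever the bias runs low: one atomic op per recharge
 * instead of one per packet.
 */
static void rx_buffer_charge(struct rx_buffer *buf)
{
        page_ref_add(buf->page, USHRT_MAX - buf->pagecnt_bias);
        buf->pagecnt_bias = USHRT_MAX;
}

/* Every reference handed out to the stack has been dropped iff the
 * real refcount has fallen back to our local bias.
 */
static bool rx_buffer_can_recycle(struct rx_buffer *buf)
{
        return page_ref_count(buf->page) == buf->pagecnt_bias;
}

Handing a half page up to the stack is then just buf->pagecnt_bias--,
since the bias already covers the reference being transferred.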
>>
>>
>> XDP_DROP does not require having one page per frame.
>
> Agreed.
>
>> (Look at my recent mlx4 patch series if you need to be convinced.)
>>
>> Only XDP_TX does.
I'm still not sure what a page per packet buys us on XDP_TX. What was the
explanation again?
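
For the archive, my rough mental model of the recycle path (a sketch based
on my reading of mlx4_en_rx_recycle(), not the actual driver code): with
exactly one frame per page, TX completion knows the driver is the sole
owner, so it can push the page straight back into a small RX cache with no
atomic refcounting:

#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/mm_types.h>

struct rx_page_cache {
        struct page *pages[64];
        unsigned int count;
};

static void xdp_tx_complete(struct rx_page_cache *cache, struct page *page)
{
        if (cache->count < ARRAY_SIZE(cache->pages)) {
                /* sole owner by construction: no refcount check needed */
                cache->pages[cache->count++] = page;
                return;
        }
        __free_pages(page, 0);  /* cache full, hand it back */
}

If the pagecnt_bias approach makes the refcount nearly free anyway, I'm
not sure how much this buys us, hence the question.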
>>
>> This requirement makes XDP useless (an OOM is very likely) on arches
>> with 64K pages.
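
To put rough numbers on that (my arithmetic, not from Eric's series): a
single 512-entry RX ring pins 512 * 64KB = 32MB with one frame per page,
versus 2MB with 4K pages, and that multiplies across queues and
interfaces.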
>
> Actually I have been having a side discussion with John about XDP_TX.
> Looking at the Mellanox way of doing it, I am not entirely sure it is
> useful. It looks good for benchmarks, but that is about it. Also, I
> don't see it extending to the point where we would be able to exchange
> packets between interfaces, which really seems like it should be the
> ultimate goal for XDP_TX.
This is needed if we want XDP to be used for vswitch use cases. We have a
patch running on virtio, but we really need to get it working on real
hardware before we push it.
>
> It seems like eventually we want to be able to peel off the buffer and
> send it to something other than ourselves. For example, it might be
> useful at some point to use XDP to do traffic classification and have
> it route packets between multiple interfaces on a host. It wouldn't
> make sense to have all of them map every page as bidirectional, because
> that becomes ridiculous once you have dozens of interfaces in a system.
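
To make the mapping cost concrete (illustrative sketch, names are mine):
an RX-only page can be mapped DMA_FROM_DEVICE, transmitting back out the
same device needs DMA_BIDIRECTIONAL, and forwarding to a *different*
device would need a second mapping in that device's DMA domain:

#include <linux/dma-mapping.h>

static dma_addr_t map_rx_page(struct device *dev, struct page *page,
                              bool may_xdp_tx)
{
        /* bidirectional only when the page might be transmitted back
         * out of this same device; forwarding to another device would
         * still need its own dma_map_page() against that device
         */
        return dma_map_page(dev, page, 0, PAGE_SIZE,
                            may_xdp_tx ? DMA_BIDIRECTIONAL :
                                         DMA_FROM_DEVICE);
}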
>
> As per our original discussion at netconf, if we want to be able to do
> XDP Tx with a fully lockless Tx ring, we need a Tx ring per CPU that is
> performing XDP. The Tx path will end up needing to do the map/unmap
> itself in the case of physical devices, but the expense of that can be
> somewhat mitigated on x86, at least, by either disabling the IOMMU or
> using identity mapping. I think this might be the route worth
> exploring, as we could then start looking at doing things like
> implementing bridges and routers in XDP and see what performance gains
> can be had there.
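
For reference, the shape of what we discussed at netconf, as I remember it
(minimal sketch, all names mine):

#include <linux/smp.h>

struct xdp_tx_ring;                     /* driver-specific ring, assumed */

struct xdp_netdev_priv {
        struct xdp_tx_ring *xdp_ring[NR_CPUS];
};

static struct xdp_tx_ring *xdp_tx_ring_get(struct xdp_netdev_priv *priv)
{
        /* only safe lock-free because we run inside NAPI poll with
         * bottom halves disabled, so the CPU cannot change under us
         */
        return priv->xdp_ring[smp_processor_id()];
}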
One issue I have with a TX ring per CPU per device is that in my current
use case I have 2k tap/vhost devices and need to scale up to more than
that. Taking the naive approach and having each tap/vhost device create a
per-CPU ring would mean 128k rings on my current dev box. I think locking
could be made optional without too much difficulty.
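
Something like the following is what I have in mind for making the locking
optional (hypothetical sketch): only pay for the lock when the device has
fewer XDP TX rings than CPUs and rings are shared:

#include <linux/spinlock.h>

struct xdp_tx_ring {
        spinlock_t lock;
        bool shared;    /* true when rings < nr_cpu_ids */
};

static void xdp_tx_ring_lock(struct xdp_tx_ring *ring)
{
        if (ring->shared)
                spin_lock(&ring->lock);
}

static void xdp_tx_ring_unlock(struct xdp_tx_ring *ring)
{
        if (ring->shared)
                spin_unlock(&ring->lock);
}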
>
> Also, as far as the one page per frame goes, it occurs to me that you
> will have to eventually deal with things like frame replication. Once
> that comes into play everything becomes much more difficult, because
> the recycling doesn't work without some sort of reference counting, and
> since the device interrupt can migrate you could end up with clean-up
> occurring on a different CPU, so you need some sort of synchronization
> mechanism.
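
Right, and a tiny illustration of why (not real code): as soon as a frame
is replicated, somebody has to take a real page reference, and the cheap
sole-owner recycle test stops passing until every clone is freed:

#include <linux/mm.h>

static void xdp_frame_clone(struct page *page)
{
        get_page(page);         /* second owner: atomics are back */
}

static bool sole_owner(struct page *page)
{
        return page_ref_count(page) == 1;       /* false while clones live */
}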
>
> Thanks.
>
> - Alex
>