netdev - Re: Questions on XDP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <58A8D88B.3000907@gmail.com>
Date:   Sat, 18 Feb 2017 15:28:11 -0800
From:   John Fastabend <john.fastabend@...il.com>
To:     Alexander Duyck <alexander.duyck@...il.com>,
        Eric Dumazet <eric.dumazet@...il.com>
Cc:     Jesper Dangaard Brouer <brouer@...hat.com>,
        Netdev <netdev@...r.kernel.org>,
        Tom Herbert <tom@...bertland.com>,
        Alexei Starovoitov <ast@...nel.org>,
        John Fastabend <john.r.fastabend@...el.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        David Miller <davem@...emloft.net>
Subject: Re: Questions on XDP

On 17-02-18 10:18 AM, Alexander Duyck wrote:
> On Sat, Feb 18, 2017 at 9:41 AM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>> On Sat, 2017-02-18 at 17:34 +0100, Jesper Dangaard Brouer wrote:
>>> On Thu, 16 Feb 2017 14:36:41 -0800
>>> John Fastabend <john.fastabend@...il.com> wrote:
>>>
>>>> On 17-02-16 12:41 PM, Alexander Duyck wrote:
>>>>> So I'm in the process of working on enabling XDP for the Intel NICs
>>>>> and I had a few questions so I just thought I would put them out here
>>>>> to try and get everything sorted before I paint myself into a corner.
>>>>>
>>>>> So my first question is why does the documentation mention 1 frame per
>>>>> page for XDP?
>>>
>>> Yes, XDP defines upfront a memory model where there is only one packet
>>> per page[1], please respect that!
>>>
>>> This is currently used/needed for fast-direct recycling of pages inside
>>> the driver for XDP_DROP and XDP_TX, _without_ performing any atomic
>>> refcnt operations on the page. E.g. see mlx4_en_rx_recycle().

Alex, does your pagecnt_bias trick resolve this? It seems to me that the
recycling is working in ixgbe patches just fine (at least I never see the
allocator being triggered with simple XDP programs). The biggest win for
me right now is to avoid the dma mapping operations.

>>
>>
>> XDP_DROP does not require having one page per frame.
> 
> Agreed.
> 
>> (Look after my recent mlx4 patch series if you need to be convinced)
>>
>> Only XDP_TX is.

I'm still not sure what page per packet buys us on XDP_TX. What was the
explanation again?

>>
>> This requirement makes XDP useless (very OOM likely) on arches with 64K
>> pages.
> 
> Actually I have been having a side discussion with John about XDP_TX.
> Looking at the Mellanox way of doing it I am not entirely sure it is
> useful.  It looks good for benchmarks but that is about it.  Also I
> don't see it extending out to the point that we would be able to
> exchange packets between interfaces which really seems like it should
> be the ultimate goal for XDP_TX.

This is needed if we want XDP to be used for vswitch use cases. We have
a patch running on virtio but really need to get it working on real
hardware before we push it.

> 
> It seems like eventually we want to be able to peel off the buffer and
> send it to something other than ourselves.  For example it seems like
> it might be useful at some point to use XDP to do traffic
> classification and have it route packets between multiple interfaces
> on a host and it wouldn't make sense to have all of them map every
> page as bidirectional because it starts becoming ridiculous if you
> have dozens of interfaces in a system.
> 
> As per our original discussion at netconf if we want to be able to do
> XDP Tx with a fully lockless Tx ring we needed to have a Tx ring per
> CPU that is performing XDP.  The Tx path will end up needing to do the
> map/unmap itself in the case of physical devices but the expense of
> that can be somewhat mitigated on x86 at least by either disabling the
> IOMMU or using identity mapping.  I think this might be the route
> worth exploring as we could then start looking at doing things like
> implementing bridges and routers in XDP and see what performance gains
> can be had there.

One issue I have with TX ring per CPU per device is in my current use
case I have 2k tap/vhost devices and need to scale up to more than that.
Taking the naive approach and making each tap/vhost create a per cpu
ring would be 128k rings on my current dev box. I think locking could
be optional without too much difficulty.

> 
> Also as far as the one page per frame it occurs to me that you will
> have to eventually deal with things like frame replication.  Once that
> comes into play everything becomes much more difficult because the
> recycling doesn't work without some sort of reference counting, and
> since the device interrupt can migrate you could end up with clean-up
> occurring on a different CPUs so you need to have some sort of
> synchronization mechanism.
> 
> Thanks.
> 
> - Alex
>