Date:   Tue, 20 Feb 2018 08:52:59 -0800
From:   John Fastabend <john.fastabend@...il.com>
To:     Jesper Dangaard Brouer <brouer@...hat.com>
Cc:     Jason Wang <jasowang@...hat.com>,
        "Michael S. Tsirkin" <mst@...hat.com>, netdev@...r.kernel.org,
        Alexei Starovoitov <alexei.starovoitov@...il.com>,
        Daniel Borkmann <borkmann@...earbox.net>,
        "David S. Miller" <davem@...emloft.net>,
        Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in
 receive_mergeable() case

On 02/20/2018 03:17 AM, Jesper Dangaard Brouer wrote:
> On Fri, 16 Feb 2018 09:19:02 -0800
> John Fastabend <john.fastabend@...il.com> wrote:
> 
>> On 02/16/2018 07:41 AM, Jesper Dangaard Brouer wrote:
>>> On Fri, 16 Feb 2018 13:31:37 +0800
>>> Jason Wang <jasowang@...hat.com> wrote:
>>>   
>>>> On 02/16/2018 06:43, Jesper Dangaard Brouer wrote:
>>>>> The virtio_net code has three different RX code paths in receive_buf().
>>>>> Two of these code paths can handle XDP, but one of them is broken for
>>>>> at least XDP_REDIRECT.
>>>>>
>>>>> Function(1): receive_big() does not support XDP.
>>>>> Function(2): receive_small() supports XDP fully and uses build_skb().
>>>>> Function(3): receive_mergeable() has broken XDP_REDIRECT and uses napi_alloc_skb().
>>>>>
>>>>> The simple explanation is that receive_mergeable() is broken because
>>>>> it uses napi_alloc_skb(), which violates XDP, given that XDP assumes the
>>>>> packet header+data live in a single page with enough tail room for
>>>>> skb_shared_info.
>>>>>
>>>>> The longer explanation is that receive_mergeable() tries to
>>>>> work around and satisfy these XDP requirements, e.g. by having a
>>>>> function xdp_linearize_page() that allocates and memcpys RX buffers
>>>>> around (in case the packet is scattered across multiple RX buffers).
>>>>> This currently satisfies XDP_PASS, XDP_DROP and XDP_TX (but only because
>>>>> we have not implemented bpf_xdp_adjust_tail yet).
>>>>>
>>>>> The XDP_REDIRECT action combined with cpumap is broken, and causes
>>>>> hard-to-debug crashes.  The main issue is that the RX packet does not
>>>>> have the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing
>>>>> skb_shared_info to overlap the next packet's head-room (in which cpumap
>>>>> stores info).
>>>>>
>>>>> Reproducing depends on the packet payload length and on whether the
>>>>> RX-buffer size happens to have tail-room for skb_shared_info or not.
>>>>> But to make this even harder to troubleshoot, the RX-buffer size is
>>>>> dynamically changed at runtime, based on an Exponentially Weighted
>>>>> Moving Average (EWMA) over the packet length, when refilling RX rings.
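
(The EWMA part is easy to picture; roughly something like the plain C
sketch below. This is only the idea, not the driver's code; IIRC the
driver builds on the kernel's DECLARE_EWMA helpers, and all names below
are made up.)

struct pkt_len_ewma {
        unsigned long avg;      /* running average of packet length */
        unsigned int weight;    /* larger weight = slower adaptation */
};

static void pkt_len_ewma_add(struct pkt_len_ewma *e, unsigned long len)
{
        /* new_avg = old_avg + (len - old_avg) / weight, in integer form */
        e->avg = e->avg ? (e->avg * (e->weight - 1) + len) / e->weight : len;
}

/* Refill sizes buffers from the average, so whether a given packet ends
 * up with skb_shared_info tailroom depends on recent traffic. */
static unsigned long pick_refill_len(const struct pkt_len_ewma *e,
                                     unsigned long min_len, unsigned long max_len)
{
        unsigned long len = e->avg;

        if (len < min_len)
                len = min_len;
        if (len > max_len)
                len = max_len;
        return len;
}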
>>>>>
>>>>> This patch only disables XDP_REDIRECT support in the receive_mergeable()
>>>>> case, because it can cause a real crash.
>>>>>
>>>>> But IMHO we should NOT support XDP in receive_mergeable() at all,
>>>>> because the principles behind XDP are to gain speed by (1) code
>>>>> simplicity, (2) sacrificing memory and (3) where possible moving
>>>>> runtime checks to setup time.  These principles are clearly being
>>>>> violated in receive_mergeable(), which e.g. tracks the average buffer
>>>>> size at runtime to reduce memory consumption.
>>>>
>>>> I agree to disable it for -net now.   
>>>
>>> Okay... I'll send an official patch later.
>>>   
>>>> For net-next, we probably can do:
>>>>
>>>> - drop xdp_linearize_page() and do XDP through generic XDP helper
>>>>   after the skb is built
>>>
>>> I disagree strongly here - it makes no sense.
>>>
>>> Why do you want to explicitly fall back to Generic-XDP?
>>> (... then all the performance gain is gone!)
>>> And besides, a couple of function calls later, the generic XDP code
>>> will/can get invoked anyhow...
>>>   
>>
>> Hi, can we get EWMA to ensure that in the majority of cases we have the
>> extra head room? Seems we could just over-estimate the size by N bytes. In
>> some cases we may under-estimate and would then need to fall back to
>> generic XDP or otherwise grow the buffer, which of course would be
>> painful and slow, but presumably would happen rarely.
> 
> Hmmm... (first of all, it is missing tail-room, not head-room).
> Second, having all this extra size-estimating code and fallback options
> leaves a very bad taste in my mouth... this sounds like a sure way to
> kill performance.
> 
> 
>> I think it would be much better to keep this feature than to kill it and
>> make the configuration needed to get XDP working on virtio even more painful.
> 
> Based on your request, I'm going to fix as much as possible of the
> XDP code path in the virtio_net driver... I now have 4 fix patches...
> 

Thanks a lot!

> There is no way around disabling XDP_REDIRECT in receive_mergeable(),
> as XDP does not have a way to detect/know the "data_hard_end" of the
> data "frame".
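
Right. To spell out what is missing: if xdp_buff carried something like
a data_hard_end (purely hypothetical, no such field exists), a redirect
consumer could at least verify the tailroom before laying down
skb_shared_info. Rough sketch, every name below is made up:

#include <linux/skbuff.h>       /* SKB_DATA_ALIGN, struct skb_shared_info */

struct xdp_buff_with_end {
        void *data;             /* start of packet payload */
        void *data_end;         /* end of packet payload */
        void *data_hard_start;  /* true start of the buffer */
        void *data_hard_end;    /* hypothetical: true end of the buffer */
};

static bool xdp_has_shinfo_tailroom(const struct xdp_buff_with_end *xdp)
{
        unsigned long tailroom = (char *)xdp->data_hard_end - (char *)xdp->data_end;

        return tailroom >= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
}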
> 
>> Disabling EWMA also seems reasonable to me.
> 
> To me, it seems more reasonable to have a separate RX function call
> when an XDP program gets attached, and in that process change the
> memory model so it is compatible with XDP.
> 

I would be OK with that but would be curious to see what Jason and
Michael think. When I originally wrote the XDP support for virtio, the
XDP infra was still primitive and we didn't have metadata, cpu maps,
etc. yet. I suspect there might need to be some additional coordination
between guest and host though to switch the packet modes. If I recall
correctly, this was where some of the original trouble came from.
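
Something along these lines, I guess, where attaching the program flips
the ring over to an XDP-friendly memory model and a dedicated receive
routine. A sketch only; every name here is made up and nothing like this
exists in virtio_net today:

struct bpf_prog;
struct sk_buff;

struct example_receive_queue {
        struct bpf_prog *xdp_prog;      /* non-NULL once a program is attached */
        /* ... ring state ... */
};

/* Hypothetical helpers: page-per-frame XDP path vs. the existing
 * mergeable path. */
struct sk_buff *example_receive_xdp(struct example_receive_queue *rq,
                                    void *buf, unsigned int len);
struct sk_buff *example_receive_mergeable(struct example_receive_queue *rq,
                                          void *buf, unsigned int len);

static struct sk_buff *example_receive(struct example_receive_queue *rq,
                                       void *buf, unsigned int len)
{
        if (rq->xdp_prog)
                /* buffers allocated with headroom + shinfo tailroom reserved */
                return example_receive_xdp(rq, buf, len);

        /* EWMA-sized buffers, no XDP layout guarantees */
        return example_receive_mergeable(rq, buf, len);
}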

>  
>>> Take a step back:
>>>  What is the reason/use-case for implementing XDP inside virtio_net?
>>>
>>> From a DDoS/performance perspective XDP in virtio_net happens on the
>>> "wrong side", as it is activated _inside_ the guest OS, which is too
>>> late for a DDoS filter, as the guest kick/switch overhead has already
>>> occurred.
>>>   
>>
>> The hypervisor may not "know" how to detect DDoS if it's specific to
>> the VM's domain. In these cases we aren't measuring pps but are looking
>> at cycles/packet. In this case, reducing cycles/packet frees up CPU
>> for useful work. Here I expect we can see a real CPU % drop in the VM by
>> using XDP.
> 
> Well, my benchmarks show that the bottleneck is not the cycles spent or
> saved in the RX code, but the transitions into the
> hypervisor/host-OS... specifically, I tested that by increasing the
> guest/virtio_net NAPI-poll budget and avoiding an emulated PCI-notify in
> virtio_ring.c (vq->notify), I get much, much better numbers...
> 

Interesting. Another thing that might come up in a bit is AF_XDP
in virtio. I suspect that is going to be a very interesting use
case for some folks. But, lots of things need to fall into place
before we get there.

Looks like there will be a Linux Plumbers Conference networking
track again. Maybe we should consider a virtualization session (BOF?)
if there is enough interest.

> 
>> The other use case is that once we have a fast in-kernel NIC-to-VM path,
>> we can expect, from your numbers below, 3+ Mpps. I seem to remember
>> from Jason's netdevconf talk that he had some experimental numbers
>> that were even better. The other case is where the hypervisor is not Linux
>> and is feeding packets even faster; DPDK numbers, again from Jason's
>> slides, seemed to show 11+ Mpps. XDP makes a lot of sense here IMO.
> 
> If the hypervisor schedules the guest better, or the DPDK delivery setup
> makes the guest busy-poll, then sure, we are going to see better
> numbers, and _then_ the cycles saved by XDP-ddos-drop will be worth it.
> 
> 
>> (I pulled those specific pps numbers from memory, but the point is that
>> feeding many Mpps into a VM should be doable)
>>
>> The last thing is we may see hardware VFs emulating virtio at some
>> point. Then XDP would be needed. With the newer virtio spec under
>> development, my impression is that the hardware emulation piece is
>> becoming more of a focus. But someone actually working on that could
>> probably provide a more informed comment.
>>
>>> I do use XDP_DROP inside the guest (driver virtio_net), but just to
>>> perform what I call zoom-in benchmarking, with perf-record isolating the
>>> early RX code path in the guest.  (Using iptables "raw" table drop is
>>> almost as useful for that purpose).
>>>   
>>
>> I suspect customers will eventually start using this in VMs for
>> real use cases once the infrastructure and capabilities mature and
>> kernel versions in deployed VMs catch up.
>>
>>>
>>>
>>> The XDP ndo_xdp_xmit in tuntap/tun.c (that you also implemented) is
>>> significantly more interesting, as it allows us to skip large parts of
>>> the network stack and redirect from a physical device (ixgbe) into a
>>> guest device.  Ran a benchmark:
>>>  - 0.5 Mpps with normal code path into device with driver tun
>>>  - 3.7 Mpps with XDP_REDIRECT from ixgbe into same device  
>>
>> Yep also very interesting but a different use case. This is accelerating
>> the hypervisor vswitch. The above is accelerating the VM domain.
> 
> Yes, you also brought this up above... IMHO as the first step we need
> to accelerate Linux "hypervisor-vswitch" performance into the guest,
> and afterwards we can add XDP optimizations inside the guest.  (I do
> acknowledge that another hypervisor than Linux might already have
> solved this already, and that DPDK already have a faster delivery trick
> into VMs/virtio ... but Linux needs this too).
> 

I agree. I just want to avoid ripping out what we have in virtio if
it can be avoided by fixing issues. Looks like you were able to
fix the major issues at least!

As an aside, I suspect looking at AF_XDP as a backend for virtio may
prove fruitful. But that is just a hunch; I have zero data.

>>>
>>> Plus, there are indications that 3.7 Mpps is not the real limit, as the
>>> guest CPU doing XDP_DROP is 75% idle... thus this is likely a
>>> scheduling + queue size issue.
>>>   
>>
>> The target I had in mind is about 6 Mpps in L2fwd tests.
> 
> I did another L2fwd test inside the guest, via XDP_REDIRECT, but that
> was rather depressing...
> 
> The first trick to get more packets per sec than Linux can currently deliver
> into a virtio_net device was to use ixgbe as input and XDP_REDIRECT into a
> tuntap/tun.c device's ndo_xdp_xmit(), as described above; with that I get
> approx 4 Mpps.
> 
>  10,609,516 pkt/s = RX ixgbe redirect into tun ndo_xdp_xmit()
>   4,063,029 pkt/s = Seen by XDP in guest trying to XDP_REDIRECT
>   1,327,469 pkt/s = Actual TX seen on another virtio_net device
>   1,469,671 pkt/s = Notice: normal Linux forwarding (IpForwDatagrams)
> 
> Thus, the XDP_REDIRECT inside the guest is actually SLOWER than normal
> Linux routing... It looks like the major cost or bottleneck lies in
> virtnet_xdp_xmit (ndo_xdp_xmit).  It seems that virtqueue_add() supports
> sending more frames at a time, but virtnet_xdp_xmit() only sends a
> single frame.  This would actually be a good argument for changing the
> ndo_xdp_xmit API to support bulking...
> 
More work :) But I don't see any reason, other than engineering effort,
why it can't be made to work.
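
For what it's worth, the shape of a bulked xmit is easy to imagine; a
rough sketch with hypothetical names and signature, not what is in the
tree:

/* Hypothetical bulked transmit: enqueue a burst of frames, then notify
 * the host once, instead of one virtqueue kick per packet.  Every name
 * here is made up for illustration. */
struct xdp_frame;               /* opaque for this sketch */
struct example_dev;

int example_enqueue(struct example_dev *dev, struct xdp_frame *f);  /* 0 on success */
void example_kick(struct example_dev *dev);                         /* doorbell/notify */

static int example_bulk_xdp_xmit(struct example_dev *dev,
                                 struct xdp_frame **frames, int n)
{
        int i, sent = 0;

        for (i = 0; i < n; i++) {
                if (example_enqueue(dev, frames[i]) < 0)
                        break;          /* ring full: stop, report what made it */
                sent++;
        }
        if (sent)
                example_kick(dev);      /* one notify for the whole burst */
        return sent;
}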

FWIW, on the topic of XDP redirect, it would be great if we can get the
redirect API into a shape where it can be consumed by other NIC drivers.
Thanks for doing this though, interesting stuff.

Thanks,
John
