Date:   Thu, 22 Feb 2018 13:50:30 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     John Fastabend <john.fastabend@...il.com>,
        Jesper Dangaard Brouer <brouer@...hat.com>
Cc:     "Michael S. Tsirkin" <mst@...hat.com>, netdev@...r.kernel.org,
        Alexei Starovoitov <alexei.starovoitov@...il.com>,
        Daniel Borkmann <borkmann@...earbox.net>,
        "David S. Miller" <davem@...emloft.net>,
        Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in
 receive_mergeable() case



On 02/21/2018 00:52, John Fastabend wrote:
> On 02/20/2018 03:17 AM, Jesper Dangaard Brouer wrote:
>> On Fri, 16 Feb 2018 09:19:02 -0800
>> John Fastabend <john.fastabend@...il.com> wrote:
>>
>>> On 02/16/2018 07:41 AM, Jesper Dangaard Brouer wrote:
>>>> On Fri, 16 Feb 2018 13:31:37 +0800
>>>> Jason Wang <jasowang@...hat.com> wrote:
>>>>    
>>>>> On 02/16/2018 06:43, Jesper Dangaard Brouer wrote:
>>>>>> The virtio_net code has three different RX code-paths in receive_buf().
>>>>>> Two of these code paths can handle XDP, but one of them is broken for
>>>>>> at least XDP_REDIRECT.
>>>>>>
>>>>>> Function(1): receive_big() does not support XDP.
>>>>>> Function(2): receive_small() supports XDP fully and uses build_skb().
>>>>>> Function(3): receive_mergeable() has broken XDP_REDIRECT and uses napi_alloc_skb().
>>>>>>
>>>>>> The simple explanation is that receive_mergeable() is broken because
>>>>>> it uses napi_alloc_skb(), which violates XDP, given that XDP assumes the
>>>>>> packet header+data are in a single page with enough tail room for skb_shared_info.
>>>>>>
>>>>>> The longer explanation is that receive_mergeable() tries to
>>>>>> work around and satisfy these XDP requirements, e.g. by having a
>>>>>> function xdp_linearize_page() that allocates and memcpys RX buffers
>>>>>> around (in case the packet is scattered across multiple RX buffers).  This
>>>>>> does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because
>>>>>> we have not implemented bpf_xdp_adjust_tail yet).
>>>>>>
>>>>>> The XDP_REDIRECT action combined with cpumap is broken, and causes
>>>>>> hard-to-debug crashes.  The main issue is that the RX packet does not
>>>>>> have the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing
>>>>>> skb_shared_info to overlap the next packet's head-room (in which cpumap
>>>>>> stores info).
>>>>>>
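
For reference, the per-buffer layout constraint that XDP_REDIRECT + cpumap
needs here boils down to roughly this check (a sketch only, not driver code;
the helper name is made up, SKB_DATA_ALIGN/skb_shared_info come from
<linux/skbuff.h> and XDP_PACKET_HEADROOM from <uapi/linux/bpf.h>):

    /* Sketch: a buffer is only safe for XDP_REDIRECT if it leaves XDP
     * headroom in front of the data and tailroom for skb_shared_info
     * behind it. */
    static bool xdp_buf_has_room(unsigned int buflen, unsigned int pkt_len)
    {
            return buflen >= XDP_PACKET_HEADROOM + pkt_len +
                             SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
    }
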
>>>>>> Reproducing depends on the packet payload length and whether the
>>>>>> RX-buffer size happens to have tail-room for skb_shared_info or not.
>>>>>> But to make this even harder to troubleshoot, the RX-buffer size is
>>>>>> dynamically changed at runtime based on an Exponentially Weighted
>>>>>> Moving Average (EWMA) over the packet length, when refilling RX rings.
>>>>>>
>>>>>> This patch only disables XDP_REDIRECT support in the receive_mergeable()
>>>>>> case, because it can cause a real crash.
>>>>>>
>>>>>> But IMHO we should NOT support XDP in receive_mergeable() at all,
>>>>>> because the principles behind XDP are to gain speed by (1) code
>>>>>> simplicity, (2) sacrificing memory and (3) where possible moving
>>>>>> runtime checks to setup time.  These principles are clearly being
>>>>>> violated in receive_mergeable(), which e.g. tracks the average buffer
>>>>>> size at runtime to save memory consumption.
>>>>> I agree to disable it for -net now.
>>>> Okay... I'll send an official patch later.
>>>>    
>>>>> For net-next, we probably can do:
>>>>>
>>>>> - drop xdp_linearize_page() and do XDP through generic XDP helper
>>>>>    after skb was built
>>>> I disagree strongly here - it makes no sense.
>>>>
>>>> Why do you want to explicit fallback to Generic-XDP?
>>>> (... then all the performance gain is gone!)
>>>> And besides, a couple of function calls later, the generic XDP code
>>>> will/can get invoked anyhow...
>>>>    
>>> Hi, can we get EWMA to ensure that for the majority of cases we have the
>>> extra head room? Seems we could just over-estimate the size by N bytes. In
>>> some cases we may under-estimate and then would need to fall back to
>>> generic XDP or otherwise grow the buffer, which of course would be
>>> painful and slow, but presumably would happen rarely.
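
In code terms, the over-estimation idea is just biasing the moving average
up by a fixed margin when picking the next RX buffer length; a rough sketch
(plain C; the names and the 320-byte margin are illustrative, this is not
the driver's DECLARE_EWMA machinery):

    /* Margin roughly SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) on
     * x86_64; purely illustrative, not a kernel constant. */
    #define XDP_TAIL_MARGIN 320u

    static unsigned int ewma_len;  /* running average of packet length */

    static unsigned int next_rx_buf_len(unsigned int last_pkt_len)
    {
            /* weight: 1/8 new sample, 7/8 history */
            ewma_len = (ewma_len * 7 + last_pkt_len) / 8;
            return ewma_len + XDP_TAIL_MARGIN;
    }
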
>> Hmmm... (first of all, it is missing tail-room, not head-room).
>> Second, having all this extra size-estimating code and fallback options
>> leaves a very bad taste in my mouth... this sounds like a sure way to
>> kill performance.
>>
>>
>>> I think it would be much better to keep this feature than to kill it and
>>> make it even more painful to get XDP working on virtio.
>> Based on your request, I'm going to fix as much as possible of the
>> XDP code path in the virtio_net driver... I now have 4 fix patches...
>>
> Thanks a lot!
>
>> There is no way around disabling XDP_REDIRECT in receive_mergeable(),
>> as XDP does not have a way to detect/know the "data_hard_end" of the
>> data "frame".
>>
>>> Disabling EWMA also seems reasonable to me.
>> To me, it seems more reasonable to have a separate RX function call
>> when an XDP program gets attached, and in that process change the
>> memory model so it is compatible with XDP.
>>
> I would be OK with that but would be curious to see what Jason and
> Michael think. When I originally wrote the XDP support for virtio, the
> XDP infra was still primitive and we didn't have metadata, cpu maps,
> etc.

Yes, that's why cpumap fails.

>   yet. I suspect there might need to be some additional coordination
> between guest and host though to switch the packet modes. If I recall
> this was where some of the original trouble came from.

Maybe, but I think just having a separate refill function should be
sufficient? Then we could reuse the existing code to deal with e.g.
synchronization.

>
>>   
>>>> Take a step back:
>>>>   What is the reason/use-case for implementing XDP inside virtio_net?
>>>>
>>>> From a DDoS/performance perspective, XDP in virtio_net happens on the
>>>> "wrong side", as it is activated _inside_ the guest OS, which is too
>>>> late for a DDoS filter, as the guest kick/switch overhead has already
>>>> occurred.
>>>>    
>>> The hypervisor may not "know" how to detect DDoS if it's specific to
>>> the VM's domain. In these cases we aren't measuring pps but are looking
>>> at cycles/packet. In this case reducing cycles/packet frees up CPU
>>> for useful work. Here I expect we can see a real CPU % drop in the VM by
>>> using XDP.
>> Well, my benchmarks show that the bottleneck is not the cycles spent or
>> saved in the RX code; the bottleneck is in transitions into the
>> hypervisor/host-OS... specifically, I tested that if I increase the
>> guest/virtio_net NAPI-poll budget and avoid an emulated PCI-notify in
>> virtio_ring.c (vq->notify), then I get much, much better numbers...
>>
> Interesting.

Yes, this sounds like an increase in the number of packets batched during TX.

> Another thing that might come up in a bit is AF_XDP
> in virtio. I suspect that is going to be a very interesting use
> case for some folks. But, lots of things need to fall into place
> before we get there.

From another point of view, the functionality of AF_XDP and
tap+vhost_net+XDP_REDIRECT looks rather similar. The only difference is
the userspace API. It would be interesting to compare the performance
between them (in non-zerocopy mode).

>
> Looks like there will be a Linux Plumbers Conference networking
> track again. Maybe we should consider a virtualization session (BOF?)
> if there is enough interest.
>
>>> The other use case is that once we have a fast in-kernel NIC-to-VM path
>>> we can expect, from your numbers below, 3+ Mpps. I seem to remember
>>> from Jason's netdevconf talk that he had some experimental numbers
>>> that were even better. The other case is that the hypervisor is not Linux
>>> and is feeding packets even faster; DPDK numbers, again from Jason's
>>> slides, seemed to show 11+ Mpps. XDP makes a lot of sense here IMO.
>> If the hypervisor schedules the guest better, or the DPDK delivery setup
>> makes the guest busy-poll, then sure, we are going to see better
>> numbers, and _then_ the cycles saved by XDP-ddos-drop will be worth it.
>>
>>
>>> (those specific pps numbers I pulled from memory but the point is
>>> feeding many Mpps into a VM should be doable)
>>>
>>> The last thing is we may see hardware VFs emulating virtio at some
>>> point. Then XDP would be needed. With the newer virtio spec under
>>> development my impression is the hardware emulation piece is becoming
>>> more of a focus. But, someone actually working on that could probably
>>> provide a more informed comment.

FYI, here is a talk from the last KVM Forum about hardware virtio support:

https://events.static.linuxfound.org/sites/events/files/slides/KVM17%27-vDPA.pdf

>>>
>>>> I do use XDP_DROP inside the guest (driver virtio_net), but just to
>>>> perform what I call zoom-in benchmarking, using perf-record to isolate
>>>> the early RX code path in the guest.  (Using an iptables "raw" table
>>>> drop is almost as useful for that purpose).
>>>>    
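
(For anyone reproducing that kind of zoom-in benchmark: the XDP program
involved is trivial. A minimal sketch, compiled with clang -target bpf
against the libbpf headers; the program and section names are just
examples.)

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int xdp_drop_all(struct xdp_md *ctx)
    {
            /* Drop every frame as early as possible in the driver RX path. */
            return XDP_DROP;
    }

    char _license[] SEC("license") = "GPL";
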
>>> I suspect customers will eventually start using this in VMs for
>>> real use cases once the infrastructure and capabilities mature and
>>> kernel versions in deployed VMs catch up.
>>>
>>>> 	
>>>> The XDP ndo_xdp_xmit in tuntap/tun.c (that you also implemented) is
>>>> significantly more interesting, as it allows us to skip large parts of
>>>> the network stack and redirect from a physical device (ixgbe) into a
>>>> guest device.  I ran a benchmark:
>>>>   - 0.5 Mpps with normal code path into device with driver tun
>>>>   - 3.7 Mpps with XDP_REDIRECT from ixgbe into same device
>>> Yep also very interesting but a different use case. This is accelerating
>>> the hypervisor vswitch. The above is accelerating the VM domain.
>> Yes, you also brought this up above... IMHO as the first step we need
>> to accelerate Linux "hypervisor-vswitch" performance into the guest,
>> and afterwards we can add XDP optimizations inside the guest.  (I do
>> acknowledge that a hypervisor other than Linux might have solved this
>> already, and that DPDK already has a faster delivery trick
>> into VMs/virtio ... but Linux needs this too).
>>
> I agree. I just want to avoid ripping out what we have in virtio if
> it can be avoided by fixing issues. Looks like you were able to
> fix the major issues at least!
>
> As an aside I suspect looking at AF_XDP as a backend for virtio may
> prove fruitful. But that is just a hunch; I have zero data.

That would require some work on vhost, which currently assumes a
sendmsg/recvmsg API. I'm not sure whether or not we could do this easily,
but it would be very interesting.

>
>>>> Plus, there are indications that 3.7Mpps is not the real limit, as
>>>> the guest CPU doing XDP_DROP is 75% idle... thus this is likely a
>>>> scheduling + queue size issue.
>>>>    
>>> The target I had in mind is about 6Mpps L2fwd tests.
>> I did another L2fwd test inside the guest, via XDP_REDIRECT, but that
>> was rather depressing...
>>
>> The first trick to get more packets per sec than Linux can currently deliver
>> into a virtio_net device was to use ixgbe as input and XDP_REDIRECT into a
>> tuntap/tun.c device's ndo_xdp_xmit() as described above; I get approx 4 Mpps.
>>
>>   10,609,516 pkt/s = RX ixgbe redirect into tun ndo_xdp_xmit()
>>    4,063,029 pkt/s = Seen by XDP in guest trying to XDP_REDIRECT
>>    1,327,469 pkt/s = Actual TX seen on another virtio_net device
>>    1,469,671 pkt/s = For comparison: normal Linux forwarding (IpForwDatagrams)
>>
>> Thus, the XDP_REDIRECT inside the guest is actually SLOWER than normal
>> Linux routing... It looks like the major cost or bottleneck lies in
>> virtnet_xdp_xmit (ndo_xdp_xmit).  It seems that virtqueue_add() supports
>> sending more frames at a time, but virtnet_xdp_xmit() only sends a
>> single frame.

AFAIK it does not. virtqueue_add() only supports adding one frame at a time.

I'll try to find some time to look at this.

Thanks

>>    This would actually be a good argument for changing the
>> ndo_xdp_xmit API to support bulking...
>>
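
Purely to illustrate the bulking idea (a hypothetical prototype, not an
existing hook): instead of the current per-frame ndo_xdp_xmit(dev, xdp),
the stack could hand the driver an array so virtio_net can queue several
descriptors and kick the virtqueue once per batch, e.g.:

    /* Hypothetical bulked variant, for illustration only: */
    int (*ndo_xdp_xmit_bulk)(struct net_device *dev,
                             struct xdp_buff **xdp, unsigned int n);

    /* virtio_net could then do n x virtqueue_add_outbuf() followed by a
     * single virtqueue_kick(), amortizing the notification cost. */
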
> More work :) But I don't see any reason, other than engineering effort,
> why it can't be made to work.
>
> FWIW, on the topic of XDP redirect, it would be great if we could get the
> redirect API into shape so it can be consumed by other NIC drivers.
> Thanks for doing this though, interesting stuff.
>
> Thanks,
> John
