Message-ID: <ad766caf-4656-befa-d55c-92d9c943fa15@oracle.com>
Date: Mon, 26 Mar 2018 15:54:18 -0700
From: Tushar Dave <tushar.n.dave@...cle.com>
To: Jesper Dangaard Brouer <brouer@...hat.com>,
William Tu <u9012063@...il.com>
Cc: Björn Töpel <bjorn.topel@...il.com>,
magnus.karlsson@...el.com,
Alexander Duyck <alexander.h.duyck@...el.com>,
Alexander Duyck <alexander.duyck@...il.com>,
John Fastabend <john.fastabend@...il.com>,
Alexei Starovoitov <ast@...com>,
willemdebruijn.kernel@...il.com,
Daniel Borkmann <daniel@...earbox.net>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
Björn Töpel <bjorn.topel@...el.com>,
michael.lundkvist@...csson.com, jesse.brandeburg@...el.com,
anjali.singhai@...el.com, jeffrey.b.shaw@...el.com,
ferruh.yigit@...el.com, qi.z.zhang@...el.com
Subject: Re: [RFC PATCH 00/24] Introducing AF_XDP support
On 03/26/2018 09:38 AM, Jesper Dangaard Brouer wrote:
>
> On Mon, 26 Mar 2018 09:06:54 -0700 William Tu <u9012063@...il.com> wrote:
>
>> On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.topel@...il.com> wrote:
>>> From: Björn Töpel <bjorn.topel@...el.com>
>>>
>>> This RFC introduces a new address family called AF_XDP that is
>>> optimized for high performance packet processing and zero-copy
>>> semantics. Throughput improvements can be up to 20x compared to V2 and
>>> V3 for the micro benchmarks included. Would be great to get your
>>> feedback on it. Note that this is the follow up RFC to AF_PACKET V4
>>> from November last year. The feedback from that RFC submission and the
>>> presentation at NetdevConf in Seoul was to create a new address family
>>> instead of building on top of AF_PACKET. AF_XDP is this new address
>>> family.
>>>
>>> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
>>> level is that TX and RX descriptors are separated from packet
>>> buffers. An RX or TX descriptor points to a data buffer in a packet
>>> buffer area. RX and TX can share the same packet buffer so that a
>>> packet does not have to be copied between RX and TX. Moreover, if a
>>> packet needs to be kept for a while due to a possible retransmit, then
>>> the descriptor that points to that packet buffer can be changed to
>>> point to another buffer and reused right away. This again avoids
>>> copying data.
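>>>
>>> As a rough sketch, a descriptor only references a frame in the shared
>>> packet buffer area instead of carrying the data itself. The struct
>>> below is illustrative and not necessarily the exact uapi layout:
>>>
>>>   /* One entry in an RX or TX descriptor ring. The packet data lives
>>>    * in the separately registered packet buffer area; the descriptor
>>>    * only points into it.
>>>    */
>>>   struct xdp_desc {
>>>           __u32 idx;      /* frame index in the packet buffer area */
>>>           __u32 len;      /* length of the packet data */
>>>           __u16 offset;   /* offset of the data within the frame */
>>>           __u8  flags;
>>>           __u8  padding[5];
>>>   };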
>>>
>>> The RX and TX descriptor rings are registered with the setsockopts
>>> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
>>> area is allocated by user space and registered with the kernel using
>>> the new XDP_MEM_REG setsockopt. All these three areas are shared
>>> between user space and kernel space. The socket is then bound with a
>>> bind() call to a device and a specific queue id on that device, and it
>>> is not until bind is completed that traffic starts to flow.
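>>>
>>> In user space, the setup then boils down to something like the
>>> following (heavily simplified sketch: the option structs are elided,
>>> and the socket level and sockaddr field names are illustrative rather
>>> than the exact uapi):
>>>
>>>   int fd = socket(AF_XDP, SOCK_RAW, 0);
>>>
>>>   /* Register the user-space allocated packet buffer area. */
>>>   setsockopt(fd, SOL_XDP, XDP_MEM_REG, &mem_reg, sizeof(mem_reg));
>>>
>>>   /* Create the RX and TX descriptor rings. */
>>>   setsockopt(fd, SOL_XDP, XDP_RX_RING, &ring_req, sizeof(ring_req));
>>>   setsockopt(fd, SOL_XDP, XDP_TX_RING, &ring_req, sizeof(ring_req));
>>>
>>>   /* Bind to a specific device and queue id; traffic only starts to
>>>    * flow once this succeeds.
>>>    */
>>>   struct sockaddr_xdp sxdp = {
>>>           .sxdp_family   = AF_XDP,
>>>           .sxdp_ifindex  = ifindex,
>>>           .sxdp_queue_id = queue_id,
>>>   };
>>>   bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));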
>>>
>>> An XDP program can be loaded to direct part of the traffic on that
>>> device and queue id to user space through a new redirect action in an
>>> XDP program called bpf_xdpsk_redirect that redirects a packet up to
>>> the socket in user space. All the other XDP actions work just as
>>> before. Note that the current RFC requires the user to load an XDP
>>> program to get any traffic to user space (for example all traffic to
>>> user space with the one-liner program "return
>>> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
>>> this requirement and sends all traffic from a queue to user space if
>>> an AF_XDP socket is bound to it.
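>>>
>>> Spelled out as a complete (if trivial) XDP program in C, that
>>> one-liner would look roughly like this, following the usual
>>> samples/bpf conventions:
>>>
>>>   #include <uapi/linux/bpf.h>
>>>   #include "bpf_helpers.h"
>>>
>>>   /* Redirect every packet on the bound queue to the AF_XDP socket. */
>>>   SEC("xdp_sock")
>>>   int xdp_sock_prog(struct xdp_md *ctx)
>>>   {
>>>           return bpf_xdpsk_redirect();
>>>   }
>>>
>>>   char _license[] SEC("license") = "GPL";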
>>>
>>> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
>>> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
>>> is no specific mode called XDP_DRV_ZC). If the driver does not have
>>> support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
>>> program, XDP_SKB mode is employed. It uses SKBs together with the
>>> generic XDP support and copies the data out to user space; this is a
>>> fallback mode that works for any network device. On the other hand, if the
>>> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
>>> ndo_xdp_flush), these NDOs, without any modifications, will be used by
>>> the AF_XDP code to provide better performance, but there is still a
>>> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
>>> driver support with the zero-copy user space allocator that provides
>>> even better performance. In this mode, the networking HW (or SW driver
>>> if it is a virtual driver like veth) DMAs/puts packets straight into
>>> the packet buffer that is shared between user space and kernel
>>> space. The RX and TX descriptor queues of the networking HW are NOT
>>> shared with user space. Only the kernel can read and write these, and it
>>> is the kernel driver's responsibility to translate these HW specific
>>> descriptors to the HW agnostic ones in the virtual descriptor rings
>>> that user space sees. This way, a malicious user space program cannot
>>> mess with the networking HW. This mode though requires some extensions
>>> to XDP.
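>>>
>>> Which of XDP_SKB and XDP_DRV you end up in follows from how the XDP
>>> program is attached; a minimal sketch, assuming the usual helper from
>>> tools/lib/bpf and the netlink flags from if_link.h:
>>>
>>>   #include <linux/if_link.h>  /* XDP_FLAGS_SKB_MODE, XDP_FLAGS_DRV_MODE */
>>>   #include "bpf/bpf.h"        /* bpf_set_link_xdp_fd() */
>>>
>>>   /* Force generic/SKB mode (what the xdpsock "-S" switch does): */
>>>   bpf_set_link_xdp_fd(ifindex, prog_fd, XDP_FLAGS_SKB_MODE);
>>>
>>>   /* Require native driver mode (the "-N" switch): */
>>>   bpf_set_link_xdp_fd(ifindex, prog_fd, XDP_FLAGS_DRV_MODE);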
>>>
>>> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
>>> buffer pool concept so that the same XDP driver code can be used for
>>> buffers allocated using the page allocator (XDP_DRV), the user-space
>>> zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
>>> allocator/cache/recycling mechanism. The ndo_bpf call has also been
>>> extended with two commands for registering and unregistering an XSK
>>> socket and is in the RX case mainly used to communicate some
>>> information about the user-space buffer pool to the driver.
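>>>
>>> In a driver this shows up as extra cases in its ndo_bpf handler,
>>> roughly along these lines (sketch only; XDP_REGISTER_XSK is the new
>>> command mentioned above, everything else here is illustrative):
>>>
>>>   static int mydrv_bpf(struct net_device *dev, struct netdev_bpf *bpf)
>>>   {
>>>           switch (bpf->command) {
>>>           case XDP_SETUP_PROG:
>>>                   return mydrv_setup_xdp_prog(dev, bpf->prog);
>>>           case XDP_REGISTER_XSK:
>>>                   /* Hand the user-space buffer pool (and the TX
>>>                    * callbacks) for one queue over to the driver.
>>>                    */
>>>                   return mydrv_register_xsk(dev, bpf);
>>>           case XDP_UNREGISTER_XSK:
>>>                   return mydrv_unregister_xsk(dev, bpf);
>>>           default:
>>>                   return -EINVAL;
>>>           }
>>>   }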
>>>
>>> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
>>> but we ran into problems with this (further discussion in the
>>> challenges section) and had to introduce a new NDO called
>>> ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
>>> and an explicit queue id that packets should be sent out on. In
>>> contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
>>> sent from the xdp socket (associated with the dev and queue
>>> combination that was provided with the NDO call) using a callback
>>> (get_tx_packet), and when they have been transmitted it uses another
>>> callback (tx_completion) to signal completion of packets. These
>>> callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
>>> command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
>>> and thus does not clash with the XDP_REDIRECT use of
>>> ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
>>> (without ZC) is currently not supported by TX. Please have a look at
>>> the challenges section for further discussions.
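>>>
>>> The rough shape of this TX interface (only the NDO and callback names
>>> below come from the patch set; the signatures are an illustrative
>>> guess):
>>>
>>>   /* Registered via ndo_bpf(XDP_REGISTER_XSK): */
>>>   struct xsk_tx_ops {
>>>           void *(*get_tx_packet)(void *xsk, u32 *len);  /* pull next frame */
>>>           void  (*tx_completion)(void *xsk, u32 count); /* frames sent */
>>>   };
>>>
>>>   /* Asynchronously kick transmission on a given dev/queue pair; the
>>>    * driver pulls frames via get_tx_packet() and later calls
>>>    * tx_completion() when the HW has sent them.
>>>    */
>>>   int (*ndo_xdp_xmit_xsk)(struct net_device *dev, u32 queue_id);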
>>>
>>> The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
>>> so the user needs to steer the traffic to the zero-copy enabled queue
>>> pair. Which queue to use is up to the user.
>>>
>>> For an untrusted application, HW packet steering to a specific queue
>>> pair (the one associated with the application) is a requirement, as
>>> the application would otherwise be able to see other user space
>>> processes' packets. If the HW cannot support the required packet
>>> steering, XDP_DRV or XDP_SKB mode has to be used, as these modes do not
>>> expose the NIC's packet buffer to user space: the packets are
>>> copied into user space from the NIC's packet buffer in the kernel.
>>>
>>> There is an xdpsock benchmarking/test application included. Say that
>>> you would like your UDP traffic from port 4242 to end up in queue 16,
>>> which we will enable AF_XDP on. Here, we use ethtool for this:
>>>
>>> ethtool -N p3p2 rx-flow-hash udp4 fn
>>> ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>>> action 16
>>>
>>> Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
>>>
>>> samples/bpf/xdpsock -i p3p2 -q 16 -l -N
>>>
>>> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
>>> can be displayed with "-h", as usual.
>>>
>>> We have run some benchmarks on a dual socket system with two Broadwell
>>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>>> cores, which gives a total of 28, but only two cores are used in these
>>> experiments: one for TX/RX and one for the user space application. The
>>> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
>>> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
>>> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
>>> Intel I40E 40Gbit/s using the i40e driver.
>>>
>>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>>> byte packets, generated by commercial packet generator HW running at
>>> full 40 Gbit/s line rate.
>>>
>>> XDP baseline numbers without this RFC:
>>> xdp_rxq_info --action XDP_DROP 31.3 Mpps
>>> xdp_rxq_info --action XDP_TX 16.7 Mpps
>>>
>>> XDP performance with this RFC i.e. with the buffer allocator:
>>> XDP_DROP 21.0 Mpps
>>> XDP_TX 11.9 Mpps
>>>
>>> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
>>> Benchmark V2 V3 V4 V4+ZC
>>> rxdrop 0.67 0.73 0.74 33.7
>>> txpush 0.98 0.98 0.91 19.6
>>> l2fwd 0.66 0.71 0.67 15.5
>>>
>>> AF_XDP performance:
>>> Benchmark XDP_SKB XDP_DRV XDP_DRV_ZC (all in Mpps)
>>> rxdrop 3.3 11.6 16.9
>>> txpush 2.2 NA* 21.8
>>> l2fwd 1.7 NA* 10.4
>>>
>>
>> Hi,
>> I also did an evaluation of AF_XDP; however, the performance isn't as
>> good as above.
>> I'd like to share the result and see if there are some tuning suggestions.
>>
>> System:
>> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
>> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode
>
> Hmmm, why is X540-AT2 not able to use XDP natively?
>
>> AF_XDP performance:
>> Benchmark XDP_SKB
>> rxdrop 1.27 Mpps
>> txpush 0.99 Mpps
>> l2fwd 0.85 Mpps
>
> Definitely too low...
>
> What is the performance if you drop packets via iptables?
>
> Command:
> $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP
>
>> NIC configuration:
>> the command
>> "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
>> doesn't work on my ixgbe driver, so I use ntuple:
>>
>> ethtool -K enp10s0f0 ntuple on
>> ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
>> then
>> echo 1 > /proc/sys/net/core/bpf_jit_enable
>> ./xdpsock -i enp10s0f0 -r -S --queue=1
>>
>> I also take a look at perf result:
>> For rxdrop:
>> 86.56% xdpsock xdpsock [.] main
>> 9.22% xdpsock [kernel.vmlinux] [k] nmi
>> 4.23% xdpsock xdpsock [.] xq_enq
>
> It looks very strange that you see non-maskable interrupts (NMI) being
> this high...
>
>
>> For l2fwd:
>> 20.81% xdpsock xdpsock [.] main
>> 10.64% xdpsock [kernel.vmlinux] [k] clflush_cache_range
>
> Oh, clflush_cache_range is being called!
> Do your system use an IOMMU ?
What's the implication here? Should the IOMMU be disabled?
I'm asking because I do see a huge difference when running pktgen tests
for my performance benchmarks, with and without intel_iommu.
-Tushar
>
>> 8.46% xdpsock [kernel.vmlinux] [k] xsk_sendmsg
>> 6.72% xdpsock [kernel.vmlinux] [k] skb_set_owner_w
>> 5.89% xdpsock [kernel.vmlinux] [k] __domain_mapping
>> 5.74% xdpsock [kernel.vmlinux] [k] alloc_skb_with_frags
>> 4.62% xdpsock [kernel.vmlinux] [k] netif_skb_features
>> 3.96% xdpsock [kernel.vmlinux] [k] ___slab_alloc
>> 3.18% xdpsock [kernel.vmlinux] [k] nmi
>
> Again high count for NMI ?!?
>
> Maybe you just forgot to tell perf that you want it to decode the
> bpf_prog correctly?
>
> https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
>
> Enable via:
> $ sysctl net/core/bpf_jit_kallsyms=1
>
> And use perf report (while BPF is STILL LOADED):
>
> $ perf report --kallsyms=/proc/kallsyms
>
> E.g. for emailing this you can use this command:
>
> $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
>
>
>> I observed that the i40e's XDP_SKB result is much better than my ixgbe's result.
>> I wonder, in XDP_SKB mode, does the driver make a performance difference?
>> Or is my CPU (E5-2440 v2 @ 1.90GHz) too old?
>
> I suspect some setup issue on your system.
>