Date:   Mon, 26 Mar 2018 16:20:28 -0700
From:   Tushar Dave <tushar.n.dave@...cle.com>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     Jesper Dangaard Brouer <brouer@...hat.com>,
        William Tu <u9012063@...il.com>,
        Björn Töpel <bjorn.topel@...il.com>,
        "Karlsson, Magnus" <magnus.karlsson@...el.com>,
        Alexander Duyck <alexander.h.duyck@...el.com>,
        John Fastabend <john.fastabend@...il.com>,
        Alexei Starovoitov <ast@...com>,
        Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        michael.lundkvist@...csson.com,
        "Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
        Anjali Singhai Jain <anjali.singhai@...el.com>,
        jeffrey.b.shaw@...el.com, ferruh.yigit@...el.com,
        qi.z.zhang@...el.com
Subject: Re: [RFC PATCH 00/24] Introducing AF_XDP support



On 03/26/2018 04:03 PM, Alexander Duyck wrote:
> On Mon, Mar 26, 2018 at 3:54 PM, Tushar Dave <tushar.n.dave@...cle.com> wrote:
>>
>>
>> On 03/26/2018 09:38 AM, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On Mon, 26 Mar 2018 09:06:54 -0700 William Tu <u9012063@...il.com> wrote:
>>>
>>>> On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.topel@...il.com>
>>>> wrote:
>>>>>
>>>>> From: Björn Töpel <bjorn.topel@...el.com>
>>>>>
>>>>> This RFC introduces a new address family called AF_XDP that is
>>>>> optimized for high performance packet processing and zero-copy
>>>>> semantics. Throughput improvements can be up to 20x compared to V2 and
>>>>> V3 for the micro benchmarks included. Would be great to get your
>>>>> feedback on it. Note that this is the follow up RFC to AF_PACKET V4
>>>>> from November last year. The feedback from that RFC submission and the
>>>>> presentation at NetdevConf in Seoul was to create a new address family
>>>>> instead of building on top of AF_PACKET. AF_XDP is this new address
>>>>> family.
>>>>>
>>>>> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
>>>>> level is that TX and RX descriptors are separated from packet
>>>>> buffers. An RX or TX descriptor points to a data buffer in a packet
>>>>> buffer area. RX and TX can share the same packet buffer so that a
>>>>> packet does not have to be copied between RX and TX. Moreover, if a
>>>>> packet needs to be kept for a while due to a possible retransmit, then
>>>>> the descriptor that points to that packet buffer can be changed to
>>>>> point to another buffer and reused right away. This again avoids
>>>>> copying data.
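
For readers skimming the thread, a minimal sketch of the idea (field names
and widths below are my own approximation, not the exact structs from the
patches):

    #include <linux/types.h>

    /* A descriptor carries only a reference into the shared packet buffer
     * area plus a length, so RX and TX can point at the same buffer, or be
     * repointed to another buffer, without copying the payload itself. */
    struct xdp_desc_sketch {
            __u64 offset;   /* offset of the frame in the packet buffer area */
            __u32 len;      /* frame length in bytes */
    };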
>>>>>
>>>>> The RX and TX descriptor rings are registered with the setsockopts
>>>>> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
>>>>> area is allocated by user space and registered with the kernel using
>>>>> the new XDP_MEM_REG setsockopt. All these three areas are shared
>>>>> between user space and kernel space. The socket is then bound with a
>>>>> bind() call to a device and a specific queue id on that device, and it
>>>>> is not until bind is completed that traffic starts to flow.
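
To make the flow concrete, here is a rough user-space sketch. Error handling
is omitted, and AF_XDP, SOL_XDP, XDP_MEM_REG, XDP_RX_RING, XDP_TX_RING and
struct sockaddr_xdp are assumed to come from the patched kernel headers; the
exact spellings are my reading of the text above and may not match the
patches:

    #include <stddef.h>
    #include <sys/socket.h>
    #include <linux/types.h>

    static int xsk_setup_sketch(int ifindex, __u32 queue_id,
                                void *mem, size_t mem_len,
                                void *ring_cfg, size_t ring_len)
    {
            int fd = socket(AF_XDP, SOCK_RAW, 0);

            /* Share the user-space packet buffer area with the kernel. */
            setsockopt(fd, SOL_XDP, XDP_MEM_REG, mem, mem_len);

            /* Register the RX and TX descriptor rings. */
            setsockopt(fd, SOL_XDP, XDP_RX_RING, ring_cfg, ring_len);
            setsockopt(fd, SOL_XDP, XDP_TX_RING, ring_cfg, ring_len);

            /* Bind to a device and a specific queue id on that device;
             * traffic only starts to flow once bind() has completed. */
            struct sockaddr_xdp addr = {
                    .sxdp_family   = AF_XDP,
                    .sxdp_ifindex  = ifindex,
                    .sxdp_queue_id = queue_id,
            };
            return bind(fd, (struct sockaddr *)&addr, sizeof(addr)) ? -1 : fd;
    }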
>>>>>
>>>>> An XDP program can be loaded to direct part of the traffic on that
>>>>> device and queue id to user space through a new redirect action in an
>>>>> XDP program called bpf_xdpsk_redirect that redirects a packet up to
>>>>> the socket in user space. All the other XDP actions work just as
>>>>> before. Note that the current RFC requires the user to load an XDP
>>>>> program to get any traffic to user space (for example all traffic to
>>>>> user space with the one-liner program "return
>>>>> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
>>>>> this requirement and sends all traffic from a queue to user space if
>>>>> an AF_XDP socket is bound to it.
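
The one-liner mentioned above, written out as a complete loadable XDP program
(the includes and section name follow the samples/bpf style and are my
assumptions; bpf_xdpsk_redirect() is the new helper introduced by this RFC
and would need to be declared by the patched helper headers):

    #include <linux/bpf.h>
    #include "bpf_helpers.h"        /* samples/bpf-style helper declarations */

    SEC("xdp")
    int xdp_sock_prog(struct xdp_md *ctx)
    {
            /* Send every packet seen on this queue up to the bound
             * AF_XDP socket. */
            return bpf_xdpsk_redirect();
    }

    char _license[] SEC("license") = "GPL";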
>>>>>
>>>>> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
>>>>> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
>>>>> is no specific mode called XDP_DRV_ZC). If the driver does not have
>>>>> support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
>>>>> program, XDP_SKB mode is employed; it uses SKBs together with the
>>>>> generic XDP support and copies the data out to user space. This is a
>>>>> fallback mode that works for any network device. On the other hand, if the
>>>>> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
>>>>> ndo_xdp_flush), these NDOs, without any modifications, will be used by
>>>>> the AF_XDP code to provide better performance, but there is still a
>>>>> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
>>>>> driver support with the zero-copy user space allocator that provides
>>>>> even better performance. In this mode, the networking HW (or SW driver
>>>>> if it is a virtual driver like veth) DMAs/puts packets straight into
>>>>> the packet buffer that is shared between user space and kernel
>>>>> space. The RX and TX descriptor queues of the networking HW are NOT
>>>>> shared to user space. Only the kernel can read and write these and it
>>>>> is the kernel driver's responsibility to translate these HW specific
>>>>> descriptors to the HW agnostic ones in the virtual descriptor rings
>>>>> that user space sees. This way, a malicious user space program cannot
>>>>> mess with the networking HW. This mode, though, requires some extensions
>>>>> to XDP.
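
In practice that choice is made when the XDP program is attached; below is a
sketch using libbpf's bpf_set_link_xdp_fd() and the existing XDP_FLAGS_*
attach flags, which is how generic vs. native XDP is selected today (whether
AF_XDP reuses exactly this mechanism is my assumption):

    #include <linux/if_link.h>   /* XDP_FLAGS_SKB_MODE, XDP_FLAGS_DRV_MODE */
    #include <bpf/libbpf.h>      /* bpf_set_link_xdp_fd(); the header that
                                  * declares it varies between libbpf versions */

    /* Generic (SKB) mode works on any device but copies data; native driver
     * mode needs ndo_bpf/ndo_xdp_xmit/ndo_xdp_flush support in the driver. */
    static int attach_xdp_sketch(int ifindex, int prog_fd, int use_skb_mode)
    {
            __u32 flags = use_skb_mode ? XDP_FLAGS_SKB_MODE : XDP_FLAGS_DRV_MODE;

            return bpf_set_link_xdp_fd(ifindex, prog_fd, flags);
    }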
>>>>>
>>>>> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
>>>>> buffer pool concept so that the same XDP driver code can be used for
>>>>> buffers allocated using the page allocator (XDP_DRV), the user-space
>>>>> zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
>>>>> allocator/cache/recycling mechanism. The ndo_bpf call has also been
>>>>> extended with two commands for registering and unregistering an XSK
>>>>> socket and is in the RX case mainly used to communicate some
>>>>> information about the user-space buffer pool to the driver.
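
Driver-side, I would expect the new ndo_bpf commands to slot in next to the
existing ones roughly like this (XDP_REGISTER_XSK is named in the text above,
XDP_UNREGISTER_XSK is my guess for its counterpart, and the my_driver_*
helpers are purely hypothetical placeholders):

    static int my_ndo_bpf(struct net_device *dev, struct netdev_bpf *bpf)
    {
            switch (bpf->command) {
            case XDP_SETUP_PROG:
                    return my_driver_setup_xdp(dev, bpf->prog);
            case XDP_REGISTER_XSK:
                    /* Learn about the user-space buffer pool bound to this
                     * queue so RX can use the zero-copy allocator. */
                    return my_driver_register_xsk(dev, bpf);
            case XDP_UNREGISTER_XSK:
                    return my_driver_unregister_xsk(dev, bpf);
            default:
                    return -EINVAL;
            }
    }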
>>>>>
>>>>> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
>>>>> but we ran into problems with this (further discussion in the
>>>>> challenges section) and had to introduce a new NDO called
>>>>> ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
>>>>> and an explicit queue id that packets should be sent out on. In
>>>>> contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
>>>>> sent from the xdp socket (associated with the dev and queue
>>>>> combination that was provided with the NDO call) using a callback
>>>>> (get_tx_packet), and when they have been transmitted it uses another
>>>>> callback (tx_completion) to signal completion of packets. These
>>>>> callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
>>>>> command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
>>>>> and thus does not clash with the XDP_REDIRECT use of
>>>>> ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
>>>>> (without ZC) is currently not supported for TX. Please have a look at
>>>>> the challenges section for further discussions.
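
My reading of the TX side, as a sketch (all names and signatures below are
guesses based on the description, not copied from the patches): the driver
receives two callbacks at XDP_REGISTER_XSK time and later pulls frames itself
whenever ndo_xdp_xmit_xsk() is invoked for a dev/queue pair.

    /* Hypothetical callback bundle handed to the driver via ndo_bpf. */
    struct xsk_tx_ops_sketch {
            /* Driver pulls the next frame to transmit; returns the buffer
             * address and fills in *len, or NULL when the TX ring is empty. */
            void *(*get_tx_packet)(struct net_device *dev, u32 queue_id,
                                   u32 *len);
            /* Driver signals that 'npkts' previously pulled frames have been
             * transmitted so user space can reuse their buffers. */
            void (*tx_completion)(struct net_device *dev, u32 queue_id,
                                  u32 npkts);
    };

    /* Hypothetical NDO: asynchronous and per dev/queue, with no xdp_buff
     * argument, which is what keeps it from clashing with XDP_REDIRECT's
     * use of ndo_xdp_xmit. */
    int (*ndo_xdp_xmit_xsk)(struct net_device *dev, u32 queue_id);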
>>>>>
>>>>> The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
>>>>> so the user needs to steer the traffic to the zero-copy enabled queue
>>>>> pair. Which queue to use is up to the user.
>>>>>
>>>>> For an untrusted application, HW packet steering to a specific queue
>>>>> pair (the one associated with the application) is a requirement, as
>>>>> the application would otherwise be able to see other user space
>>>>> processes' packets. If the HW cannot support the required packet
>>>>> steering, XDP_DRV or XDP_SKB mode has to be used, as these modes do
>>>>> not expose the NIC's packet buffer to user space; instead, packets are
>>>>> copied into user space from the NIC's packet buffer in the kernel.
>>>>>
>>>>> There is an xdpsock benchmarking/test application included. Say that
>>>>> you would like your UDP traffic from port 4242 to end up in queue 16,
>>>>> on which we will enable AF_XDP. Here, we use ethtool for this:
>>>>>
>>>>>         ethtool -N p3p2 rx-flow-hash udp4 fn
>>>>>         ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>>>>>             action 16
>>>>>
>>>>> Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
>>>>>
>>>>>         samples/bpf/xdpsock -i p3p2 -q 16 -l -N
>>>>>
>>>>> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
>>>>> can be displayed with "-h", as usual.
>>>>>
>>>>> We have run some benchmarks on a dual socket system with two Broadwell
>>>>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>>>>> cores, which gives a total of 28, but only two cores are used in these
>>>>> experiments: one for TX/RX and one for the user space application. The
>>>>> memory is DDR4 @ 2133 MT/s (1067 MHz); each DIMM is 8192 MB, and with
>>>>> 8 of those DIMMs in the system we have 64 GB of memory in total. The
>>>>> compiler used is gcc version 5.4.0 20160609. The NIC is an
>>>>> Intel I40E 40Gbit/s using the i40e driver.
>>>>>
>>>>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>>>>> byte packets, generated by commercial packet generator HW that is
>>>>> generating packets at full 40 Gbit/s line rate.
>>>>>
>>>>> XDP baseline numbers without this RFC:
>>>>> xdp_rxq_info --action XDP_DROP 31.3 Mpps
>>>>> xdp_rxq_info --action XDP_TX   16.7 Mpps
>>>>>
>>>>> XDP performance with this RFC i.e. with the buffer allocator:
>>>>> XDP_DROP 21.0 Mpps
>>>>> XDP_TX   11.9 Mpps
>>>>>
>>>>> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
>>>>> Benchmark   V2     V3     V4     V4+ZC
>>>>> rxdrop      0.67   0.73   0.74   33.7
>>>>> txpush      0.98   0.98   0.91   19.6
>>>>> l2fwd       0.66   0.71   0.67   15.5
>>>>>
>>>>> AF_XDP performance:
>>>>> Benchmark   XDP_SKB   XDP_DRV   XDP_DRV_ZC (all in Mpps)
>>>>> rxdrop      3.3       11.6      16.9
>>>>> txpush      2.2       NA*       21.8
>>>>> l2fwd       1.7       NA*       10.4
>>>>>
>>>>
>>>>
>>>> Hi,
>>>> I also did an evaluation of AF_XDP; however, the performance isn't as
>>>> good as above. I'd like to share the results and see if there are any
>>>> tuning suggestions.
>>>>
>>>> System:
>>>> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
>>>> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode
>>>
>>>
>>> Hmmm, why is X540-AT2 not able to use XDP natively?
>>>
>>>> AF_XDP performance:
>>>> Benchmark   XDP_SKB
>>>> rxdrop      1.27 Mpps
>>>> txpush      0.99 Mpps
>>>> l2fwd       0.85 Mpps
>>>
>>>
>>> Definitely too low...
>>>
>>> What is the performance if you drop packets via iptables?
>>>
>>> Command:
>>>    $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP
>>>
>>>> NIC configuration:
>>>> the command
>>>> "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
>>>> doesn't work on my ixgbe driver, so I use ntuple:
>>>>
>>>> ethtool -K enp10s0f0 ntuple on
>>>> ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
>>>> then
>>>> echo 1 > /proc/sys/net/core/bpf_jit_enable
>>>> ./xdpsock -i enp10s0f0 -r -S --queue=1
>>>>
>>>> I also take a look at perf result:
>>>> For rxdrop:
>>>>    86.56%  xdpsock  xdpsock           [.] main
>>>>     9.22%  xdpsock  [kernel.vmlinux]  [k] nmi
>>>>     4.23%  xdpsock  xdpsock           [.] xq_enq
>>>
>>>
>>> It looks very strange that you see non-maskable interrupts (NMI) being
>>> this high...
>>>
>>>
>>>>
>>>> For l2fwd:
>>>>    20.81%  xdpsock xdpsock             [.] main
>>>>    10.64%  xdpsock [kernel.vmlinux]    [k] clflush_cache_range
>>>
>>>
>>> Oh, clflush_cache_range is being called!
>>> Does your system use an IOMMU?
>>
>>
>> What's the implication here? Should the IOMMU be disabled?
>> I'm asking because I do see a huge difference while running pktgen tests
>> for my performance benchmarks, with and without intel_iommu.
>>
>>
>> -Tushar
> 
> For the Intel parts the IOMMU can be expensive primarily for Tx, since
> it should have minimal impact if the Rx pages are pinned/recycled. I
> am assuming the same is true here for AF_XDP, Bjorn can correct me if
> I am wrong.

Indeed. The Intel IOMMU has the least effect on RX because of the
premap/recycle scheme, but the TX DMA map and unmap is really expensive!

> 
> Basically the IOMMU can make creating/destroying a DMA mapping really
> expensive. The easiest way to work around it in the case of the Intel
> IOMMU is to boot with "iommu=pt", which will create an identity mapping
> for the host. The downside, though, is that you then have the entire
> system accessible to the device unless a new mapping is created for it
> by assigning it to a new IOMMU domain.

Yeah, that's what I would say: if you really want to use the Intel IOMMU
and don't want to take the performance hit, use 'iommu=pt'.

Good to have confirmation from you Alex. Thanks.

BTW, I don't want to derail this thread into an IOMMU discussion, but
even using 'pt' doesn't give you the same performance numbers that you
get with the Intel IOMMU disabled!

-Tushar

> 
> Thanks.
> 
> - Alex
> 
