Message-ID: <CAJ+HfNisVu0yUvErNt3RcqMgW7Gk66B4qS42excZkR+5XxnoEA@mail.gmail.com>
Date: Fri, 27 Apr 2018 14:21:25 +0200
From: Björn Töpel <bjorn.topel@...il.com>
To: Bjorn Topel <bjorn.topel@...il.com>,
"Karlsson, Magnus" <magnus.karlsson@...el.com>,
"Duyck, Alexander H" <alexander.h.duyck@...el.com>,
Alexander Duyck <alexander.duyck@...il.com>,
John Fastabend <john.fastabend@...il.com>,
Alexei Starovoitov <ast@...com>,
Jesper Dangaard Brouer <brouer@...hat.com>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>,
Daniel Borkmann <daniel@...earbox.net>,
"Michael S. Tsirkin" <mst@...hat.com>,
Netdev <netdev@...r.kernel.org>
Cc: Björn Töpel <bjorn.topel@...el.com>,
michael.lundkvist@...csson.com,
"Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
"Singhai, Anjali" <anjali.singhai@...el.com>,
"Zhang, Qi Z" <qi.z.zhang@...el.com>
Subject: Re: [PATCH bpf-next v2 00/15] Introducing AF_XDP support
2018-04-27 14:17 GMT+02:00 Björn Töpel <bjorn.topel@...il.com>:
> From: Björn Töpel <bjorn.topel@...el.com>
>
> This patch set introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This patch set only supports copy-mode
> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> for RX using the XDP_DRV path. Zero-copy support requires XDP and
> driver changes that Jesper Dangaard Brouer is working on. Some of his
> work has already been accepted. We will publish our zero-copy support
> for RX and TX on top of his patch sets at a later point in time.
>
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
>
> This new dedicated packet buffer area is called a UMEM. It consists of a
> number of equally sized frames and each frame has a unique frame id. A
> descriptor in one of the queues references a frame by referencing its
> frame id. The user space allocates memory for this UMEM using whatever
> means it feels is most appropriate (malloc, mmap, huge pages,
> etc). This memory area is then registered with the kernel using the new
> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
> and the COMPLETION queue. The fill queue is used by the application to
> send down frame ids for the kernel to fill in with RX packet
> data. References to these frames will then appear in the RX queue of
> the XSK once they have been received. The completion queue, on the
> other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
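To make the frame id addressing above concrete, here is a minimal
user-space sketch of how a frame id could be translated into a buffer
address. It is only an illustration, not code from the patch set: the
names struct umem_model, umem_model_init(), umem_frame_addr() and
umem_model_free() are hypothetical, and the only assumption taken from
the text is that the UMEM is one contiguous area of equally sized
frames.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space model of a UMEM: one contiguous area that is
 * split into equally sized frames, each identified by its frame id. */
struct umem_model {
	uint8_t *area;       /* start of the user-allocated buffer area */
	uint32_t frame_size; /* size of every frame in bytes */
	uint32_t nframes;    /* number of frames in the area */
};

/* Allocate a zeroed area; returns 0 on success and -1 on failure. */
static int umem_model_init(struct umem_model *u, uint32_t nframes,
			   uint32_t frame_size)
{
	u->area = calloc(nframes, frame_size);
	if (!u->area)
		return -1;
	u->frame_size = frame_size;
	u->nframes = nframes;
	return 0;
}

/* Translate a frame id into the address of that frame's first byte. */
static uint8_t *umem_frame_addr(const struct umem_model *u, uint32_t frame_id)
{
	assert(frame_id < u->nframes);
	return u->area + (size_t)frame_id * u->frame_size;
}

/* Release the area again. */
static void umem_model_free(struct umem_model *u)
{
	free(u->area);
	u->area = NULL;
}
```

With four frames of 2048 bytes (an arbitrary size picked for the
example), frame id 3 would start 6144 bytes into the area.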
>
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device, and it is not until bind is
> completed that traffic starts to flow. Note that in this patch set,
> all packet data is copied out to user-space.
>
> A new feature in this patch set is that the UMEM can be shared between
> processes, if desired. If a process wants to do this, it simply skips
> the registration of the UMEM and its corresponding two queues, sets a
> flag in the bind call and submits the XSK of the process it would like
> to share the UMEM with as well as its own newly created XSK. The
> new process will then receive frame id references in its own RX queue
> that point to this shared UMEM. Note that since the queue structures
> are single-consumer / single-producer (for performance reasons), the
> new process has to create its own socket with associated RX and TX
> queues, since it cannot share this with the other process. This is
> also the reason that there is only one set of FILL and COMPLETION
> queues per UMEM. It is the responsibility of a single process to
> handle the UMEM. If multiple-producer / multiple-consumer queues are
> implemented in the future, this requirement could be relaxed.
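As an illustration of the single-producer / single-consumer constraint
described above, here is a small sketch of such a ring carrying frame
ids. It is not the layout used in net/xdp/xsk_queue.h; struct
spsc_ring, ring_produce(), ring_consume() and RING_SIZE are
hypothetical names, and C11 atomics stand in for the kernel's own
barriers. The point is that each index has exactly one writer, which
is why a second process cannot safely share the same queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical size; must be a power of two */

static_assert((RING_SIZE & (RING_SIZE - 1)) == 0,
	      "RING_SIZE must be a power of two");

struct spsc_ring {
	_Atomic uint32_t producer; /* written only by the producer */
	_Atomic uint32_t consumer; /* written only by the consumer */
	uint32_t ids[RING_SIZE];   /* frame ids in flight */
};

/* Producer side: returns false if the ring is full. */
static bool ring_produce(struct spsc_ring *r, uint32_t frame_id)
{
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

	if (prod - cons == RING_SIZE)
		return false;
	r->ids[prod & (RING_SIZE - 1)] = frame_id;
	/* Publish the entry only after it has been written. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_consume(struct spsc_ring *r, uint32_t *frame_id)
{
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

	if (prod == cons)
		return false;
	*frame_id = r->ids[cons & (RING_SIZE - 1)];
	/* Hand the slot back only after it has been read. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return true;
}
```

The indices run freely and are masked on access, so full and empty are
told apart without sacrificing a slot. Two producers calling
ring_produce() concurrently could both claim the same slot, which is
the case the text rules out.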
>
> How are packets then distributed between these two XSKs? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently mandatory
> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> to get any traffic to user space through the XSK.
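The redirect rules above can be summarized in a small user-space
model. This is not the code in kernel/bpf/xskmap.c; struct xsk_model,
struct xskmap_model, xskmap_redirect() and XSKMAP_ENTRIES are
hypothetical names used only to restate the checks from the text: an
empty slot drops the packet, and so does a socket bound to a different
device or queue. Treating an out-of-range index as a drop is an
additional assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XSKMAP_ENTRIES 4u /* hypothetical map size */

/* What the model needs to know about a bound XSK. */
struct xsk_model {
	uint32_t ifindex;  /* device the socket was bound to */
	uint32_t queue_id; /* queue id the socket was bound to */
};

struct xskmap_model {
	const struct xsk_model *slots[XSKMAP_ENTRIES]; /* NULL means empty */
};

enum verdict { VERDICT_DROP, VERDICT_DELIVER };

/* Decide what happens to a packet that arrived on (rx_ifindex,
 * rx_queue) and was redirected to the given map index. */
static enum verdict xskmap_redirect(const struct xskmap_model *m,
				    uint32_t index, uint32_t rx_ifindex,
				    uint32_t rx_queue)
{
	const struct xsk_model *xs;

	assert(m != NULL);
	if (index >= XSKMAP_ENTRIES)
		return VERDICT_DROP; /* assumption: out of range drops */
	xs = m->slots[index];
	if (!xs)
		return VERDICT_DROP; /* map is empty at that index */
	if (xs->ifindex != rx_ifindex || xs->queue_id != rx_queue)
		return VERDICT_DROP; /* XSK bound to another device/queue */
	return VERDICT_DELIVER;
}
```

In this model, a socket bound to queue 16 only ever sees packets that
arrived on queue 16 of its own device, regardless of which map index
the XDP program picks.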
>
> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
> driver does not have support for XDP, or XDP_SKB is explicitly chosen
> when loading the XDP program, XDP_SKB mode is employed that uses SKBs
> together with the generic XDP support and copies out the data to user
> space. This is a fallback mode that works for any network device. On the other
> hand, if the driver has support for XDP, it will be used by the AF_XDP
> code to provide better performance, but there is still a copy of the
> data into user space.
>
> There is an xdpsock benchmarking/test application included that
> demonstrates how to use AF_XDP sockets with both private and shared
> UMEMs. Say that you would like your UDP traffic from port 4242 to end
> up in queue 16, on which we will enable AF_XDP. Here, we use ethtool
> for this:
>
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
>
> Running the rxdrop benchmark in XDP_DRV mode can then be done
> using:
>
>       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
>
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TX/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, produced by commercial packet generator HW
> running at full 40 Gbit/s line rate.
>
> AF_XDP performance 64 byte packets. Results from V1 in parentheses.
> Benchmark   XDP_SKB    XDP_DRV
> rxdrop      3.0(2.9)   9.5(9.4)
> txpush      2.5(2.5)   NA*
> l2fwd       1.9(1.9)   2.5(2.4)   (TX using XDP_SKB in both cases)
>
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB    XDP_DRV
> rxdrop      2.2(2.1)   3.3(3.3)
> l2fwd       1.4(1.4)   1.8(1.8)   (TX using XDP_SKB in both cases)
>
> * NA since we have no support for TX using the XDP_DRV infrastructure
> in this patch set. This is for a future patch set since it involves
> changes to the XDP NDOs. Some of this has been upstreamed by Jesper
> Dangaard Brouer.
>
> XDP performance on our system as a baseline:
>
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32,921,521  0
>
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3,289,491   0
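For context on the baseline numbers, the theoretical packet rate of a
link can be worked out from the frame size. The sketch below assumes
that the quoted 64 and 1500 byte sizes are full Ethernet frame sizes
and that each frame costs an extra 20 bytes on the wire (preamble,
start-of-frame delimiter and inter-frame gap); neither assumption is
stated in the cover letter, and line_rate_pps() and ETH_WIRE_OVERHEAD
are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-frame overhead on the wire: 7 bytes preamble, 1 byte
 * start-of-frame delimiter and 12 bytes inter-frame gap. */
#define ETH_WIRE_OVERHEAD 20u

/* Theoretical maximum packets per second for a given link speed in
 * bits per second and frame size in bytes, rounded down. */
static uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	assert(frame_bytes > 0);
	return link_bps / ((uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD) * 8u);
}
```

Under these assumptions a 40 Gbit/s link carries at most 3,289,473
frames of 1500 bytes per second, which is very close to the 3,289,491
pps measured for XDP-RX, so that run looks limited by the link rather
than by the CPU. For 64 byte frames the limit is 59,523,809 pps, so
the measured 32,921,521 pps is roughly 55% of line rate.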
>
> Changes from V1:
>
> * Fixes to bugs spotted by Will in his review
> * Implemented the performance optimization to BPF_MAP_TYPE_XSKMAP
> suggested by Will
> * Refactored packet_direct_xmit to become a common function
> in net/core/dev.c as suggested by Will
> * Added documentation as suggested by Jesper
> * Proper page unpinning as suggested by MST
> * Some minor code cleanups
>
> The structure of the patch set is as follows:
>
> Patches 1-3: Basic socket and umem plumbing
> Patches 4-9: RX support together with the new XSKMAP
> Patches 10-13: TX support
> Patch 14: Statistics support with getsockopt()
> Patch 15: Sample application
>
> We based this patch set on bpf-next commit
>
Oops, I pressed play on tape too soon. We based it on commit
79741a38b4a2 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next").
Björn
> To do for this patch set:
>
> * Syzkaller torture session being worked on
>
> Post-series plan:
>
> * Optimize performance
>
> * Kernel selftest
>
> * Loadable kernel module support for AF_XDP would be nice. Unclear how to
> achieve this though since our XDP code depends on net/core.
>
> * Support for AF_XDP sockets without an XDP program loaded. In this
> case all the traffic on a queue should go up to the user space socket.
>
> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
> XDP_PASS" for a tcpdump-like functionality.
>
> * And of course getting to zero-copy support in small increments.
>
> Thanks: Björn and Magnus
>
> Björn Töpel (7):
> net: initial AF_XDP skeleton
> xsk: add user memory registration support sockopt
> xsk: add Rx queue setup and mmap support
> xsk: add Rx receive functions and poll support
> bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
> xsk: wire up XDP_DRV side of AF_XDP
> xsk: wire up XDP_SKB side of AF_XDP
>
> Magnus Karlsson (8):
> xsk: add umem fill queue support and mmap
> xsk: add support for bind for Rx
> xsk: add umem completion queue support and mmap
> xsk: add Tx queue setup and mmap support
> dev: packet: make packet_direct_xmit a common function
> xsk: support for Tx
> xsk: statistics support
> samples/bpf: sample application and documentation for AF_XDP sockets
>
> Documentation/networking/af_xdp.rst | 297 +++++++++++
> Documentation/networking/index.rst | 1 +
> MAINTAINERS | 8 +
> include/linux/bpf.h | 26 +
> include/linux/bpf_types.h | 3 +
> include/linux/filter.h | 2 +-
> include/linux/netdevice.h | 1 +
> include/linux/socket.h | 5 +-
> include/net/xdp.h | 1 +
> include/net/xdp_sock.h | 66 +++
> include/uapi/linux/bpf.h | 1 +
> include/uapi/linux/if_xdp.h | 87 ++++
> kernel/bpf/Makefile | 3 +
> kernel/bpf/verifier.c | 8 +-
> kernel/bpf/xskmap.c | 272 +++++++++++
> net/Kconfig | 1 +
> net/Makefile | 1 +
> net/core/dev.c | 73 ++-
> net/core/filter.c | 40 +-
> net/core/sock.c | 12 +-
> net/core/xdp.c | 15 +-
> net/packet/af_packet.c | 42 +-
> net/xdp/Kconfig | 7 +
> net/xdp/Makefile | 2 +
> net/xdp/xdp_umem.c | 260 ++++++++++
> net/xdp/xdp_umem.h | 67 +++
> net/xdp/xdp_umem_props.h | 23 +
> net/xdp/xsk.c | 656 +++++++++++++++++++++++++
> net/xdp/xsk_queue.c | 73 +++
> net/xdp/xsk_queue.h | 247 ++++++++++
> samples/bpf/Makefile | 4 +
> samples/bpf/xdpsock.h | 11 +
> samples/bpf/xdpsock_kern.c | 56 +++
> samples/bpf/xdpsock_user.c | 948 ++++++++++++++++++++++++++++++++++++
> security/selinux/hooks.c | 4 +-
> security/selinux/include/classmap.h | 4 +-
> 36 files changed, 3255 insertions(+), 72 deletions(-)
> create mode 100644 Documentation/networking/af_xdp.rst
> create mode 100644 include/net/xdp_sock.h
> create mode 100644 include/uapi/linux/if_xdp.h
> create mode 100644 kernel/bpf/xskmap.c
> create mode 100644 net/xdp/Kconfig
> create mode 100644 net/xdp/Makefile
> create mode 100644 net/xdp/xdp_umem.c
> create mode 100644 net/xdp/xdp_umem.h
> create mode 100644 net/xdp/xdp_umem_props.h
> create mode 100644 net/xdp/xsk.c
> create mode 100644 net/xdp/xsk_queue.c
> create mode 100644 net/xdp/xsk_queue.h
> create mode 100644 samples/bpf/xdpsock.h
> create mode 100644 samples/bpf/xdpsock_kern.c
> create mode 100644 samples/bpf/xdpsock_user.c
>
> --
> 2.14.1
>