netdev - Re: [PATCH bpf-next 00/15] Introducing AF

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180424022124-mutt-send-email-mst@kernel.org>
Date:   Tue, 24 Apr 2018 02:22:24 +0300
From:   "Michael S. Tsirkin" <mst@...hat.com>
To:     Björn Töpel <bjorn.topel@...il.com>
Cc:     magnus.karlsson@...el.com, alexander.h.duyck@...el.com,
        alexander.duyck@...il.com, john.fastabend@...il.com, ast@...com,
        brouer@...hat.com, willemdebruijn.kernel@...il.com,
        daniel@...earbox.net, netdev@...r.kernel.org,
        Björn Töpel <bjorn.topel@...el.com>,
        michael.lundkvist@...csson.com, jesse.brandeburg@...el.com,
        anjali.singhai@...el.com, qi.z.zhang@...el.com
Subject: Re: [PATCH bpf-next 00/15] Introducing AF_XDP support

On Mon, Apr 23, 2018 at 03:56:04PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@...el.com>
> 
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work
> has already been accepted. We will publish our zero-copy support for
> RX and TX on top of his patch sets at a later point in time.
> 
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
> 
> This new dedicated packet buffer area is call a UMEM. It consists of a
> number of equally size frames and each frame has a unique frame id. A
> descriptor in one of the queues references a frame by referencing its
> frame id. The user space allocates memory for this UMEM using whatever
> means it feels is most appropriate (malloc, mmap, huge pages,
> etc). This memory area is then registered with the kernel using the new
> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
> and the COMPLETION queue. The fill queue is used by the application to
> send down frame ids for the kernel to fill in with RX packet
> data. References to these frames will then appear in the RX queue of
> the XSK once they have been received. The completion queue, on the
> other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
> 
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device, and it is not until bind is
> completed that traffic starts to flow. Note that in this RFC, all
> packet data is copied out to user-space.
> 
> A new feature in this RFC is that the UMEM can be shared between
> processes, if desired. If a process wants to do this, it simply skips
> the registration of the UMEM and its corresponding two queues, sets a
> flag in the bind call and submits the XSK of the process it would like
> to share UMEM with as well as its own newly created XSK socket. The
> new process will then receive frame id references in its own RX queue
> that point to this shared UMEM. Note that since the queue structures
> are single-consumer / single-producer (for performance reasons), the
> new process has to create its own socket with associated RX and TX
> queues, since it cannot share this with the other process. This is
> also the reason that there is only one set of FILL and COMPLETION
> queues per UMEM. It is the responsibility of a single process to
> handle the UMEM. If multiple-producer / multiple-consumer queues are
> implemented in the future, this requirement could be relaxed.
> 
> How is then packets distributed between these two XSK? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently mandatory
> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> to get any traffic to user space through the XSK.
> 
> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
> driver does not have support for XDP, or XDP_SKB is explicitly chosen
> when loading the XDP program, XDP_SKB mode is employed that uses SKBs
> together with the generic XDP support and copies out the data to user
> space. A fallback mode that works for any network device. On the other
> hand, if the driver has support for XDP, it will be used by the AF_XDP
> code to provide better performance, but there is still a copy of the
> data into user space.
> 
> There is a xdpsock benchmarking/test application included that
> demonstrates how to use AF_XDP sockets with both private and shared
> UMEMs. Say that you would like your UDP traffic from port 4242 to end
> up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
> for this:
> 
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
> 
> Running the rxdrop benchmark in XDP_DRV mode can then be done
> using:
> 
>       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
> 
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
> 
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by commercial packet generator HW that is
> generating packets at full 40 Gbit/s line rate.
> 
> AF_XDP performance 64 byte packets. Results from RFC V2 in parenthesis.
> Benchmark   XDP_SKB   XDP_DRV
> rxdrop       2.9(3.0)   9.4(9.3)  
> txpush       2.5(2.2)   NA*
> l2fwd        1.9(1.7)   2.4(2.4) (TX using XDP_SKB in both cases)
> 
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV
> rxdrop       2.1(2.2)   3.3(3.1)  
> l2fwd        1.4(1.1)   1.8(1.7) (TX using XDP_SKB in both cases)
> 
> * NA since we have no support for TX using the XDP_DRV infrastructure
>   in this RFC. This is for a future patch set since it involves
>   changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>   Dangaard Brouer.
> 
> XDP performance on our system as a base line:
> 
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32,921,521  0
> 
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3,289,491   0
> 
> Changes from RFC V2:
> 
> * Optimizations and simplifications to the ring structures inspired by
>   ptr_ring.h 
> * Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
>   consistent with AF_PACKET
> * Support for only having an RX queue or a TX queue defined
> * Some bug fixes and code cleanup
> 
> The structure of the patch set is as follows:
> 
> Patches 1-2: Basic socket and umem plumbing 
> Patches 3-10: RX support together with the new XSKMAP
> Patches 11-14: TX support
> Patch 15: Sample application
> 
> We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
> Clean up btf.h in uapi")
> 
> Questions:
> 
> * How to deal with cache alignment for uapi when different
>   architectures can have different cache line sizes? We have just
>   aligned it to 64 bytes for now, which works for many popular
>   architectures, but not all. Please advise.
> 
> To do:
> 
> * Optimize performance
> 
> * Kernel selftest
> 
> Post-series plan:
> 
> * Kernel load module support of AF_XDP would be nice. Unclear how to
>   achieve this though since our XDP code depends on net/core.
> 
> * Support for AF_XDP sockets without an XPD program loaded. In this
>   case all the traffic on a queue should go up to the user space socket.
> 
> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
>   XDP_PASS" for a tcpdump-like functionality.
> 
> * And of course getting to zero-copy support in small increments. 
> 
> Thanks: Björn and Magnus
> 
> Björn Töpel (8):
>   net: initial AF_XDP skeleton
>   xsk: add user memory registration support sockopt
>   xsk: add Rx queue setup and mmap support
>   xdp: introduce xdp_return_buff API
>   xsk: add Rx receive functions and poll support
>   bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>   xsk: wire up XDP_DRV side of AF_XDP
>   xsk: wire up XDP_SKB side of AF_XDP
> 
> Magnus Karlsson (7):
>   xsk: add umem fill queue support and mmap
>   xsk: add support for bind for Rx
>   xsk: add umem completion queue support and mmap
>   xsk: add Tx queue setup and mmap support
>   xsk: support for Tx
>   xsk: statistics support
>   samples/bpf: sample application for AF_XDP sockets
> 
>  MAINTAINERS                         |   8 +
>  include/linux/bpf.h                 |  26 +
>  include/linux/bpf_types.h           |   3 +
>  include/linux/filter.h              |   2 +-
>  include/linux/socket.h              |   5 +-
>  include/net/xdp.h                   |   1 +
>  include/net/xdp_sock.h              |  46 ++
>  include/uapi/linux/bpf.h            |   1 +
>  include/uapi/linux/if_xdp.h         |  87 ++++
>  kernel/bpf/Makefile                 |   3 +
>  kernel/bpf/verifier.c               |   8 +-
>  kernel/bpf/xskmap.c                 | 286 +++++++++++
>  net/Kconfig                         |   1 +
>  net/Makefile                        |   1 +
>  net/core/dev.c                      |  34 +-
>  net/core/filter.c                   |  40 +-
>  net/core/sock.c                     |  12 +-
>  net/core/xdp.c                      |  15 +-
>  net/xdp/Kconfig                     |   7 +
>  net/xdp/Makefile                    |   2 +
>  net/xdp/xdp_umem.c                  | 256 ++++++++++
>  net/xdp/xdp_umem.h                  |  65 +++
>  net/xdp/xdp_umem_props.h            |  23 +
>  net/xdp/xsk.c                       | 704 +++++++++++++++++++++++++++
>  net/xdp/xsk_queue.c                 |  73 +++
>  net/xdp/xsk_queue.h                 | 245 ++++++++++
>  samples/bpf/Makefile                |   4 +
>  samples/bpf/xdpsock.h               |  11 +
>  samples/bpf/xdpsock_kern.c          |  56 +++
>  samples/bpf/xdpsock_user.c          | 947 ++++++++++++++++++++++++++++++++++++
>  security/selinux/hooks.c            |   4 +-
>  security/selinux/include/classmap.h |   4 +-
>  32 files changed, 2945 insertions(+), 35 deletions(-)
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 kernel/bpf/xskmap.c
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xdp_umem.c
>  create mode 100644 net/xdp/xdp_umem.h
>  create mode 100644 net/xdp/xdp_umem_props.h
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
>  create mode 100644 samples/bpf/xdpsock.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_user.c

Is there a chance of Documentation/networking/af_xdp.txt ?


> 
> -- 
> 2.14.1