Message-ID: <CAJ+HfNjJjVLPY_Si4-f91_o2HOQGCBmPuNN3cyAahpixTcRRXw@mail.gmail.com>
Date:   Tue, 24 Apr 2018 08:55:33 +0200
From:   Björn Töpel <bjorn.topel@...il.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     "Karlsson, Magnus" <magnus.karlsson@...el.com>,
        "Duyck, Alexander H" <alexander.h.duyck@...el.com>,
        Alexander Duyck <alexander.duyck@...il.com>,
        John Fastabend <john.fastabend@...il.com>,
        Alexei Starovoitov <ast@...com>,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Netdev <netdev@...r.kernel.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        michael.lundkvist@...csson.com,
        "Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
        "Singhai, Anjali" <anjali.singhai@...el.com>,
        "Zhang, Qi Z" <qi.z.zhang@...el.com>
Subject: Re: [PATCH bpf-next 00/15] Introducing AF_XDP support

2018-04-24 1:22 GMT+02:00 Michael S. Tsirkin <mst@...hat.com>:
> On Mon, Apr 23, 2018 at 03:56:04PM +0200, Björn Töpel wrote:
>> From: Björn Töpel <bjorn.topel@...el.com>
>>
>> This RFC introduces a new address family called AF_XDP that is
>> optimized for high performance packet processing and, in upcoming
>> patch sets, zero-copy semantics. In this v2 version, we have removed
>> all zero-copy related code in order to make it smaller, simpler and
>> hopefully more review friendly. This RFC only supports copy-mode for
>> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
>> using the XDP_DRV path. Zero-copy support requires XDP and driver
>> changes that Jesper Dangaard Brouer is working on. Some of his work
>> has already been accepted. We will publish our zero-copy support for
>> RX and TX on top of his patch sets at a later point in time.
>>
>> An AF_XDP socket (XSK) is created with the normal socket()
>> syscall. Associated with each XSK are two queues: the RX queue and the
>> TX queue. A socket can receive packets on the RX queue and it can send
>> packets on the TX queue. These queues are registered and sized with
>> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
>> mandatory to have at least one of these queues for each socket. In
>> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
>> packet buffers. An RX or TX descriptor points to a data buffer in a
>> memory area called a UMEM. RX and TX can share the same UMEM so that a
>> packet does not have to be copied between RX and TX. Moreover, if a
>> packet needs to be kept for a while due to a possible retransmit, the
>> descriptor that points to that packet can be changed to point to
>> another and reused right away. This again avoids copying data.
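
[Editorial sketch, not code from the patch set.] The split between descriptors and packet buffers described above can be modelled in a few lines of C. The struct names, the 2048-byte frame size and the helper functions are assumptions made for illustration; the real uapi definitions live in include/uapi/linux/if_xdp.h of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes, chosen only for this illustration. */
#define FRAME_SIZE 2048u
#define NUM_FRAMES 4u

/* Packet buffer area owned by user space. */
struct umem {
    uint8_t *area;
    uint32_t frame_size;
    uint32_t num_frames;
};

/* A descriptor carries no packet data, only a reference to a frame. */
struct desc {
    uint32_t frame_id; /* which UMEM frame holds the packet */
    uint32_t len;      /* number of valid bytes in that frame */
};

/* Translate a frame id into the address of its buffer inside the UMEM. */
static uint8_t *umem_frame(const struct umem *u, uint32_t frame_id)
{
    assert(frame_id < u->num_frames);
    return u->area + (size_t)frame_id * u->frame_size;
}

/* Forwarding RX to TX copies the small descriptor, never the payload. */
static struct desc forward_rx_to_tx(const struct desc *rx)
{
    return *rx;
}

/*
 * Keep a packet around for a possible retransmit: the descriptor slot is
 * pointed at a spare frame and can be reused right away. The return value
 * is the id of the frame that still holds the kept packet.
 */
static uint32_t repoint(struct desc *d, uint32_t spare_frame_id)
{
    uint32_t kept = d->frame_id;

    d->frame_id = spare_frame_id;
    d->len = 0;
    return kept;
}
```

In this model a forwarded packet stays at the same UMEM address on both the RX and the TX side, which is the property the paragraph relies on.
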
>>
>> This new dedicated packet buffer area is called a UMEM. It consists of a
>> number of equally sized frames and each frame has a unique frame id. A
>> descriptor in one of the queues references a frame by referencing its
>> frame id. The user space allocates memory for this UMEM using whatever
>> means it feels is most appropriate (malloc, mmap, huge pages,
>> etc). This memory area is then registered with the kernel using the new
>> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
>> and the COMPLETION queue. The fill queue is used by the application to
>> send down frame ids for the kernel to fill in with RX packet
>> data. References to these frames will then appear in the RX queue of
>> the XSK once they have been received. The completion queue, on the
>> other hand, contains frame ids that the kernel has transmitted
>> completely and can now be used again by user space, for either TX or
>> RX. Thus, the frame ids appearing in the completion queue are ids that
>> were previously transmitted using the TX queue. In summary, the RX and
>> FILL queues are used for the RX path and the TX and COMPLETION queues
>> are used for the TX path.
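
[Editorial sketch, not code from the patch set.] The life cycle of a frame id across the four queues can be traced with a small user-space model. Everything here is hypothetical: the ring is a plain single-threaded array, and the two model_kernel_* functions only stand in for what the kernel side is described as doing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical; must be a power of two */
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be 2^n");

/* Ring of frame ids with free-running producer and consumer counters. */
struct ring {
    uint32_t ids[RING_SIZE];
    uint32_t prod;
    uint32_t cons;
};

static bool ring_full(const struct ring *r)  { return r->prod - r->cons == RING_SIZE; }
static bool ring_empty(const struct ring *r) { return r->prod == r->cons; }

static bool ring_push(struct ring *r, uint32_t id)
{
    if (ring_full(r))
        return false;
    r->ids[r->prod++ & (RING_SIZE - 1)] = id;
    return true;
}

static bool ring_pop(struct ring *r, uint32_t *id)
{
    if (ring_empty(r))
        return false;
    *id = r->ids[r->cons++ & (RING_SIZE - 1)];
    return true;
}

/* FILL and COMPLETION belong to the UMEM, RX and TX to the socket. */
struct xsk_model {
    struct ring fill, rx, tx, completion;
};

/* Stand-in for packet arrival: take a frame from FILL, publish it on RX. */
static bool model_kernel_receive(struct xsk_model *m)
{
    uint32_t id;

    if (ring_empty(&m->fill) || ring_full(&m->rx))
        return false; /* no buffer or no room: the packet is lost */
    ring_pop(&m->fill, &id);
    return ring_push(&m->rx, id);
}

/* Stand-in for transmission: take a frame from TX, report it on COMPLETION. */
static bool model_kernel_transmit(struct xsk_model *m)
{
    uint32_t id;

    if (ring_empty(&m->tx) || ring_full(&m->completion))
        return false;
    ring_pop(&m->tx, &id);
    return ring_push(&m->completion, id);
}
```

A frame id pushed to FILL comes back on RX after a receive, and the same id pushed to TX comes back on COMPLETION after a transmit, at which point user space may hand it to either FILL or TX again.
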
>>
>> The socket is then finally bound with a bind() call to a device and a
>> specific queue id on that device, and it is not until bind is
>> completed that traffic starts to flow. Note that in this RFC, all
>> packet data is copied out to user-space.
>>
>> A new feature in this RFC is that the UMEM can be shared between
>> processes, if desired. If a process wants to do this, it simply skips
>> the registration of the UMEM and its corresponding two queues, sets a
>> flag in the bind call and submits the XSK of the process it would like
>> to share UMEM with as well as its own newly created XSK socket. The
>> new process will then receive frame id references in its own RX queue
>> that point to this shared UMEM. Note that since the queue structures
>> are single-consumer / single-producer (for performance reasons), the
>> new process has to create its own socket with associated RX and TX
>> queues, since it cannot share this with the other process. This is
>> also the reason that there is only one set of FILL and COMPLETION
>> queues per UMEM. It is the responsibility of a single process to
>> handle the UMEM. If multiple-producer / multiple-consumer queues are
>> implemented in the future, this requirement could be relaxed.
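
[Editorial sketch, not code from the patch set.] The single-producer / single-consumer constraint is what lets a ring work without locks: each index has exactly one writer. A minimal C11 version, with hypothetical names and a hypothetical size of four slots, could look like this. It is not the layout used in net/xdp/xsk_queue.h.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPSC_SIZE 4u /* hypothetical; must be a power of two */
static_assert((SPSC_SIZE & (SPSC_SIZE - 1)) == 0, "SPSC_SIZE must be 2^n");

struct spsc_ring {
    _Atomic uint32_t producer; /* written only by the producer side */
    _Atomic uint32_t consumer; /* written only by the consumer side */
    uint32_t slots[SPSC_SIZE];
};

static void spsc_init(struct spsc_ring *r)
{
    atomic_init(&r->producer, 0);
    atomic_init(&r->consumer, 0);
}

/* Producer side: fill the slot first, then publish the new index. */
static bool spsc_produce(struct spsc_ring *r, uint32_t v)
{
    uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
    uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

    if (prod - cons == SPSC_SIZE)
        return false; /* ring is full */
    r->slots[prod & (SPSC_SIZE - 1)] = v;
    atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
    return true;
}

/* Consumer side: read the slot first, then release it to the producer. */
static bool spsc_consume(struct spsc_ring *r, uint32_t *v)
{
    uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
    uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

    if (prod == cons)
        return false; /* ring is empty */
    *v = r->slots[cons & (SPSC_SIZE - 1)];
    atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
    return true;
}
```

With two producers, both could read the same producer index and write the same slot, which is why a second process needs its own RX and TX rings in this design.
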
>>
>> How are packets then distributed between these two XSKs? We have
>> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
>> full). The user-space application can place an XSK at an arbitrary
>> place in this map. The XDP program can then redirect a packet to a
>> specific index in this map and at this point XDP validates that the
>> XSK in that map was indeed bound to that device and queue number. If
>> not, the packet is dropped. If the map is empty at that index, the
>> packet is also dropped. This also means that it is currently mandatory
>> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
>> to get any traffic to user space through the XSK.
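
[Editorial sketch, not code from the patch set.] The redirect rules in the paragraph above reduce to three checks, modelled here in plain user-space C. The types, the four-entry map and the function names are hypothetical and are not the BPF helper or the code in kernel/bpf/xskmap.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_ENTRIES 4u /* hypothetical map size */

/* What an XSK was bound to with bind(): a device and a queue on it. */
struct xsk_binding {
    int ifindex;
    uint32_t queue_id;
};

enum verdict { VERDICT_DROP, VERDICT_DELIVER };

/* Model of the map: each slot is either empty (NULL) or holds one XSK. */
struct xskmap_model {
    const struct xsk_binding *slot[MAP_ENTRIES];
};

/* User space places an XSK at an arbitrary index in the map. */
static void xskmap_set(struct xskmap_model *map, uint32_t index,
                       const struct xsk_binding *xs)
{
    assert(index < MAP_ENTRIES);
    map->slot[index] = xs;
}

/*
 * Model of a redirect to map[index] for a packet that arrived on
 * (rx_ifindex, rx_queue_id).
 */
static enum verdict model_redirect(const struct xskmap_model *map,
                                   uint32_t index, int rx_ifindex,
                                   uint32_t rx_queue_id)
{
    const struct xsk_binding *xs;

    if (index >= MAP_ENTRIES)
        return VERDICT_DROP;
    xs = map->slot[index];
    if (xs == NULL)
        return VERDICT_DROP; /* the map is empty at that index */
    if (xs->ifindex != rx_ifindex || xs->queue_id != rx_queue_id)
        return VERDICT_DROP; /* XSK is bound to another device or queue */
    return VERDICT_DELIVER;
}
```

Only a packet that arrives on the exact device and queue the XSK was bound to reaches user space; every other case ends in a drop.
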
>>
>> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
>> driver does not have support for XDP, or XDP_SKB is explicitly chosen
>> when loading the XDP program, XDP_SKB mode is employed that uses SKBs
>> together with the generic XDP support and copies out the data to user
>> space. This is a fallback mode that works for any network device. On the other
>> hand, if the driver has support for XDP, it will be used by the AF_XDP
>> code to provide better performance, but there is still a copy of the
>> data into user space.
>>
>> There is a xdpsock benchmarking/test application included that
>> demonstrates how to use AF_XDP sockets with both private and shared
>> UMEMs. Say that you would like your UDP traffic from port 4242 to end
>> up in queue 16, on which we will enable AF_XDP. Here, we use ethtool
>> for this:
>>
>>       ethtool -N p3p2 rx-flow-hash udp4 fn
>>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>>           action 16
>>
>> Running the rxdrop benchmark in XDP_DRV mode can then be done
>> using:
>>
>>       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
>>
>> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
>> can be displayed with "-h", as usual.
>>
>> We have run some benchmarks on a dual socket system with two Broadwell
>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>> cores which gives a total of 28, but only two cores are used in these
>> experiments. One for TX/RX and one for the user space application. The
>> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
>> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
>> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
>> Intel I40E 40Gbit/s using the i40e driver.
>>
>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>> and 1500 byte packets, generated by a commercial HW packet generator
>> running at full 40 Gbit/s line rate.
>>
>> AF_XDP performance 64 byte packets. Results from RFC V2 in parentheses.
>> Benchmark   XDP_SKB    XDP_DRV
>> rxdrop      2.9(3.0)   9.4(9.3)
>> txpush      2.5(2.2)   NA*
>> l2fwd       1.9(1.7)   2.4(2.4) (TX using XDP_SKB in both cases)
>>
>> AF_XDP performance 1500 byte packets:
>> Benchmark   XDP_SKB    XDP_DRV
>> rxdrop      2.1(2.2)   3.3(3.1)
>> l2fwd       1.4(1.1)   1.8(1.7) (TX using XDP_SKB in both cases)
>>
>> * NA since we have no support for TX using the XDP_DRV infrastructure
>>   in this RFC. This is for a future patch set since it involves
>>   changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>>   Dangaard Brouer.
>>
>> XDP performance on our system as a base line:
>>
>> 64 byte packets:
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      16      32,921,521  0
>>
>> 1500 byte packets:
>> XDP stats       CPU     pps         issue-pps
>> XDP-RX CPU      16      3,289,491   0
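
[Editorial sketch, not part of the original posting.] To put the Mpps figures in context, the theoretical packet rate of a 40 Gbit/s link can be estimated. This assumes that the quoted packet sizes are Ethernet frame sizes including the FCS and that each frame costs 20 extra bytes on the wire (preamble, start-of-frame delimiter and inter-frame gap); the posting does not state either. Under those assumptions the estimate is roughly 59.5 Mpps for 64-byte frames and roughly 3.29 Mpps for 1500-byte frames, so the 1500-byte XDP baseline above would be at line rate.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: 7 bytes preamble + 1 byte start-of-frame delimiter +
 * 12 bytes inter-frame gap are added to every frame on the wire.
 */
#define WIRE_OVERHEAD_BYTES 20u

/* Maximum packets per second for a given link speed and frame size. */
static double line_rate_pps(double link_bits_per_sec, uint32_t frame_bytes)
{
    assert(frame_bytes > 0);
    return link_bits_per_sec / (8.0 * (frame_bytes + WIRE_OVERHEAD_BYTES));
}

/* Measured rate in Mpps expressed as a fraction of the line rate. */
static double fraction_of_line_rate(double measured_mpps,
                                    double link_bits_per_sec,
                                    uint32_t frame_bytes)
{
    return measured_mpps * 1e6 /
           line_rate_pps(link_bits_per_sec, frame_bytes);
}
```

For example, fraction_of_line_rate(9.4, 40e9, 64) puts the 64-byte rxdrop result in XDP_DRV mode at a little under 16% of the estimated line rate.
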
>>
>> Changes from RFC V2:
>>
>> * Optimizations and simplifications to the ring structures inspired by
>>   ptr_ring.h
>> * Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
>>   consistent with AF_PACKET
>> * Support for only having an RX queue or a TX queue defined
>> * Some bug fixes and code cleanup
>>
>> The structure of the patch set is as follows:
>>
>> Patches 1-2: Basic socket and umem plumbing
>> Patches 3-10: RX support together with the new XSKMAP
>> Patches 11-14: TX support
>> Patch 15: Sample application
>>
>> We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
>> Clean up btf.h in uapi")
>>
>> Questions:
>>
>> * How to deal with cache alignment for uapi when different
>>   architectures can have different cache line sizes? We have just
>>   aligned it to 64 bytes for now, which works for many popular
>>   architectures, but not all. Please advise.
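
[Editorial sketch, not code from the patch set.] The concern behind this question can be shown with a small C11 example. The struct below is hypothetical and is not the ring layout from the series: it pads the producer and consumer indices to 64 bytes, which separates them on machines with 64-byte cache lines but not on machines with 128-byte lines.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_CACHELINE 64u /* the value the cover letter settled on */

/* Hypothetical ring header: each index gets its own 64-byte slot. */
struct ring_header {
    alignas(ASSUMED_CACHELINE) uint32_t producer;
    alignas(ASSUMED_CACHELINE) uint32_t consumer;
};

static_assert(offsetof(struct ring_header, consumer) -
              offsetof(struct ring_header, producer) >= ASSUMED_CACHELINE,
              "indices must be at least one assumed cache line apart");

/*
 * Do two offsets fall into the same cache line? Assumes the enclosing
 * object itself starts on a line_size boundary.
 */
static int shares_cache_line(size_t off_a, size_t off_b, size_t line_size)
{
    assert(line_size > 0);
    return off_a / line_size == off_b / line_size;
}
```

With line_size set to 128 the two indices land in the same line again, so a producer and a consumer on different cores would still contend for it on such an architecture.
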
>>
>> To do:
>>
>> * Optimize performance
>>
>> * Kernel selftest
>>
>> Post-series plan:
>>
>> * Loadable kernel module support for AF_XDP would be nice. Unclear how to
>>   achieve this though since our XDP code depends on net/core.
>>
>> * Support for AF_XDP sockets without an XDP program loaded. In this
>>   case all the traffic on a queue should go up to the user space socket.
>>
>> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
>>   XDP_PASS" for a tcpdump-like functionality.
>>
>> * And of course getting to zero-copy support in small increments.
>>
>> Thanks: Björn and Magnus
>>
>> Björn Töpel (8):
>>   net: initial AF_XDP skeleton
>>   xsk: add user memory registration support sockopt
>>   xsk: add Rx queue setup and mmap support
>>   xdp: introduce xdp_return_buff API
>>   xsk: add Rx receive functions and poll support
>>   bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>>   xsk: wire up XDP_DRV side of AF_XDP
>>   xsk: wire up XDP_SKB side of AF_XDP
>>
>> Magnus Karlsson (7):
>>   xsk: add umem fill queue support and mmap
>>   xsk: add support for bind for Rx
>>   xsk: add umem completion queue support and mmap
>>   xsk: add Tx queue setup and mmap support
>>   xsk: support for Tx
>>   xsk: statistics support
>>   samples/bpf: sample application for AF_XDP sockets
>>
>>  MAINTAINERS                         |   8 +
>>  include/linux/bpf.h                 |  26 +
>>  include/linux/bpf_types.h           |   3 +
>>  include/linux/filter.h              |   2 +-
>>  include/linux/socket.h              |   5 +-
>>  include/net/xdp.h                   |   1 +
>>  include/net/xdp_sock.h              |  46 ++
>>  include/uapi/linux/bpf.h            |   1 +
>>  include/uapi/linux/if_xdp.h         |  87 ++++
>>  kernel/bpf/Makefile                 |   3 +
>>  kernel/bpf/verifier.c               |   8 +-
>>  kernel/bpf/xskmap.c                 | 286 +++++++++++
>>  net/Kconfig                         |   1 +
>>  net/Makefile                        |   1 +
>>  net/core/dev.c                      |  34 +-
>>  net/core/filter.c                   |  40 +-
>>  net/core/sock.c                     |  12 +-
>>  net/core/xdp.c                      |  15 +-
>>  net/xdp/Kconfig                     |   7 +
>>  net/xdp/Makefile                    |   2 +
>>  net/xdp/xdp_umem.c                  | 256 ++++++++++
>>  net/xdp/xdp_umem.h                  |  65 +++
>>  net/xdp/xdp_umem_props.h            |  23 +
>>  net/xdp/xsk.c                       | 704 +++++++++++++++++++++++++++
>>  net/xdp/xsk_queue.c                 |  73 +++
>>  net/xdp/xsk_queue.h                 | 245 ++++++++++
>>  samples/bpf/Makefile                |   4 +
>>  samples/bpf/xdpsock.h               |  11 +
>>  samples/bpf/xdpsock_kern.c          |  56 +++
>>  samples/bpf/xdpsock_user.c          | 947 ++++++++++++++++++++++++++++++++++++
>>  security/selinux/hooks.c            |   4 +-
>>  security/selinux/include/classmap.h |   4 +-
>>  32 files changed, 2945 insertions(+), 35 deletions(-)
>>  create mode 100644 include/net/xdp_sock.h
>>  create mode 100644 include/uapi/linux/if_xdp.h
>>  create mode 100644 kernel/bpf/xskmap.c
>>  create mode 100644 net/xdp/Kconfig
>>  create mode 100644 net/xdp/Makefile
>>  create mode 100644 net/xdp/xdp_umem.c
>>  create mode 100644 net/xdp/xdp_umem.h
>>  create mode 100644 net/xdp/xdp_umem_props.h
>>  create mode 100644 net/xdp/xsk.c
>>  create mode 100644 net/xdp/xsk_queue.c
>>  create mode 100644 net/xdp/xsk_queue.h
>>  create mode 100644 samples/bpf/xdpsock.h
>>  create mode 100644 samples/bpf/xdpsock_kern.c
>>  create mode 100644 samples/bpf/xdpsock_user.c
>
> Is there a chance of Documentation/networking/af_xdp.txt ?
>

Yes. :-) We'll add that to the next spin!

>
>>
>> --
>> 2.14.1
