netdev - Re: [RFC PATCH 00/24] Introducing AF

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJ+HfNih3KQdodh5MXr7hkGTjwCmKZyeKw-ZX0tqGH_tNd3_Eg@mail.gmail.com>
Date:   Mon, 5 Feb 2018 16:05:28 +0100
From:   Björn Töpel <bjorn.topel@...il.com>
To:     Bjorn Topel <bjorn.topel@...il.com>,
        "Karlsson, Magnus" <magnus.karlsson@...el.com>,
        "Duyck, Alexander H" <alexander.h.duyck@...el.com>,
        Alexander Duyck <alexander.duyck@...il.com>,
        John Fastabend <john.fastabend@...il.com>,
        Alexei Starovoitov <ast@...com>,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Netdev <netdev@...r.kernel.org>
Cc:     Björn Töpel <bjorn.topel@...el.com>,
        michael.lundkvist@...csson.com,
        "Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
        "Singhai, Anjali" <anjali.singhai@...el.com>,
        "Shaw, Jeffrey B" <jeffrey.b.shaw@...el.com>,
        "Yigit, Ferruh" <ferruh.yigit@...el.com>,
        "Zhang, Qi Z" <qi.z.zhang@...el.com>
Subject: Re: [RFC PATCH 00/24] Introducing AF_XDP support

2018-01-31 14:53 GMT+01:00 Björn Töpel <bjorn.topel@...il.com>:
> From: Björn Töpel <bjorn.topel@...el.com>
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and zero-copy
> semantics. Throughput improvements can be up to 20x compared to V2 and
> V3 for the micro benchmarks included. Would be great to get your
> feedback on it. Note that this is the follow up RFC to AF_PACKET V4
> from November last year. The feedback from that RFC submission and the
> presentation at NetdevConf in Seoul was to create a new address family
> instead of building on top of AF_PACKET. AF_XDP is this new address
> family.
>
> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
> level is that TX and RX descriptors are separated from packet
> buffers. An RX or TX descriptor points to a data buffer in a packet
> buffer area. RX and TX can share the same packet buffer so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, then
> the descriptor that points to that packet buffer can be changed to
> point to another buffer and reused right away. This again avoids
> copying data.
>
> The RX and TX descriptor rings are registered with the setsockopts
> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
> area is allocated by user space and registered with the kernel using
> the new XDP_MEM_REG setsockopt. All these three areas are shared
> between user space and kernel space. The socket is then bound with a
> bind() call to a device and a specific queue id on that device, and it
> is not until bind is completed that traffic starts to flow.
>
> An XDP program can be loaded to direct part of the traffic on that
> device and queue id to user space through a new redirect action in an
> XDP program called bpf_xdpsk_redirect that redirects a packet up to
> the socket in user space. All the other XDP actions work just as
> before. Note that the current RFC requires the user to load an XDP
> program to get any traffic to user space (for example all traffic to
> user space with the one-liner program "return
> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
> this requirement and sends all traffic from a queue to user space if
> an AF_XDP socket is bound to it.
>

We realized, a bit late maybe, that 24 patches is a bit mouthful, so
let me try to make it more palatable.

Patch 1 to 7 introduce AF_XDP socket support with copy semantics
(require no driver changes). Patch 8 adds XDP_REDIRECT support to i40e
and patch 9 is the test application.

The rest of the patches are enabling zero-copy support, and they're
messier. So, if you don't really care about zero-copy, just have a
look at 1 to 7.

We'd really appreciate your thoughts on the user space APIs (including
the bpf APIs).

For the next review, we'll keep the set smaller, and introduce many of
the i40e patches as pre-patches.


Björn


> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
> is no specific mode called XDP_DRV_ZC). If the driver does not have
> support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
> program, XDP_SKB mode is employed that uses SKBs together with the
> generic XDP support and copies out the data to user space. A fallback
> mode that works for any network device. On the other hand, if the
> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
> ndo_xdp_flush), these NDOs, without any modifications, will be used by
> the AF_XDP code to provide better performance, but there is still a
> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
> driver support with the zero-copy user space allocator that provides
> even better performance. In this mode, the networking HW (or SW driver
> if it is a virtual driver like veth) DMAs/puts packets straight into
> the packet buffer that is shared between user space and kernel
> space. The RX and TX descriptor queues of the networking HW are NOT
> shared to user space. Only the kernel can read and write these and it
> is the kernel driver's responsibility to translate these HW specific
> descriptors to the HW agnostic ones in the virtual descriptor rings
> that user space sees. This way, a malicious user space program cannot
> mess with the networking HW. This mode though requires some extensions
> to XDP.
>
> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
> buffer pool concept so that the same XDP driver code can be used for
> buffers allocated using the page allocator (XDP_DRV), the user-space
> zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
> allocator/cache/recycling mechanism. The ndo_bpf call has also been
> extended with two commands for registering and unregistering an XSK
> socket and is in the RX case mainly used to communicate some
> information about the user-space buffer pool to the driver.
>
> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
> but we run into problems with this (further discussion in the
> challenges section) and had to introduce a new NDO called
> ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
> and an explicit queue id that packets should be sent out on. In
> contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
> sent from the xdp socket (associated with the dev and queue
> combination that was provided with the NDO call) using a callback
> (get_tx_packet), and when they have been transmitted it uses another
> callback (tx_completion) to signal completion of packets. These
> callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
> command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
> and thus does not clash with the XDP_REDIRECT use of
> ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
> (without ZC) is currently not supported by TX. Please have a look at
> the challenges section for further discussions.
>
> The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
> so the user needs to steer the traffic to the zero-copy enabled queue
> pair. Which queue to use, is up to the user.
>
> For an untrusted application, HW packet steering to a specific queue
> pair (the one associated with the application) is a requirement, as
> the application would otherwise be able to see other user space
> processes' packets. If the HW cannot support the required packet
> steering, XDP_DRV or XDP_SKB mode have to be used as they do not
> expose the NIC's packet buffer into user space as the packets are
> copied into user space from the NIC's packet buffer in the kernel.
>
> There is a xdpsock benchmarking/test application included. Say that
> you would like your UDP traffic from port 4242 to end up in queue 16,
> that we will enable AF_XDP on. Here, we use ethtool for this:
>
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
>
> Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
>
>       samples/bpf/xdpsock -i p3p2 -q 16 -l -N
>
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> byte packets, generated by commercial packet generator HW that is
> generating packets at full 40 Gbit/s line rate.
>
> XDP baseline numbers without this RFC:
> xdp_rxq_info --action XDP_DROP 31.3 Mpps
> xdp_rxq_info --action XDP_TX   16.7 Mpps
>
> XDP performance with this RFC i.e. with the buffer allocator:
> XDP_DROP 21.0 Mpps
> XDP_TX   11.9 Mpps
>
> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
> Benchmark   V2     V3     V4     V4+ZC
> rxdrop      0.67   0.73   0.74   33.7
> txpush      0.98   0.98   0.91   19.6
> l2fwd       0.66   0.71   0.67   15.5
>
> AF_XDP performance:
> Benchmark   XDP_SKB   XDP_DRV    XDP_DRV_ZC (all in Mpps)
> rxdrop      3.3        11.6         16.9
> txpush      2.2         NA*         21.8
> l2fwd       1.7         NA*         10.4
>
> * NA since there is no XDP_DRV mode (without ZC) for TX in this RFC,
>   see challenges below.
>
> If we start by comparing XDP_SKB performance with copy mode in
> AF_PACKET V4, we can see that AF_XDP delivers 3-5 times the
> throughput, which is positive. We are also happy with the XDP_DRV
> performance that provides 11.6 Mpps for rxdrop, and should work on any
> driver implementing full XDP support. Now to the problematic part:
> XDP_DRV_ZC. The txpush (TX only) benchmark shows decent results at
> 21.8 Mpps and is better than it was for V4, even though we have spent
> no time optimizing the code in AF_XDP. (We did that in AF_PACKET V4.)
> But the RX performance is sliced by half, which is not good. The
> reason for this is, for the major part, the new buffer allocator which
> is used for RX ZC only (at this point, see todo section). If you take
> a look at the XDP baseline numbers, introducing the buffer pool
> allocator drops the performance by around 30% or 10 Mpps which is
> obviously not acceptable. We clearly need to give this code some
> overdue performance love. But the overhanging question is how much
> overhead it will produce in the end and if this will be
> acceptable. Another thing to note is that V4 provided 33.7 Mpps for
> rxdrop, but with AF_XDP we are quite unlikely to get above the
> XDP_DROP number of 31.3, since we are reusing the XDP infrastructure
> and driver code on the RX side. So in the end, the AF_XDP XDP_DRV_ZC
> numbers will likely be lower than the V4 ZC numbers.
>
> We based this patch set on net-next commit 91e6dd828425 ("ipmr: Fix
> ptrdiff_t print formatting").
>
> Challenges: areas we would really appreciate your help on and that we
> are having substantial problems with.
>
> * We would like to, if possible, use ndo_xdp_xmit and ndo_xdp_flush
>   instead of introducing another NDO in the form of
>   ndo_xdp_xmit_xsk. The first reason behind our ineptitude to be able
>   to accomplish this is that if both paths use ndo_xdp_xmit, they will
>   create a race as ndo_xdp_xmit currently does not contain any
>   locking. How to implement some type of mutual exclusion here without
>   resorting to slowing down the NDO with a lock? The second problem is
>   that the XDP_REDIRECT code implicitly assumes that core id = queue
>   id. AF_XDP, on the other hand, explicitly specifies a queue id that
>   has nothing to do with the core id (the application can run on any
>   core). How to reconcile these two views in one ndo? If these two
>   problems can be solved, then we would still need to introduce a
>   completion callback and a get_packet callback, but this seems to be
>   less challenging. This would also make it possible to run TX in the
>   XDP_DRV mode (with the default page allocator).
>
> * What should the buffer allocator look like and how to make it
>   generic enough so it can be used by all NIC vendors? Would be great
>   if you could take a look at it and come with suggestions. As you can
>   see from the change log, it took some effort to rewire the i40e code
>   to use the buff pool, and we expect the same to be true for many
>   other NICs. Ideas on how to introduce multiple allocator into XDP in
>   a less intrusive way would be highly appreciated. Another question
>   is how to create a buffer pool that gives rise to very little
>   overhead? We do not know if the current one can be optimized to have
>   an acceptable overhead as we have not started any optimization
>   effort. But we will give it a try during the next week or so to see
>   where it leads.
>
> * In this RFC, do not use an XDP_REDIRECT action other than
>   bpf_xdpsk_redirect for XDP_DRV_ZC. This is because a zero-copy
>   allocated buffer will then be sent to a cpu id / queue_pair through
>   ndo_xdp_xmit that does not know this has been ZC allocated. It will
>   then do a page_free on it and you will get a crash. How to extend
>   ndo_xdp_xmit with some free/completion function that could be called
>   instead of page_free?  Hopefully, the same solution can be used here
>   as in the first problem item in this section.
>
> Caveats with this RFC. In contrast to the last section, we believe we
> have solutions for these but we did not have time to fix them. We
> chose to show you all the code sooner than later, even though
> everything does not work. Sorry.
>
> * This RFC is more immature (read, has more bugs) than the AF_PACKET
>   V4 RFC. Some known mentioned here, others unknown.
>
> * We have done absolutely no optimization to this RFC. There is
>   (hopefully) some substantial low hanging fruit that we could fix
>   once we get to this, to improve XDP_DRV_ZC performance to levels
>   that we are not ashamed of and also bring the i40e driver to the
>   same performance levels it had before our changes, which is a must.
>
> * There is a race in the TX XSK clean up code in the i40e driver that
>   triggers a WARN_ON_ONCE. Clearly a bug that needs to be fixed. It
>   can be triggered by performing ifdown/ifup when the application is
>   running, or when changing the number of queues of the device
>   underneath the hood of the application. As a workaround, please
>   refrain from doing these two things without restarting the
>   application, as not all buffers will be returned in the TX
>   path. This bug can also be triggered when killing the application,
>   but has no negative effect in this case as the process will never
>   execute again.
>
> * Before this RFC, ndo_xdp_xmit triggered by an XDP_REDIRECT to a NIC
>   never modified the page count, so the redirect code could assume
>   that the page would still be valid after the NDO call. With the
>   introduction of the xsk_rcv path that is called as a result of an
>   XDP_REDIRECT to an AF_XDP socket, the page count will be decreased
>   if the page is copied out to user space, since we have no use for it
>   anymore. Our somewhat blunt solution to this is to make sure in the
>   i40e driver that the refcount is never under two. Note though, that
>   with the introduction of the buffer pool, this problem
>   disappears. This also means that XDP_DRV will not work out of the
>   box with a Niantic NIC, since it also needs this modification to
>   work. One question that we have is what should the semantics of
>   ndo_xdp_xmit be? Can we always assume that the page count will never
>   be changed by all possible netdevices that implement this NDO, or
>   should we remove this assumption to gain more device implementation
>   flexibility?
>
> To do:
>
> * Optimize performance. No optimization whatsoever was performed on
>   this RFC, in contrast to the previous one for AF_PACKET V4.
>
> * Kernel load module support.
>
> * Polling has not been implemented yet.
>
> * Optimize the user space sample application. It is simple but naive
>   at this point. The one for AF_PACKET V4 had a number of
>   optimizations that have not been introduced in the AF_XDP version.
>
> * Implement a way to pick the XDP_DRV mode even if XDP_DRV_ZC is
>   available. Would be nice to have for the sample application too.
>
> * Introduce a notifier chain for queue changes (caused by ethtool for
>   example). This would get rid of the error callback that we have at
>   this point.
>
> * Use one NAPI context for RX and another one for TX in i40e. This
>   would make it possible to run RX on one core and TX on another for
>   better performance. Today, they need to share a single core since
>   they share NAPI context.
>
> * Get rid of packet arrays (PA) and convert them to the buffer pool
>   allocator by transferring the necessary PA functionality into the
>   buffer pool. This has only been done for RX in ZC mode, while all
>   the other modes are still using packet arrays. Clearly, having two
>   structures with largely the same information is not a good thing.
>
> * Support for AF_XDP sockets without an XPD program loaded. In this
>   case all the traffic on a queue should go up to user space.
>
> * Support shared packet buffers
>
> * Support for packets spanning multiple frames
>
> Thanks: Björn and Magnus
>
> Björn Töpel (16):
>   xsk: AF_XDP sockets buildable skeleton
>   xsk: add user memory registration sockopt
>   xsk: added XDP_{R,T}X_RING sockopt and supporting structures
>   bpf: added bpf_xdpsk_redirect
>   net: wire up xsk support in the XDP_REDIRECT path
>   i40e: add support for XDP_REDIRECT
>   samples/bpf: added xdpsock program
>   xsk: add iterator functions to xsk_ring
>   i40e: introduce external allocator support
>   i40e: implemented page recycling buff_pool
>   i40e: start using recycling buff_pool
>   i40e: separated buff_pool interface from i40e implementaion
>   xsk: introduce xsk_buff_pool
>   xdp: added buff_pool support to struct xdp_buff
>   xsk: add support for zero copy Rx
>   i40e: implement xsk sub-commands in ndo_bpf for zero copy Rx
>
> Magnus Karlsson (8):
>   xsk: add bind support and introduce Rx functionality
>   xsk: introduce Tx functionality
>   netdevice: added XDP_{UN,}REGISTER_XSK command to ndo_bpf
>   netdevice: added ndo for transmitting a packet from an XDP socket
>   xsk: add support for zero copy Tx
>   i40e: introduced a clean_tx callback function
>   i40e: introduced Tx completion callbacks
>   i40e: Tx support for zero copy allocator
>
>  drivers/net/ethernet/intel/i40e/Makefile         |    3 +-
>  drivers/net/ethernet/intel/i40e/i40e.h           |   24 +
>  drivers/net/ethernet/intel/i40e/i40e_buff_pool.c |  580 +++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_buff_pool.h |   15 +
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c   |    1 -
>  drivers/net/ethernet/intel/i40e/i40e_main.c      |  541 +++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c      |  906 +++++++++--------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h      |  119 ++-
>  include/linux/buff_pool.h                        |  136 +++
>  include/linux/filter.h                           |    3 +-
>  include/linux/netdevice.h                        |   25 +
>  include/linux/socket.h                           |    5 +-
>  include/net/xdp.h                                |    1 +
>  include/net/xdp_sock.h                           |   60 ++
>  include/uapi/linux/bpf.h                         |    6 +-
>  include/uapi/linux/if_xdp.h                      |   72 ++
>  net/Kconfig                                      |    1 +
>  net/Makefile                                     |    1 +
>  net/core/dev.c                                   |   28 +-
>  net/core/filter.c                                |   88 +-
>  net/core/sock.c                                  |   12 +-
>  net/xdp/Kconfig                                  |    7 +
>  net/xdp/Makefile                                 |    1 +
>  net/xdp/xsk.c                                    | 1142 ++++++++++++++++++++++
>  net/xdp/xsk.h                                    |   31 +
>  net/xdp/xsk_buff.h                               |  161 +++
>  net/xdp/xsk_buff_pool.c                          |  225 +++++
>  net/xdp/xsk_buff_pool.h                          |   17 +
>  net/xdp/xsk_packet_array.c                       |   62 ++
>  net/xdp/xsk_packet_array.h                       |  399 ++++++++
>  net/xdp/xsk_ring.c                               |   61 ++
>  net/xdp/xsk_ring.h                               |  419 ++++++++
>  net/xdp/xsk_user_queue.h                         |   24 +
>  samples/bpf/Makefile                             |    4 +
>  samples/bpf/xdpsock_kern.c                       |   11 +
>  samples/bpf/xdpsock_queue.h                      |   62 ++
>  samples/bpf/xdpsock_user.c                       |  642 ++++++++++++
>  security/selinux/hooks.c                         |    4 +-
>  security/selinux/include/classmap.h              |    4 +-
>  tools/testing/selftests/bpf/bpf_helpers.h        |    2 +
>  40 files changed, 5408 insertions(+), 497 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.c
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.h
>  create mode 100644 include/linux/buff_pool.h
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk.h
>  create mode 100644 net/xdp/xsk_buff.h
>  create mode 100644 net/xdp/xsk_buff_pool.c
>  create mode 100644 net/xdp/xsk_buff_pool.h
>  create mode 100644 net/xdp/xsk_packet_array.c
>  create mode 100644 net/xdp/xsk_packet_array.h
>  create mode 100644 net/xdp/xsk_ring.c
>  create mode 100644 net/xdp/xsk_ring.h
>  create mode 100644 net/xdp/xsk_user_queue.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_queue.h
>  create mode 100644 samples/bpf/xdpsock_user.c
>
> --
> 2.14.1
>