Date:   Wed, 7 Feb 2018 09:59:06 -0800
From:   Tom Herbert <tom@...bertland.com>
To:     Björn Töpel <bjorn.topel@...il.com>
Cc:     magnus.karlsson@...el.com,
        Alexander Duyck <alexander.h.duyck@...el.com>,
        Alexander Duyck <alexander.duyck@...il.com>,
        john fastabend <john.fastabend@...il.com>,
        Alexei Starovoitov <ast@...com>,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        michael.lundkvist@...csson.com,
        Jesse Brandeburg <jesse.brandeburg@...el.com>,
        Anjali Singhai Jain <anjali.singhai@...el.com>,
        jeffrey.b.shaw@...el.com, ferruh.yigit@...el.com,
        qi.z.zhang@...el.com
Subject: Re: [RFC PATCH 00/24] Introducing AF_XDP support

On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.topel@...il.com> wrote:
> From: Björn Töpel <bjorn.topel@...el.com>
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and zero-copy
> semantics. Throughput improvements can be up to 20x compared to V2 and
> V3 for the micro benchmarks included. Would be great to get your
> feedback on it. Note that this is the follow up RFC to AF_PACKET V4
> from November last year. The feedback from that RFC submission and the
> presentation at NetdevConf in Seoul was to create a new address family
> instead of building on top of AF_PACKET. AF_XDP is this new address
> family.
>
> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
> level is that TX and RX descriptors are separated from packet
> buffers. An RX or TX descriptor points to a data buffer in a packet
> buffer area. RX and TX can share the same packet buffer so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, then
> the descriptor that points to that packet buffer can be changed to
> point to another buffer and reused right away. This again avoids
> copying data.
>
> The RX and TX descriptor rings are registered with the setsockopts
> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
> area is allocated by user space and registered with the kernel using
> the new XDP_MEM_REG setsockopt. All three areas are shared
> between user space and kernel space. The socket is then bound with a
> bind() call to a device and a specific queue id on that device, and it
> is not until bind is completed that traffic starts to flow.
>
> An XDP program can be loaded to direct part of the traffic on that
> device and queue id to user space through a new redirect action
> called bpf_xdpsk_redirect, which sends a packet up to the socket in
> user space. All the other XDP actions work just as
> before. Note that the current RFC requires the user to load an XDP
> program to get any traffic to user space (for example all traffic to
> user space with the one-liner program "return
> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
> this requirement and sends all traffic from a queue to user space if
> an AF_XDP socket is bound to it.
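As a concrete illustration, the one-liner program mentioned above would look roughly like this as a complete BPF object. This is a sketch built against the RFC's proposed helper: the includes, SEC name, and license section follow common XDP sample conventions and are assumptions beyond what the RFC text states. It is compiled as a BPF object and loaded into the kernel, not run as a normal program:

```c
/* Minimal XDP program per the cover letter: redirect every packet on
 * the bound queue up to the AF_XDP socket. The helper name follows
 * this RFC and may differ in any final version. */
#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
    return bpf_xdpsk_redirect();
}

char _license[] SEC("license") = "GPL";
```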
>
> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
> is no specific mode called XDP_DRV_ZC). If the driver does not have
> support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
> program, XDP_SKB mode is employed. It uses SKBs together with the
> generic XDP support and copies the data out to user space, making it
> a fallback mode that works for any network device. On the other hand, if the
> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
> ndo_xdp_flush), these NDOs, without any modifications, will be used by
> the AF_XDP code to provide better performance, but there is still a
> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
> driver support with the zero-copy user space allocator that provides
> even better performance. In this mode, the networking HW (or SW driver
> if it is a virtual driver like veth) DMAs/puts packets straight into
> the packet buffer that is shared between user space and kernel
> space. The RX and TX descriptor queues of the networking HW are NOT
> shared to user space. Only the kernel can read and write these and it
> is the kernel driver's responsibility to translate these HW specific
> descriptors to the HW agnostic ones in the virtual descriptor rings
> that user space sees. This way, a malicious user space program cannot
> mess with the networking HW. This mode, though, requires some
> extensions to XDP.
>
> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
> buffer pool concept so that the same XDP driver code can be used for
> buffers allocated using the page allocator (XDP_DRV), the user-space
> zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
> allocator/cache/recycling mechanism. The ndo_bpf call has also been
> extended with two commands for registering and unregistering an XSK
> socket and is in the RX case mainly used to communicate some
> information about the user-space buffer pool to the driver.
>
> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
> but we ran into problems with this (further discussion in the
> challenges section) and had to introduce a new NDO called
> ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
> and an explicit queue id that packets should be sent out on. In
> contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
> sent from the xdp socket (associated with the dev and queue
> combination that was provided with the NDO call) using a callback
> (get_tx_packet), and when they have been transmitted it uses another
> callback (tx_completion) to signal completion of packets. These
> callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
> command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
> and thus does not clash with the XDP_REDIRECT use of
> ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
> (without ZC) is not currently supported for TX. Please have a look at
> the challenges section for further discussions.
>
> The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
> so the user needs to steer the traffic to the zero-copy enabled queue
> pair. Which queue to use is up to the user.
>
> For an untrusted application, HW packet steering to a specific queue
> pair (the one associated with the application) is a requirement, as
> the application would otherwise be able to see other user space
> processes' packets. If the HW cannot support the required packet
> steering, XDP_DRV or XDP_SKB mode has to be used, as they do not
> expose the NIC's packet buffer into user space as the packets are
> copied into user space from the NIC's packet buffer in the kernel.
>
> There is an xdpsock benchmarking/test application included. Say that
> you would like your UDP traffic from port 4242 to end up in queue 16,
> which we will enable AF_XDP on. Here, we use ethtool for this:
>
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
>
> Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
>
>       samples/bpf/xdpsock -i p3p2 -q 16 -l -N
>
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TX/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> byte packets, generated by commercial packet generator HW that is
> generating packets at full 40 Gbit/s line rate.
>
> XDP baseline numbers without this RFC:
> xdp_rxq_info --action XDP_DROP 31.3 Mpps
> xdp_rxq_info --action XDP_TX   16.7 Mpps
>
> XDP performance with this RFC i.e. with the buffer allocator:
> XDP_DROP 21.0 Mpps
> XDP_TX   11.9 Mpps
>
> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
> Benchmark   V2     V3     V4     V4+ZC
> rxdrop      0.67   0.73   0.74   33.7
> txpush      0.98   0.98   0.91   19.6
> l2fwd       0.66   0.71   0.67   15.5
>
> AF_XDP performance:
> Benchmark   XDP_SKB   XDP_DRV    XDP_DRV_ZC (all in Mpps)
> rxdrop      3.3        11.6         16.9
> txpush      2.2         NA*         21.8
> l2fwd       1.7         NA*         10.4
>
Hi Bjorn,

This is very impressive work, thank you for contributing it!

For these benchmarks, how are the AF_PACKET and AF_XDP numbers to be
compared? For instance, is rxdrop comparable to XDP_DROP and
"xdp_rxq_info --action XDP_DROP"? Given your explanation below, I
believe they are, but it might be better to make that clear up front.

Tom


> * NA since there is no XDP_DRV mode (without ZC) for TX in this RFC,
>   see challenges below.
>
> If we start by comparing XDP_SKB performance with copy mode in
> AF_PACKET V4, we can see that AF_XDP delivers 3-5 times the
> throughput, which is positive. We are also happy with the XDP_DRV
> performance that provides 11.6 Mpps for rxdrop, and should work on any
> driver implementing full XDP support. Now to the problematic part:
> XDP_DRV_ZC. The txpush (TX only) benchmark shows decent results at
> 21.8 Mpps and is better than it was for V4, even though we have spent
> no time optimizing the code in AF_XDP. (We did that in AF_PACKET V4.)
> But the RX performance is cut in half, which is not good. The
> reason for this is, for the major part, the new buffer allocator which
> is used for RX ZC only (at this point, see todo section). If you take
> a look at the XDP baseline numbers, introducing the buffer pool
> allocator drops the performance by around 30% or 10 Mpps which is
> obviously not acceptable. We clearly need to give this code some
> overdue performance love. But the overhanging question is how much
> overhead it will produce in the end and if this will be
> acceptable. Another thing to note is that V4 provided 33.7 Mpps for
> rxdrop, but with AF_XDP we are quite unlikely to get above the
> XDP_DROP number of 31.3, since we are reusing the XDP infrastructure
> and driver code on the RX side. So in the end, the AF_XDP XDP_DRV_ZC
> numbers will likely be lower than the V4 ZC numbers.
>
> We based this patch set on net-next commit 91e6dd828425 ("ipmr: Fix
> ptrdiff_t print formatting").
>
> Challenges: areas we would really appreciate your help on and that we
> are having substantial problems with.
>
> * We would like to, if possible, use ndo_xdp_xmit and ndo_xdp_flush
>   instead of introducing another NDO in the form of
>   ndo_xdp_xmit_xsk. The first reason we have been unable to
>   accomplish this is that if both paths use ndo_xdp_xmit, they will
>   create a race as ndo_xdp_xmit currently does not contain any
>   locking. How to implement some type of mutual exclusion here without
>   resorting to slowing down the NDO with a lock? The second problem is
>   that the XDP_REDIRECT code implicitly assumes that core id = queue
>   id. AF_XDP, on the other hand, explicitly specifies a queue id that
>   has nothing to do with the core id (the application can run on any
>   core). How to reconcile these two views in one ndo? If these two
>   problems can be solved, then we would still need to introduce a
>   completion callback and a get_packet callback, but this seems to be
>   less challenging. This would also make it possible to run TX in the
>   XDP_DRV mode (with the default page allocator).
>
> * What should the buffer allocator look like and how to make it
>   generic enough so it can be used by all NIC vendors? Would be great
>   if you could take a look at it and come with suggestions. As you can
>   see from the change log, it took some effort to rewire the i40e code
>   to use the buff pool, and we expect the same to be true for many
>   other NICs. Ideas on how to introduce multiple allocators into XDP in
>   a less intrusive way would be highly appreciated. Another question
>   is how to create a buffer pool that introduces very little
>   overhead. We do not know if the current one can be optimized to have
>   an acceptable overhead as we have not started any optimization
>   effort. But we will give it a try during the next week or so to see
>   where it leads.
>
> * In this RFC, do not use an XDP_REDIRECT action other than
>   bpf_xdpsk_redirect for XDP_DRV_ZC. This is because a zero-copy
>   allocated buffer will then be sent to a cpu id / queue_pair through
>   ndo_xdp_xmit that does not know this has been ZC allocated. It will
>   then do a page_free on it and you will get a crash. How to extend
>   ndo_xdp_xmit with some free/completion function that could be called
>   instead of page_free?  Hopefully, the same solution can be used here
>   as in the first problem item in this section.
>
> Caveats with this RFC. In contrast to the last section, we believe we
> have solutions for these, but we did not have time to fix them. We
> chose to show you all the code sooner rather than later, even though
> not everything works. Sorry.
>
> * This RFC is more immature (read, has more bugs) than the AF_PACKET
>   V4 RFC. Some known bugs are mentioned here; others are unknown.
>
> * We have done absolutely no optimization to this RFC. There is
>   (hopefully) some substantial low hanging fruit that we could fix
>   once we get to this, to improve XDP_DRV_ZC performance to levels
>   that we are not ashamed of and also bring the i40e driver to the
>   same performance levels it had before our changes, which is a must.
>
> * There is a race in the TX XSK clean up code in the i40e driver that
>   triggers a WARN_ON_ONCE. Clearly a bug that needs to be fixed. It
>   can be triggered by performing ifdown/ifup when the application is
>   running, or when changing the number of queues of the device
>   underneath the hood of the application. As a workaround, please
>   refrain from doing these two things without restarting the
>   application, as not all buffers will be returned in the TX
>   path. This bug can also be triggered when killing the application,
>   but has no negative effect in this case as the process will never
>   execute again.
>
> * Before this RFC, ndo_xdp_xmit triggered by an XDP_REDIRECT to a NIC
>   never modified the page count, so the redirect code could assume
>   that the page would still be valid after the NDO call. With the
>   introduction of the xsk_rcv path that is called as a result of an
>   XDP_REDIRECT to an AF_XDP socket, the page count will be decreased
>   if the page is copied out to user space, since we have no use for it
>   anymore. Our somewhat blunt solution to this is to make sure in the
>   i40e driver that the refcount is never under two. Note though, that
>   with the introduction of the buffer pool, this problem
>   disappears. This also means that XDP_DRV will not work out of the
>   box with a Niantic NIC, since it also needs this modification to
>   work. One question that we have is what should the semantics of
>   ndo_xdp_xmit be? Can we always assume that the page count will never
>   be changed by all possible netdevices that implement this NDO, or
>   should we remove this assumption to gain more device implementation
>   flexibility?
>
> To do:
>
> * Optimize performance. No optimization whatsoever was performed on
>   this RFC, in contrast to the previous one for AF_PACKET V4.
>
> * Loadable kernel module support.
>
> * Polling has not been implemented yet.
>
> * Optimize the user space sample application. It is simple but naive
>   at this point. The one for AF_PACKET V4 had a number of
>   optimizations that have not been introduced in the AF_XDP version.
>
> * Implement a way to pick the XDP_DRV mode even if XDP_DRV_ZC is
>   available. Would be nice to have for the sample application too.
>
> * Introduce a notifier chain for queue changes (caused by ethtool for
>   example). This would get rid of the error callback that we have at
>   this point.
>
> * Use one NAPI context for RX and another one for TX in i40e. This
>   would make it possible to run RX on one core and TX on another for
>   better performance. Today, they need to share a single core since
>   they share NAPI context.
>
> * Get rid of packet arrays (PA) and convert them to the buffer pool
>   allocator by transferring the necessary PA functionality into the
>   buffer pool. This has only been done for RX in ZC mode, while all
>   the other modes are still using packet arrays. Clearly, having two
>   structures with largely the same information is not a good thing.
>
> * Support for AF_XDP sockets without an XDP program loaded. In this
>   case all the traffic on a queue should go up to user space.
>
> * Support shared packet buffers
>
> * Support for packets spanning multiple frames
>
> Thanks: Björn and Magnus
>
> Björn Töpel (16):
>   xsk: AF_XDP sockets buildable skeleton
>   xsk: add user memory registration sockopt
>   xsk: added XDP_{R,T}X_RING sockopt and supporting structures
>   bpf: added bpf_xdpsk_redirect
>   net: wire up xsk support in the XDP_REDIRECT path
>   i40e: add support for XDP_REDIRECT
>   samples/bpf: added xdpsock program
>   xsk: add iterator functions to xsk_ring
>   i40e: introduce external allocator support
>   i40e: implemented page recycling buff_pool
>   i40e: start using recycling buff_pool
>   i40e: separated buff_pool interface from i40e implementation
>   xsk: introduce xsk_buff_pool
>   xdp: added buff_pool support to struct xdp_buff
>   xsk: add support for zero copy Rx
>   i40e: implement xsk sub-commands in ndo_bpf for zero copy Rx
>
> Magnus Karlsson (8):
>   xsk: add bind support and introduce Rx functionality
>   xsk: introduce Tx functionality
>   netdevice: added XDP_{UN,}REGISTER_XSK command to ndo_bpf
>   netdevice: added ndo for transmitting a packet from an XDP socket
>   xsk: add support for zero copy Tx
>   i40e: introduced a clean_tx callback function
>   i40e: introduced Tx completion callbacks
>   i40e: Tx support for zero copy allocator
>
>  drivers/net/ethernet/intel/i40e/Makefile         |    3 +-
>  drivers/net/ethernet/intel/i40e/i40e.h           |   24 +
>  drivers/net/ethernet/intel/i40e/i40e_buff_pool.c |  580 +++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_buff_pool.h |   15 +
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c   |    1 -
>  drivers/net/ethernet/intel/i40e/i40e_main.c      |  541 +++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c      |  906 +++++++++--------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h      |  119 ++-
>  include/linux/buff_pool.h                        |  136 +++
>  include/linux/filter.h                           |    3 +-
>  include/linux/netdevice.h                        |   25 +
>  include/linux/socket.h                           |    5 +-
>  include/net/xdp.h                                |    1 +
>  include/net/xdp_sock.h                           |   60 ++
>  include/uapi/linux/bpf.h                         |    6 +-
>  include/uapi/linux/if_xdp.h                      |   72 ++
>  net/Kconfig                                      |    1 +
>  net/Makefile                                     |    1 +
>  net/core/dev.c                                   |   28 +-
>  net/core/filter.c                                |   88 +-
>  net/core/sock.c                                  |   12 +-
>  net/xdp/Kconfig                                  |    7 +
>  net/xdp/Makefile                                 |    1 +
>  net/xdp/xsk.c                                    | 1142 ++++++++++++++++++++++
>  net/xdp/xsk.h                                    |   31 +
>  net/xdp/xsk_buff.h                               |  161 +++
>  net/xdp/xsk_buff_pool.c                          |  225 +++++
>  net/xdp/xsk_buff_pool.h                          |   17 +
>  net/xdp/xsk_packet_array.c                       |   62 ++
>  net/xdp/xsk_packet_array.h                       |  399 ++++++++
>  net/xdp/xsk_ring.c                               |   61 ++
>  net/xdp/xsk_ring.h                               |  419 ++++++++
>  net/xdp/xsk_user_queue.h                         |   24 +
>  samples/bpf/Makefile                             |    4 +
>  samples/bpf/xdpsock_kern.c                       |   11 +
>  samples/bpf/xdpsock_queue.h                      |   62 ++
>  samples/bpf/xdpsock_user.c                       |  642 ++++++++++++
>  security/selinux/hooks.c                         |    4 +-
>  security/selinux/include/classmap.h              |    4 +-
>  tools/testing/selftests/bpf/bpf_helpers.h        |    2 +
>  40 files changed, 5408 insertions(+), 497 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.c
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.h
>  create mode 100644 include/linux/buff_pool.h
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk.h
>  create mode 100644 net/xdp/xsk_buff.h
>  create mode 100644 net/xdp/xsk_buff_pool.c
>  create mode 100644 net/xdp/xsk_buff_pool.h
>  create mode 100644 net/xdp/xsk_packet_array.c
>  create mode 100644 net/xdp/xsk_packet_array.h
>  create mode 100644 net/xdp/xsk_ring.c
>  create mode 100644 net/xdp/xsk_ring.h
>  create mode 100644 net/xdp/xsk_user_queue.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_queue.h
>  create mode 100644 samples/bpf/xdpsock_user.c
>
> --
> 2.14.1
>
