Message-ID: <5a121fc4-fb6c-c70b-d674-9bf13c325b64@redhat.com>
Date: Tue, 16 Nov 2021 10:43:25 +0100
From: Jesper Dangaard Brouer <jbrouer@...hat.com>
To: Ciara Loftus <ciara.loftus@...el.com>, netdev@...r.kernel.org,
bpf@...r.kernel.org
Cc: brouer@...hat.com, ast@...nel.org, daniel@...earbox.net,
davem@...emloft.net, kuba@...nel.org, hawk@...nel.org,
john.fastabend@...il.com, toke@...hat.com, bjorn@...nel.org,
magnus.karlsson@...el.com, jonathan.lemon@...il.com,
maciej.fijalkowski@...el.com
Subject: Re: [RFC PATCH bpf-next 0/8] XDP_REDIRECT_XSK and Batched AF_XDP Rx
On 16/11/2021 08.37, Ciara Loftus wrote:
> The common case for AF_XDP sockets (xsks) is creating a single xsk on a queue for sending and
> receiving frames as this is analogous to HW packet steering through RSS and other classification
> methods in the NIC. AF_XDP uses the xdp redirect infrastructure to direct packets to the socket. It
> was designed for the much more complicated case of DEVMAP xdp_redirects which directs traffic to
> another netdev and thus potentially another driver. In the xsk redirect case, by skipping the
> unnecessary parts of this common code we can significantly improve performance and pave the way
> for batching in the driver. This RFC proposes one such way to simplify the infrastructure which
> yields a 27% increase in throughput and a decrease in cycles per packet of 24 cycles [1]. The goal
> of this RFC is to start a discussion on how best to simplify the single-socket datapath while
> providing one method as an example.
>
> Current approach:
> 1. XSK pointer: an xsk is created and a handle to the xsk is stored in the XSKMAP.
> 2. XDP program: bpf_redirect_map helper triggers the XSKMAP lookup which stores the result (handle
> to the xsk) and the map type (XSKMAP) in the percpu bpf_redirect_info struct. The XDP_REDIRECT
> action is returned.
> 3. XDP_REDIRECT handling called by the driver: the map type (XSKMAP) is read from the
> bpf_redirect_info which selects the xsk_map_redirect path. The xsk pointer is retrieved from the
> bpf_redirect_info and the XDP descriptor is pushed to the xsk's Rx ring. The socket is added to a
> list for flushing later.
> 4. xdp_do_flush: iterate through the lists of all maps that can be used for redirect (CPUMAP,
> DEVMAP and XSKMAP). When XSKMAP is flushed, go through all xsks that had any traffic redirected to
> them and bump the Rx ring head pointer(s).
>
> For the end goal of submitting the descriptor to the Rx ring and bumping the head pointer of that
> ring, only some of these steps are needed. The rest is overhead. The bpf_redirect_map
> infrastructure is needed for all other redirect operations, but is not necessary when redirecting
> to a single AF_XDP socket. And similarly, flushing the list for every map type in step 4 is not
> necessary when only one socket needs to be flushed.
>
> Proposed approach:
> 1. XSK pointer: an xsk is created and a handle to the xsk is stored both in the XSKMAP and also the
> netdev_rx_queue struct.
> 2. XDP program: new bpf_redirect_xsk helper returns XDP_REDIRECT_XSK.
> 3. XDP_REDIRECT_XSK handling called by the driver: the xsk pointer is retrieved from the
> netdev_rx_queue struct and the XDP descriptor is pushed to the xsk's Rx ring.
> 4. xsk_flush: fetch the handle from the netdev_rx_queue and flush the xsk.
>
> This fast path is triggered on XDP_REDIRECT_XSK if:
> (i) AF_XDP socket SW Rx ring configured
> (ii) Exactly one xsk attached to the queue
> If any of these conditions are not met, fall back to the same behavior as the original approach:
> xdp_redirect_map. This is handled under-the-hood in the new bpf_xdp_redirect_xsk helper so the user
> does not need to be aware of these conditions.
>
> Batching:
> With this new approach it is possible to optimize the driver by submitting a batch of descriptors
> to the Rx ring in Step 3 of the new approach by simply verifying that the action returned from
> every program run of each packet in a batch equals XDP_REDIRECT_XSK. That's because with this
> action we know the socket to redirect to will be the same for each packet in the batch. This is
> not possible with XDP_REDIRECT because the xsk pointer is stored in the bpf_redirect_info and not
> guaranteed to be the same for every packet in a batch.
>
> [1] Performance:
> The benchmarks were performed on a VM running on a 2.4GHz Ice Lake host with an i40e device passed
> through. The xdpsock app was run on a single core with busy polling and configured in 'rxonly' mode.
> ./xdpsock -i <iface> -r -B
> The improvement in throughput when using the new bpf helper and XDP action was measured at ~13% for
> scalar processing, with reduction in cycles per packet of ~13. A further ~14% improvement in
> throughput and reduction of ~11 cycles per packet was measured when the batched i40e driver path
> was used, for a total improvement of ~27% in throughput and reduction of ~24 cycles per packet.
>
> Other approaches considered:
> Two other approaches were considered. The advantage of both being that neither involved introducing
> a new XDP action. The first alternative approach considered was to create a new map type
> BPF_MAP_TYPE_XSKMAP_DIRECT. When the XDP_REDIRECT action was returned, this map type could be
> checked and used as an indicator to skip the map lookup and use the netdev_rx_queue xsk instead.
> The second approach considered was similar and involved using a new bpf_redirect_info flag which
> could be used in a similar fashion.
> While both approaches yielded a performance improvement they were measured at about half of what
> was measured for the approach outlined in this RFC. It seems using bpf_redirect_info is too
> expensive.
I think it was Bjørn who discovered that accessing the per-CPU
bpf_redirect_info struct has an overhead of approx 2 ns (times 2.4 GHz
= ~4.8 cycles). Your reduction in cycles per packet was ~13, of which
~4.8 cycles seems to be a large part.
The code accesses this_cpu_ptr(&bpf_redirect_info) two times:
once in the BPF-helper redirect call and a second time in xdp_do_redirect.
(Hint: xdp_redirect_map ends up calling __bpf_xdp_redirect_map.)
Thus, it seems (as you say) that the bpf_redirect_info approach is too
expensive. Maybe we should look at storing bpf_redirect_info in a place
that doesn't require the this_cpu_ptr() lookup... or cache the lookup
per NAPI cycle.
Have you tried this?
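To illustrate the idea, here is a rough userspace sketch (not kernel code).
All names in it (struct redirect_info, get_redirect_info, poll_uncached,
poll_cached, ns_to_cycles) are hypothetical stand-ins; the ~2 ns cost per
lookup is the estimate mentioned above, not something this code measures.
It only counts how many simulated this_cpu_ptr() lookups each pattern
performs per NAPI poll.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's per-CPU bpf_redirect_info. */
struct redirect_info {
	unsigned int tgt_index;
	unsigned long lookups;	/* counts simulated this_cpu_ptr() calls */
};

/* This model has a single "CPU", so one static instance is enough. */
static struct redirect_info cpu_info;

/* Models this_cpu_ptr(&bpf_redirect_info); assumed to cost ~2 ns each. */
static struct redirect_info *get_redirect_info(void)
{
	cpu_info.lookups++;
	return &cpu_info;
}

/* Convert a latency in nanoseconds to cycles at a clock given in GHz. */
static double ns_to_cycles(double ns, double ghz)
{
	return ns * ghz;
}

/* Current pattern: two lookups per packet (helper + xdp_do_redirect). */
static void poll_uncached(unsigned int budget)
{
	for (unsigned int i = 0; i < budget; i++) {
		struct redirect_info *ri = get_redirect_info();	/* BPF helper */

		ri->tgt_index = i;
		ri = get_redirect_info();	/* xdp_do_redirect */
		(void)ri->tgt_index;
	}
}

/* Suggested pattern: one lookup per NAPI poll, reused for every packet. */
static void poll_cached(unsigned int budget)
{
	struct redirect_info *ri = get_redirect_info();

	for (unsigned int i = 0; i < budget; i++) {
		ri->tgt_index = i;	/* BPF helper writes the target */
		(void)ri->tgt_index;	/* redirect handling reads it back */
	}
}
```

With a NAPI budget of 64, the first pattern does 128 lookups and the second
does one. How the cached pointer would actually be passed to the BPF helper
and to xdp_do_redirect is left open here.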
> Also, centralised processing of XDP actions was investigated. This would involve porting all drivers
> to a common interface for handling XDP actions which would greatly simplify the work involved in
> adding support for new XDP actions such as XDP_REDIRECT_XSK. However it was deemed at this point to
> be more complex than adding support for the new action to every driver. Should this series be
> considered worth pursuing for a proper patch set, the intention would be to update each driver
> individually.
I'm fine with adding a new helper, but I don't like introducing a new
XDP_REDIRECT_XSK action, which requires updating ALL the drivers.
With the XDP_REDIRECT infra we believed we didn't need to add more
XDP-action code to drivers, as we multiplex/add new features by
extending the bpf_redirect_info.
In this extreme performance case, it seems the this_cpu_ptr "lookup" of
bpf_redirect_info is the performance issue itself.
Could you experiment with different approaches that modify
xdp_do_redirect() to handle the case where the new bpf_redirect_xsk
helper was called, prior to the this_cpu_ptr() call?
(Thus avoiding the introduction of a new XDP action.)
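One possible shape of this, as a userspace sketch only: the helper leaves
a marker somewhere that is cheap to reach, and the redirect handler checks
it before touching the per-CPU struct. Everything below is an assumption
made for illustration (struct pkt, its redirect_xsk flag, helper_redirect_xsk,
do_redirect, slow_path_redirect); where such a marker could really live in
the kernel is exactly the open question.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal models of the objects involved; not the kernel definitions. */
struct xsk {
	unsigned long rx_count;		/* descriptors pushed to the Rx ring */
};

struct rx_queue {
	struct xsk *xsk;		/* models the handle in netdev_rx_queue */
};

struct pkt {
	struct rx_queue *rxq;
	bool redirect_xsk;		/* hypothetical marker set by the helper */
};

/* Counts how often the simulated per-CPU bpf_redirect_info is touched. */
static unsigned long percpu_lookups;

/* Models the existing map-based redirect, which needs the per-CPU struct. */
static int slow_path_redirect(struct pkt *p)
{
	(void)p;
	percpu_lookups++;
	return 0;
}

/* Models the new helper: mark the packet only if the queue has an xsk. */
static void helper_redirect_xsk(struct pkt *p)
{
	p->redirect_xsk = (p->rxq != NULL && p->rxq->xsk != NULL);
}

/* Models xdp_do_redirect(): check the marker before any per-CPU access. */
static int do_redirect(struct pkt *p)
{
	if (p->redirect_xsk) {
		p->rxq->xsk->rx_count++;	/* push to the queue's xsk */
		p->redirect_xsk = false;
		return 0;
	}
	return slow_path_redirect(p);
}
```

The driver still only sees XDP_REDIRECT in this model; the fast path is
chosen inside the common redirect code, and the fallback remains the
existing map-based path.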
> Thank you to Magnus Karlsson and Maciej Fijalkowski for several suggestions and insight provided.
>
> TODO:
> * Add selftest(s)
> * Add support for all copy and zero copy drivers
> * Libxdp support
>
> The series applies on commit e5043894b21f ("bpftool: Use libbpf_get_error() to check error")
>
> Thanks,
> Ciara
>
> Ciara Loftus (8):
> xsk: add struct xdp_sock to netdev_rx_queue
> bpf: add bpf_redirect_xsk helper and XDP_REDIRECT_XSK action
> xsk: handle XDP_REDIRECT_XSK and expose xsk_rcv/flush
> i40e: handle the XDP_REDIRECT_XSK action
> xsk: implement a batched version of xsk_rcv
> i40e: isolate descriptor processing in separate function
> i40e: introduce batched XDP rx descriptor processing
> libbpf: use bpf_redirect_xsk in the default program
>
> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 13 +-
> .../ethernet/intel/i40e/i40e_txrx_common.h | 1 +
> drivers/net/ethernet/intel/i40e/i40e_xsk.c | 285 +++++++++++++++---
> include/linux/netdevice.h | 2 +
> include/net/xdp_sock_drv.h | 49 +++
> include/net/xsk_buff_pool.h | 22 ++
> include/uapi/linux/bpf.h | 13 +
> kernel/bpf/verifier.c | 7 +-
> net/core/dev.c | 14 +
> net/core/filter.c | 26 ++
> net/xdp/xsk.c | 69 ++++-
> net/xdp/xsk_queue.h | 31 ++
> tools/include/uapi/linux/bpf.h | 13 +
> tools/lib/bpf/xsk.c | 50 ++-
> 14 files changed, 551 insertions(+), 44 deletions(-)
>