netdev - Re: [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage

Message-ID: <CALx6S36XRQDtk_HoYMaK2td7=EQO_Zuw2L7qNToU16UJdBBUUQ@mail.gmail.com>
Date:   Tue, 26 Jun 2018 13:48:41 -0700
From:   Tom Herbert <tom@...bertland.com>
To:     Edward Cree <ecree@...arflare.com>
Cc:     linux-net-drivers@...arflare.com,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>
Subject: Re: [RFC PATCH v2 net-next 00/12] Handle multiple received packets at
 each stage

On Tue, Jun 26, 2018 at 11:15 AM, Edward Cree <ecree@...arflare.com> wrote:
>
> This patch series adds the capability for the network stack to receive a
>  list of packets and process them as a unit, rather than handling each
>  packet singly in sequence.  This is done by factoring out the existing
>  datapath code at each layer and wrapping it in list handling code.
>
> The motivation for this change is twofold:
> * Instruction cache locality.  Currently, running the entire network
>   stack receive path on a packet involves more code than will fit in the
>   lowest-level icache, meaning that when the next packet is handled, the
>   code has to be reloaded from more distant caches.  By handling packets
>   in "row-major order", we ensure that the code at each layer is hot for
>   most of the list.  (There is a corresponding downside in _data_ cache
>   locality, since we are now touching every packet at every layer, but in
>   practice there is easily enough room in dcache to hold one cacheline of
>   each of the 64 packets in a NAPI poll.)
> * Reduction of indirect calls.  Owing to Spectre mitigations, indirect
>   function calls are now more expensive than ever; they are also heavily
>   used in the network stack's architecture (see [1]).  By replacing 64
>   indirect calls to the next-layer per-packet function with a single
>   indirect call to the next-layer list function, we can save CPU cycles.
>
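To make the indirect-call point concrete, here is a minimal userspace
sketch, not kernel code. The names struct pkt, struct layer_ops,
deliver_each() and deliver_list() are hypothetical stand-ins for SKBs
and protocol handlers; the counter only exists so the effect can be
observed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb: a singly linked packet. */
struct pkt {
	struct pkt *next;
	int len;
};

/* Illustration-only counters. */
static int indirect_calls;
static int packets_handled;

/* Next layer, reached through function pointers. */
struct layer_ops {
	void (*rcv)(struct pkt *p);          /* per-packet entry point */
	void (*rcv_list)(struct pkt *head);  /* list entry point */
};

static void next_rcv_one(struct pkt *p)
{
	(void)p;
	packets_handled++;
}

static void next_rcv_list(struct pkt *head)
{
	/* Inside the layer the per-packet work is a direct call. */
	for (struct pkt *p = head; p; p = p->next)
		next_rcv_one(p);
}

/* Per-packet delivery: one indirect call for every packet. */
static void deliver_each(const struct layer_ops *ops, struct pkt *head)
{
	struct pkt *next;

	for (struct pkt *p = head; p; p = next) {
		next = p->next;
		indirect_calls++;
		ops->rcv(p);
	}
}

/* List delivery: one indirect call for the whole batch. */
static void deliver_list(const struct layer_ops *ops, struct pkt *head)
{
	if (!head)
		return;
	indirect_calls++;
	ops->rcv_list(head);
}

/* Chain n packets, deliver them, and return the indirect-call count. */
int count_indirect_calls(int n, int use_list)
{
	struct pkt pkts[64];
	struct layer_ops ops = { next_rcv_one, next_rcv_list };

	assert(n >= 1 && n <= 64);
	for (int i = 0; i < n; i++) {
		pkts[i].len = 1;
		pkts[i].next = (i + 1 < n) ? &pkts[i + 1] : NULL;
	}
	indirect_calls = 0;
	packets_handled = 0;
	if (use_list)
		deliver_list(&ops, pkts);
	else
		deliver_each(&ops, pkts);
	assert(packets_handled == n);
	return indirect_calls;
}
```

For a batch of 64, count_indirect_calls(64, 0) counts 64 indirect calls
and count_indirect_calls(64, 1) counts one, which is the saving the
second bullet describes.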
> Drivers pass an SKB list to the stack at the end of the NAPI poll; this
>  gives a natural batch size (the NAPI poll weight) and avoids waiting at
>  the software level for further packets to make a larger batch (which
>  would add latency).  It also means that the batch size is automatically
>  tuned by the existing interrupt moderation mechanism.
> The stack then runs each layer of processing over all the packets in the
>  list before proceeding to the next layer.  Where the 'next layer' (or
>  the context in which it must run) differs among the packets, the stack
>  splits the list; this 'late demux' means that packets which differ only
>  in later headers (e.g. same L2/L3 but different L4) can traverse the
>  early part of the stack together.
> Also, where the next layer is not (yet) list-aware, the stack can revert
>  to calling the rest of the stack in a loop; this allows gradual/creeping
>  listification, with no 'flag day' patch needed to listify everything.
>
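The list splitting and the per-packet fallback described above can be
sketched roughly as follows. This is a userspace illustration under
stated assumptions, not the code in the series: the proto field, the
lookup() table and the counting handlers are all hypothetical, and
splitting on runs of consecutive packets with the same key is just one
way to keep packets in arrival order.

```c
#include <assert.h>
#include <stddef.h>

struct pkt {
	struct pkt *next;
	int proto;	/* hypothetical demux key for the next layer */
};

struct handler {
	void (*rcv)(struct pkt *p);          /* always present */
	void (*rcv_list)(struct pkt *head);  /* NULL if not list-aware */
};

/* Illustration-only counters. */
static int list_calls, single_calls;

static void count_single(struct pkt *p) { (void)p; single_calls++; }
static void count_list(struct pkt *head) { (void)head; list_calls++; }

/* Hypothetical lookup: only proto 0 has a list-aware handler. */
static struct handler lookup(int proto)
{
	struct handler h = { count_single, NULL };

	if (proto == 0)
		h.rcv_list = count_list;
	return h;
}

/* Hand one sublist to its handler, looping if it is not list-aware. */
static void dispatch_sublist(struct pkt *head)
{
	struct handler h = lookup(head->proto);
	struct pkt *next;

	if (h.rcv_list) {
		h.rcv_list(head);
		return;
	}
	for (struct pkt *p = head; p; p = next) {
		next = p->next;
		h.rcv(p);
	}
}

/* Split the list wherever the demux key changes, then dispatch. */
static void demux_list(struct pkt *head)
{
	while (head) {
		struct pkt *tail = head;
		struct pkt *rest;

		while (tail->next && tail->next->proto == head->proto)
			tail = tail->next;
		rest = tail->next;
		tail->next = NULL;	/* detach this run */
		dispatch_sublist(head);
		head = rest;
	}
}

/* Build a list from an array of keys and run it through the demux. */
int demo_demux(const int *protos, int n, int *singles_out)
{
	struct pkt pkts[64];

	assert(n >= 1 && n <= 64);
	for (int i = 0; i < n; i++) {
		pkts[i].proto = protos[i];
		pkts[i].next = (i + 1 < n) ? &pkts[i + 1] : NULL;
	}
	list_calls = 0;
	single_calls = 0;
	demux_list(pkts);
	*singles_out = single_calls;
	return list_calls;
}
```

With keys {0, 0, 0, 1, 1, 0} the sketch makes two list calls (for the
two runs of proto 0) and two per-packet calls (for the run of proto 1).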
> Patches 1-2 simply place received packets on a list during the event
>  processing loop on the sfc EF10 architecture, then call the normal stack
>  for each packet singly at the end of the NAPI poll.  (Analogues of patch
>  #2 for other NIC drivers should be fairly straightforward.)
> Patches 3-9 extend the list processing as far as the IP receive handler.
> Patches 10-12 apply the list techniques to Generic XDP, since the bpf_func
>  there is an indirect call.  In patch #12 we JIT a list_func that performs
>  list unwrapping and makes direct calls to the bpf_func.
>
> Patches 1-2 alone give about a 10% improvement in packet rate in the
>  baseline test; adding patches 3-9 raises this to around 25%.  Patches 10-
>  12, intended to improve Generic XDP performance, have in fact slightly
>  worsened it; I am unsure why this is and have included them in this RFC
>  in the hopes that someone will spot the reason.  If no progress is made I
>  will drop them from the series.
>
> Performance measurements were made with NetPerf UDP_STREAM, using 1-byte
>  packets and a single core to handle interrupts on the RX side; this was
>  in order to measure as simply as possible the packet rate handled by a
>  single core.  Figures are in Mbit/s; divide by 8 to obtain Mpps.  The
>  setup was tuned for maximum reproducibility, rather than raw performance.
>  Full details and more results (both with and without retpolines) are
>  presented in [2].
>
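As a worked example of the unit conversion: with 1-byte payloads each
packet carries 8 payload bits, so dividing Mbit/s by 8 gives Mpps. The
helper names below are hypothetical and the sketch is plain arithmetic,
not part of the test setup.

```c
#include <assert.h>

/* Convert a payload rate in Mbit/s to a packet rate in Mpps. */
double mbits_to_mpps(double mbit_per_s, int payload_bytes)
{
	assert(payload_bytes > 0);
	return mbit_per_s / (8.0 * payload_bytes);
}

/* Relative gain of a patched result over the datum, in percent. */
double gain_percent(double datum, double patched)
{
	assert(datum > 0.0);
	return (patched - datum) / datum * 100.0;
}
```

For instance, mbits_to_mpps(6.60, 1) is 0.825 Mpps. Computing
gain_percent(6.60, 8.35) from the rounded figures gives about 26.5%,
marginally different from the quoted 26.6%, presumably because the
quoted percentages were derived from unrounded measurements.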
> The baseline test uses four streams, and multiple RXQs all bound to a
>  single CPU (the netperf binary is bound to a neighbouring CPU).  These
>  tests were run with retpolines.
> net-next: 6.60 Mb/s (datum)
>  after 9: 8.35 Mb/s (datum + 26.6%)
> after 12: 8.29 Mb/s (datum + 25.6%)
> Note however that these results are not robust; changes in the parameters
>  of the test often shrink the gain to single-digit percentages.  For
>  instance, when using only a single RXQ, only a 4% gain was seen.  The
>  results also seem to change significantly each time the patch series is
>  rebased onto a new net-next; for instance the results in [3] with
>  retpolines (slide 9) show only 11.6% gain in the same test as above (the
>  post-patch performance is the same but the pre-patch datum is 7.5Mb/s).
>
Very nice! I really like the deliberate progression of functionality
in the patches; it makes following them very easy. I do think that the
XDP-related patches at the end of the set should be separated out.

I suspect the effects will vary a lot between architectures and
configurations, so I'm not too worried about the variance mentioned in
the performance numbers. For future work, it might also be worthwhile
to compare against the techniques used in VPP.

Tom

>
> I also performed tests with Generic XDP enabled (using a simple map-based
>  UDP port drop program with no entries in the map), both with and without
>  the eBPF JIT enabled.
> No JIT:
> net-next: 3.52 Mb/s (datum)
>  after 9: 4.91 Mb/s (datum + 39.5%)
> after 12: 4.83 Mb/s (datum + 37.3%)
>
> With JIT:
> net-next: 5.23 Mb/s (datum)
>  after 9: 6.64 Mb/s (datum + 27.0%)
> after 12: 6.46 Mb/s (datum + 23.6%)
>
> Another test variation was the use of software filtering/firewall rules.
>  Adding a single iptables rule (a UDP port drop on a port range not
>  matching the test traffic), thus making the netfilter hook have work to
>  do, reduced baseline performance but showed a similar delta from the
>  patches.  Similarly, testing with a set of TC flower filters (kindly
>  supplied by Cong Wang) in the single-RXQ setup (that previously gave 4%)
>  slowed down the baseline but not the patched performance, giving a 5.7%
>  performance delta.  These data suggest that the batching approach
>  remains effective in the presence of software switching rules.
>
> Changes from v1 (see [3]):
> * Rebased across 2 years' net-next movement (surprisingly straightforward).
>   - Added Generic XDP handling to netif_receive_skb_list_internal()
>   - Dealt with changes to PFMEMALLOC setting APIs
> * General cleanup of code and comments.
> * Skipped function calls for empty lists at various points in the stack
>   (patch #9).
> * Added listified Generic XDP handling (patches 10-12), though it doesn't
>   seem to help (see above).
> * Extended testing to cover software firewalls / netfilter etc.
>
> [1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf
> [2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf
> [3] http://lists.openwall.net/netdev/2016/04/19/89
>
> Edward Cree (12):
>   net: core: trivial netif_receive_skb_list() entry point
>   sfc: batch up RX delivery
>   net: core: unwrap skb list receive slightly further
>   net: core: Another step of skb receive list processing
>   net: core: another layer of lists, around PF_MEMALLOC skb handling
>   net: core: propagate SKB lists through packet_type lookup
>   net: ipv4: listified version of ip_rcv
>   net: ipv4: listify ip_rcv_finish
>   net: don't bother calling list RX functions on empty lists
>   net: listify Generic XDP processing, part 1
>   net: listify Generic XDP processing, part 2
>   net: listify jited Generic XDP processing on x86_64
>
>  arch/x86/net/bpf_jit_comp.c           | 164 ++++++++++++++
>  drivers/net/ethernet/sfc/efx.c        |  12 +
>  drivers/net/ethernet/sfc/net_driver.h |   3 +
>  drivers/net/ethernet/sfc/rx.c         |   7 +-
>  include/linux/filter.h                |  43 +++-
>  include/linux/netdevice.h             |   4 +
>  include/linux/netfilter.h             |  27 +++
>  include/linux/skbuff.h                |  16 ++
>  include/net/ip.h                      |   2 +
>  include/trace/events/net.h            |  14 ++
>  kernel/bpf/core.c                     |  38 +++-
>  net/core/dev.c                        | 415 +++++++++++++++++++++++++++++-----
>  net/core/filter.c                     |  10 +-
>  net/ipv4/af_inet.c                    |   1 +
>  net/ipv4/ip_input.c                   | 129 ++++++++++-
>  15 files changed, 810 insertions(+), 75 deletions(-)
>