Date:   Fri, 29 Sep 2017 09:09:40 +0200
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     John Fastabend <john.fastabend@...il.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        peter.waskiewicz.jr@...el.com, jakub.kicinski@...ronome.com,
        netdev@...r.kernel.org, Andy Gospodarek <andy@...yhouse.net>,
        brouer@...hat.com
Subject: Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access

On Wed, 27 Sep 2017 10:32:36 -0700
Alexei Starovoitov <alexei.starovoitov@...il.com> wrote:

> On Wed, Sep 27, 2017 at 04:54:57PM +0200, Jesper Dangaard Brouer wrote:
> > On Wed, 27 Sep 2017 06:35:40 -0700
> > John Fastabend <john.fastabend@...il.com> wrote:
> >   
> > > On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote:  
> > > > On Tue, 26 Sep 2017 21:58:53 +0200
> > > > Daniel Borkmann <daniel@...earbox.net> wrote:
> > > >     
> > > >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> > > >> [...]    
> > > >>> I'm currently implementing a cpumap type that transfers raw XDP frames
> > > >>> to another CPU, and the SKB is allocated on the remote CPU.  (It
> > > >>> actually works extremely well).
> > > >>
> > > >> Meaning you let all the XDP_PASS packets get processed on a
> > > >> different CPU, so you can reserve the whole CPU just for
> > > >> prefiltering, right?     
> > > > 
> > > > Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
> > > > The trick is using the map-flush point to transfer packets in bulk to
> > > > the remote CPU (single-call IPC is too slow), while at the same time
> > > > flushing single packets if NAPI didn't see a full bulk.
> > > >     
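
To make the map-flush trick above concrete, here is a rough, boiled-down
sketch of the per-CPU bulk-queue idea.  It is illustrative only, not the
actual patchset code; the struct and function names (xdp_bulk_queue,
bq_enqueue, etc.) are made up for the example.

  /* Frames destined for a remote CPU are staged in a small
   * per-destination array, and only handed over when the array fills,
   * or at the map-flush point at the end of the NAPI poll, so a
   * partial bulk (even a single packet) is never left behind.
   */
  #define BULK_SIZE 8

  struct xdp_bulk_queue {
          void *frames[BULK_SIZE];
          unsigned int count;
  };

  /* Hypothetical helper: one enqueue operation plus one kthread wakeup
   * for the whole bulk, amortizing the inter-CPU cost over 'count'
   * frames (per-packet single-call IPC would be too slow).
   */
  static void bq_flush_to_remote(struct xdp_bulk_queue *bq)
  {
          if (!bq->count)
                  return;
          /* ... single producer call into the remote CPU's queue ... */
          bq->count = 0;
  }

  /* Called per packet when XDP_REDIRECT targets the cpumap entry */
  static void bq_enqueue(struct xdp_bulk_queue *bq, void *frame)
  {
          if (bq->count == BULK_SIZE)
                  bq_flush_to_remote(bq);
          bq->frames[bq->count++] = frame;
  }

  /* Called from the map-flush point at the end of the NAPI poll */
  static void bq_napi_flush(struct xdp_bulk_queue *bq)
  {
          bq_flush_to_remote(bq);
  }
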
> > > >> Do you have some numbers to share at this point? Just curious, since
> > > >> you mention it works extremely well.
> > > > 
> > > > Sure... I've done a lot of benchmarking on this patchset ;-)
> > > > I have a benchmark program called xdp_redirect_cpu [1][2], that collects
> > > > stats via tracepoints (atm I'm limiting bulking to 8 packets, and have
> > > > tracepoints at the bulk spots, to amortize the tracepoint cost: 25ns/8=3.125ns)
> > > > 
> > > >  [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
> > > >  [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
> > > > 
> > > > Here I'm installing a DDoS program that drops UDP port 9 (pktgen
> > > > packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU that
> > > > the 11.9Mpps DDoS attack is hitting.
> > > > 
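
For reference, the XDP prog_num:4 used here is basically a
drop-UDP-dst-port-9 filter plus a cpumap redirect.  Below is a
boiled-down sketch of that idea; it is NOT the exact sample code from
[1].  It is written in current libbpf style, the map layout and SEC
names are chosen for the example, and the destination CPU is hard-coded
where the real sample configures it from userspace.

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <linux/ip.h>
  #include <linux/udp.h>
  #include <linux/in.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  struct {
          __uint(type, BPF_MAP_TYPE_CPUMAP);
          __uint(max_entries, 64);
          __type(key, __u32);
          __type(value, __u32);   /* queue size for the remote CPU */
  } cpu_map SEC(".maps");

  SEC("xdp")
  int xdp_drop_udp9_redirect_cpu(struct xdp_md *ctx)
  {
          void *data_end = (void *)(long)ctx->data_end;
          void *data     = (void *)(long)ctx->data;
          struct ethhdr *eth = data;
          struct iphdr *iph;
          struct udphdr *udph;

          if ((void *)(eth + 1) > data_end)
                  return XDP_PASS;
          if (eth->h_proto != bpf_htons(ETH_P_IP))
                  return XDP_PASS;

          iph = (void *)(eth + 1);
          if ((void *)(iph + 1) > data_end)
                  return XDP_PASS;

          if (iph->protocol == IPPROTO_UDP) {
                  udph = (void *)(iph + 1);       /* ignores IP options, for brevity */
                  if ((void *)(udph + 1) > data_end)
                          return XDP_PASS;
                  if (udph->dest == bpf_htons(9)) /* the pktgen DDoS flood */
                          return XDP_DROP;
          }

          /* Everything else (e.g. the netperf TCP flow): hand it to CPU 2
           * via the cpumap.  Hard-coded here only for illustration. */
          return bpf_redirect_map(&cpu_map, 2, 0);
  }

  char _license[] SEC("license") = "GPL";
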
> > > > Running XDP/eBPF prog_num:4
> > > > XDP-cpumap      CPU:to  pps            drop-pps    extra-info
> > > > XDP-RX          0       12,030,471     11,966,982  0          
> > > > XDP-RX          total   12,030,471     11,966,982 
> > > > cpumap-enqueue    0:2   63,488         0           0          
> > > > cpumap-enqueue  sum:2   63,488         0           0          
> > > > cpumap_kthread  2       63,488         0           3          time_exceed
> > > > cpumap_kthread  total   63,488         0           0          
> > > > redirect_err    total   0              0          
> > > > 
> > > > $ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -D1 -T5,5 -- -r 1024,1024
> > > > Local /Remote
> > > > Socket Size   Request  Resp.   Elapsed  Trans.
> > > > Send   Recv   Size     Size    Time     Rate         
> > > > bytes  Bytes  bytes    bytes   secs.    per sec   
> > > > 
> > > > 16384  87380  1024     1024    10.00    12735.97   
> > > > 16384  87380 
> > > > 
> > > > The netperf TCP_CRR performance is the same as without XDP loaded.
> > > >     
> > > 
> > > Just curious, could you also try this with RPS enabled (or does this have
> > > RPS enabled)? RPS should effectively do the same thing, but higher in the
> > > stack. I'm curious what the delta would be. Might be another interesting
> > > case, and fairly easy to set up if you already have the above scripts.
> > 
> > Yes, I'm essentially competing with RPS, so such a comparison is very
> > relevant...
> > 
> > This is only a 6 CPU system: allocate 2 CPUs to RPS receive and let
> > the other 4 CPUs process packets.
> > 
> > Summary of RPS (Receive Packet Steering) performance:
> >  * End result is 6.3 Mpps max performance
> >  * netperf TCP_CRR is 1 trans/sec.
> >  * Each RX-RPS CPU stalls at ~3.2Mpps.
> > 
> > The full test report with setup follows:
> > 
> > The mask needed::
> > 
> >  perl -e 'printf "%b\n",0x3C'
> >  111100
> > 
> > RPS setup::
> > 
> >  sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries'
> > 
> >  for N in $(seq 0 5) ; do \
> >    sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \
> >    sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \
> >    grep -H . /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \
> >  done
> > 
> > Reduce RX queues to two ::
> > 
> >  ethtool -L ixgbe1 combined 2
> > 
> > IRQ align to CPU numbers::
> > 
> >  $ ~/setup01.sh
> >  Not root, running with sudo
> >   --- Disable Ethernet flow-control ---
> >  rx unmodified, ignoring
> >  tx unmodified, ignoring
> >  no pause parameters changed, aborting
> >  rx unmodified, ignoring
> >  tx unmodified, ignoring
> >  no pause parameters changed, aborting
> >   --- Align IRQs ---
> >  /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0
> >  /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1
> >  /proc/irq/56/ixgbe1/../smp_affinity_list:0-5
> > 
> > $ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus
> > /sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c
> > /sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c
> > 
> > Generator is sending: 12,715,782 tx_packets /sec
> > 
> >  ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \
> >     -d 172.16.0.2 -t8
> > 
> > $ nstat > /dev/null && sleep 1 && nstat
> > #kernel
> > IpInReceives                    6346544            0.0
> > IpInDelivers                    6346544            0.0
> > IpOutRequests                   1020               0.0
> > IcmpOutMsgs                     1020               0.0
> > IcmpOutDestUnreachs             1020               0.0
> > IcmpMsgOutType3                 1020               0.0
> > UdpNoPorts                      6346898            0.0
> > IpExtInOctets                   291964714          0.0
> > IpExtOutOctets                  73440              0.0
> > IpExtInNoECTPkts                6347063            0.0
> > 
> > $ mpstat -P ALL -u -I SCPU -I SUM
> > 
> > Average:     CPU    %usr   %nice    %sys   %irq   %soft  %idle
> > Average:     all    0.00    0.00    0.00   0.42   72.97  26.61
> > Average:       0    0.00    0.00    0.00   0.17   99.83   0.00
> > Average:       1    0.00    0.00    0.00   0.17   99.83   0.00
> > Average:       2    0.00    0.00    0.00   0.67   60.37  38.96
> > Average:       3    0.00    0.00    0.00   0.67   58.70  40.64
> > Average:       4    0.00    0.00    0.00   0.67   59.53  39.80
> > Average:       5    0.00    0.00    0.00   0.67   58.93  40.40
> > 
> > Average:     CPU    intr/s
> > Average:     all 152067.22
> > Average:       0  50064.73
> > Average:       1  50089.35
> > Average:       2  45095.17
> > Average:       3  44875.04
> > Average:       4  44906.32
> > Average:       5  45152.08
> > 
> > Average:     CPU     TIMER/s   NET_TX/s   NET_RX/s TASKLET/s  SCHED/s     RCU/s
> > Average:       0      609.48       0.17   49431.28      0.00     2.66     21.13
> > Average:       1      567.55       0.00   49498.00      0.00     2.66     21.13
> > Average:       2      998.34       0.00   43941.60      4.16    82.86     68.22
> > Average:       3      540.60       0.17   44140.27      0.00    85.52    108.49
> > Average:       4      537.27       0.00   44219.63      0.00    84.53     64.89
> > Average:       5      530.78       0.17   44445.59      0.00    85.02     90.52
> > 
> > From mpstat it looks like it is the RX-RPS CPUs that are the bottleneck.
> > 
> > Show adapter(s) (ixgbe1) statistics (ONLY that changed!)
> > Ethtool(ixgbe1) stat:     11109531 (   11,109,531) <= fdir_miss /sec
> > Ethtool(ixgbe1) stat:    380632356 (  380,632,356) <= rx_bytes /sec
> > Ethtool(ixgbe1) stat:    812792611 (  812,792,611) <= rx_bytes_nic /sec
> > Ethtool(ixgbe1) stat:      1753550 (    1,753,550) <= rx_missed_errors /sec
> > Ethtool(ixgbe1) stat:      4602487 (    4,602,487) <= rx_no_dma_resources /sec
> > Ethtool(ixgbe1) stat:      6343873 (    6,343,873) <= rx_packets /sec
> > Ethtool(ixgbe1) stat:     10946441 (   10,946,441) <= rx_pkts_nic /sec
> > Ethtool(ixgbe1) stat:    190287853 (  190,287,853) <= rx_queue_0_bytes /sec
> > Ethtool(ixgbe1) stat:      3171464 (    3,171,464) <= rx_queue_0_packets /sec
> > Ethtool(ixgbe1) stat:    190344503 (  190,344,503) <= rx_queue_1_bytes /sec
> > Ethtool(ixgbe1) stat:      3172408 (    3,172,408) <= rx_queue_1_packets /sec
> > 
> > Notice, each RX-CPU can only process 3.1Mpps.
> > 
> > RPS RX-CPU(0):
> > 
> >  # Overhead  CPU  Symbol
> >  # ........  ...  .......................................
> >  #
> >     11.72%  000  [k] ixgbe_poll
> >     11.29%  000  [k] _raw_spin_lock
> >     10.35%  000  [k] dev_gro_receive
> >      8.36%  000  [k] __build_skb
> >      7.35%  000  [k] __skb_get_hash
> >      6.22%  000  [k] enqueue_to_backlog
> >      5.89%  000  [k] __skb_flow_dissect
> >      4.43%  000  [k] inet_gro_receive
> >      4.19%  000  [k] ___slab_alloc
> >      3.90%  000  [k] queued_spin_lock_slowpath
> >      3.85%  000  [k] kmem_cache_alloc
> >      3.06%  000  [k] build_skb
> >      2.66%  000  [k] get_rps_cpu
> >      2.57%  000  [k] napi_gro_receive
> >      2.34%  000  [k] eth_type_trans
> >      1.81%  000  [k] __cmpxchg_double_slab.isra.61
> >      1.47%  000  [k] ixgbe_alloc_rx_buffers
> >      1.43%  000  [k] get_partial_node.isra.81
> >      0.84%  000  [k] swiotlb_sync_single
> >      0.74%  000  [k] udp4_gro_receive
> >      0.73%  000  [k] netif_receive_skb_internal
> >      0.72%  000  [k] udp_gro_receive
> >      0.63%  000  [k] skb_gro_reset_offset
> >      0.49%  000  [k] __skb_flow_get_ports
> >      0.48%  000  [k] llist_add_batch
> >      0.36%  000  [k] swiotlb_sync_single_for_cpu
> >      0.34%  000  [k] __slab_alloc
> > 
> > 
> > Remote RPS-CPU(3) getting packets::
> > 
> >  # Overhead  CPU  Symbol
> >  # ........  ...  ..............................................
> >  #
> >     33.02%  003  [k] poll_idle
> >     10.99%  003  [k] __netif_receive_skb_core
> >     10.45%  003  [k] page_frag_free
> >      8.49%  003  [k] ip_rcv
> >      4.19%  003  [k] fib_table_lookup
> >      2.84%  003  [k] __udp4_lib_rcv
> >      2.81%  003  [k] __slab_free

Notice the slow-path of SLUB here.

> >      2.23%  003  [k] __udp4_lib_lookup
> >      2.09%  003  [k] ip_route_input_rcu
> >      2.07%  003  [k] kmem_cache_free
> >      2.06%  003  [k] udp_v4_early_demux
> >      1.73%  003  [k] ip_rcv_finish  
> 
> Very interesting data.

You removed some of the more interesting parts of the perf report, which
showed us hitting more of the SLUB slowpath for SKBs.  The slowpath
consists of many separate function calls, thus it doesn't bubble to the
top (the FlameGraph tool shows them more easily).

> So the above perf report compares to this xdp-redirect-cpu one:
> Perf top on a CPU(3) that has to alloc and free SKBs etc.
> 
> # Overhead  CPU  Symbol
> # ........  ...  .......................................
> #
>     15.51%  003  [k] fib_table_lookup
>      8.91%  003  [k] cpu_map_kthread_run
>      8.04%  003  [k] build_skb
>      7.88%  003  [k] page_frag_free
>      5.13%  003  [k] kmem_cache_alloc
>      4.76%  003  [k] ip_route_input_rcu
>      4.59%  003  [k] kmem_cache_free
>      4.02%  003  [k] __udp4_lib_rcv
>      3.20%  003  [k] fib_validate_source
>      3.02%  003  [k] __netif_receive_skb_core
>      3.02%  003  [k] udp_v4_early_demux
>      2.90%  003  [k] ip_rcv
>      2.80%  003  [k] ip_rcv_finish
> 
> right?
> and in the RPS case the consumer cpu is 33% idle, whereas in redirect-cpu
> you can load it up all the way.
> Am I interpreting all this correctly that with RPS cpu0 cannot
> distribute the packets to the other cpus fast enough and that's
> the bottleneck?

Yes, exactly. The work needed on the RPS cpu0 is simply too much.

> whereas in redirect-cpu you're doing early packet distribution
> before skb alloc?

Yes, the main point is to reduce the CPU cycles spent on the packet by
doing early packet distribution.

> So in other words with redirect-cpu all consumer cpus are doing
> skb alloc and in RPS cpu0 is allocating skbs for all?

Yes.

> and that's where 6M->12M performance gain comes from?

Yes, basically.  There are many small things that help this along, like
the cpumap case always hitting the SLUB fastpath.  Another big thing is
bulking. It is sort of hidden, but the XDP_REDIRECT flush mechanism is
implementing the RX bulking (which I've been "screaming" about for the
last couple of years! ;-))
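
For anyone not familiar with the redirect core: the driver side of that
bulking looks roughly like the sketch below.  It is a simplified
pseudo-driver, not any specific driver's code; the pseudo_* structs and
helpers are stand-ins, while bpf_prog_run_xdp(), xdp_do_redirect() and
xdp_do_flush_map() are the actual kernel entry points.

  /* The per-packet xdp_do_redirect() calls merely stage the frame in
   * the target map's bulk queue; the single xdp_do_flush_map() at the
   * end of the poll is the map-flush point that turns XDP_REDIRECT
   * into an RX bulking interface.
   */
  #include <linux/netdevice.h>
  #include <linux/filter.h>     /* bpf_prog_run_xdp(), xdp_do_redirect() */
  #include <net/xdp.h>

  /* Hypothetical driver state, just enough for the sketch */
  struct pseudo_rx_ring {
          struct napi_struct napi;
          struct net_device *netdev;
          struct bpf_prog *xdp_prog;
  };

  /* Hypothetical descriptor-handling helpers (stand-ins) */
  static bool pseudo_rx_desc_ready(struct pseudo_rx_ring *ring);
  static void pseudo_fill_xdp_buff(struct pseudo_rx_ring *ring, struct xdp_buff *xdp);
  static void pseudo_pass_to_stack(struct pseudo_rx_ring *ring, struct xdp_buff *xdp);
  static void pseudo_recycle_page(struct pseudo_rx_ring *ring, struct xdp_buff *xdp);

  static int pseudo_napi_poll(struct napi_struct *napi, int budget)
  {
          struct pseudo_rx_ring *ring = container_of(napi, struct pseudo_rx_ring, napi);
          int work_done = 0;

          while (work_done < budget && pseudo_rx_desc_ready(ring)) {
                  struct xdp_buff xdp;

                  pseudo_fill_xdp_buff(ring, &xdp);

                  switch (bpf_prog_run_xdp(ring->xdp_prog, &xdp)) {
                  case XDP_REDIRECT:
                          /* Only stages the frame; no cross-CPU work yet */
                          xdp_do_redirect(ring->netdev, &xdp, ring->xdp_prog);
                          break;
                  case XDP_PASS:
                          pseudo_pass_to_stack(ring, &xdp);
                          break;
                  case XDP_DROP:
                  default:
                          pseudo_recycle_page(ring, &xdp);
                          break;
                  }
                  work_done++;
          }

          /* The map-flush point: push any staged frames out in bulk,
           * including a partial bulk if fewer than 8 packets arrived. */
          xdp_do_flush_map();

          if (work_done < budget)
                  napi_complete_done(napi, work_done);
          return work_done;
  }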

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
