Message-ID: <55238597-f966-13aa-2dae-4fde19456254@itcare.pl>
Date: Fri, 29 Nov 2019 23:13:49 +0100
From: Paweł Staszewski <pstaszewski@...are.pl>
To: netdev@...r.kernel.org
Subject: Re: Linux kernel - 5.4.0+ (net-next from 27.11.2019) routing/network
performance
On 29.11.2019 at 23:00, Paweł Staszewski wrote:
> As always, each year I need to summarize network performance for
> routing applications, i.e. a Linux router on the native Linux kernel
> (without XDP/DPDK/VPP etc.) :)
>
> HW setup:
>
> Server (Supermicro SYS-1019P-WTR)
>
> 1x Intel 6146
>
> 2x Mellanox ConnectX-5 (100G) (installed in two different x16 PCIe
> gen 3.1 slots)
>
> 6x 8GB DDR4-2666 (it really matters because 100G is about 12.5 GB/s of
> memory bandwidth in one direction)
>
>
> And here it is:
>
> perf top at 72 Gbit/s RX and 72 Gbit/s TX (at the same time)
>
> PerfTop: 91202 irqs/sec kernel:99.7% exact: 100.0% [4000Hz
> cycles:ppp], (all, 24 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>
> 7.56% [kernel] [k] __dev_queue_xmit
> 5.27% [kernel] [k] build_skb
> 4.41% [kernel] [k] rr_transmit
> 4.17% [kernel] [k] fib_table_lookup
> 3.83% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear
> 3.30% [kernel] [k] mlx5e_sq_xmit
> 3.14% [kernel] [k] __netif_receive_skb_core
> 2.48% [kernel] [k] netif_skb_features
> 2.36% [kernel] [k] _raw_spin_trylock
> 2.27% [kernel] [k] dev_hard_start_xmit
> 2.26% [kernel] [k] dev_gro_receive
> 2.20% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq
> 1.92% [kernel] [k] mlx5_eq_comp_int
> 1.91% [kernel] [k] mlx5e_poll_tx_cq
> 1.74% [kernel] [k] tcp_gro_receive
> 1.68% [kernel] [k] memcpy_erms
> 1.64% [kernel] [k] kmem_cache_free_bulk
> 1.57% [kernel] [k] inet_gro_receive
> 1.55% [kernel] [k] netdev_pick_tx
> 1.52% [kernel] [k] ip_forward
> 1.45% [kernel] [k] team_xmit
> 1.40% [kernel] [k] vlan_do_receive
> 1.37% [kernel] [k] team_handle_frame
> 1.36% [kernel] [k] __build_skb
> 1.33% [kernel] [k] ipt_do_table
> 1.33% [kernel] [k] mlx5e_poll_rx_cq
> 1.28% [kernel] [k] ip_finish_output2
> 1.26% [kernel] [k] vlan_passthru_hard_header
> 1.20% [kernel] [k] netdev_core_pick_tx
> 0.93% [kernel] [k] ip_rcv_core.isra.22.constprop.27
> 0.87% [kernel] [k] validate_xmit_skb.isra.148
> 0.87% [kernel] [k] ip_route_input_rcu
> 0.78% [kernel] [k] kmem_cache_alloc
> 0.77% [kernel] [k] mlx5e_handle_rx_dim
> 0.71% [kernel] [k] iommu_need_mapping
> 0.69% [kernel] [k] tasklet_action_common.isra.21
> 0.66% [kernel] [k] mlx5e_xmit
> 0.65% [kernel] [k] mlx5e_post_rx_mpwqes
> 0.63% [kernel] [k] _raw_spin_lock
> 0.61% [kernel] [k] ip_sublist_rcv
> 0.57% [kernel] [k] skb_release_data
> 0.53% [kernel] [k] __local_bh_enable_ip
> 0.53% [kernel] [k] tcp4_gro_receive
> 0.51% [kernel] [k] pfifo_fast_dequeue
> 0.51% [kernel] [k] page_frag_free
> 0.50% [kernel] [k] kmem_cache_free
> 0.47% [kernel] [k] dma_direct_map_page
> 0.45% [kernel] [k] native_irq_return_iret
> 0.44% [kernel] [k] __slab_free.isra.89
> 0.43% [kernel] [k] skb_gro_receive
> 0.43% [kernel] [k] napi_gro_receive
> 0.43% [kernel] [k] __do_softirq
> 0.41% [kernel] [k] sch_direct_xmit
> 0.41% [kernel] [k] ip_rcv_finish_core.isra.19
> 0.40% [kernel] [k] skb_network_protocol
> 0.40% [kernel] [k] __get_xps_queue_idx
>
>
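A side note for anyone who wants to reproduce the numbers: the profile
above was captured with roughly the following - the exact flags may have
differed, so treat this as a sketch:

  # sample precise cycles on all CPUs at ~4 kHz (perf top is system-wide
  # by default); cycles:ppp needs PMU precise-sampling support
  perf top -e cycles:ppp -F 4000
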
> I'm using team (2x 100G LAG) - that is why there is some load in:
>
> 4.41% [kernel] [k] rr_transmit
>
>
>
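For completeness, the team device uses the roundrobin runner (that is what
ends up in rr_transmit). Roughly like this - a sketch, not the exact
config I run:

  # roundrobin team over the two ConnectX-5 ports (sketch only)
  teamd -d -t team0 -c '{"runner": {"name": "roundrobin"},
                         "ports": {"enp179s0f0": {}, "enp179s0f1": {}}}'
  ip link set team0 up
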
> No discards on interfaces:
>
> ethtool -S enp179s0f0 | grep disc
> rx_discards_phy: 0
> tx_discards_phy: 0
>
> ethtool -S enp179s0f1 | grep disc
> rx_discards_phy: 0
> tx_discards_phy: 0
>
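Besides the phy discard counters, mlx5 also exposes an rx_out_of_buffer
counter (the counter name is from memory, so take it as an assumption)
which is a good hint that the host side cannot post RX buffers fast
enough - worth checking as well:

  # out_of_buffer = HW had no RX descriptors posted, i.e. host-side bottleneck
  ethtool -S enp179s0f0 | grep -E 'out_of_buffer|discard'
  ethtool -S enp179s0f1 | grep -E 'out_of_buffer|discard'
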
> STREAM memory bandwidth test at 72G/72G traffic:
>
> -------------------------------------------------------------
> Function     Best Rate MB/s   Avg time    Min time    Max time
> Copy:              38948.8    0.004368    0.004108    0.004533
> Scale:             37914.6    0.004473    0.004220    0.004802
> Add:               43134.6    0.005801    0.005564    0.006086
> Triad:             42934.1    0.005696    0.005590    0.005901
> -------------------------------------------------------------
>
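A rough back-of-the-envelope to put the STREAM numbers next to the
traffic (assuming payload is DMA-written once into memory on RX and
DMA-read once on TX, ignoring the extra CPU touches for headers/GRO):

  72 Gbit/s / 8                 ~=  9 GB/s per direction
  RX DMA in + TX DMA out        ~= 18 GB/s of raw payload movement
  at 100G + 100G it would be    ~= 25 GB/s
  vs. ~38-43 GB/s best-case STREAM rates above
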
>
> And some links to screenshots
>
> Softirqs
>
> https://pasteboard.co/IIZkGrw.png
>
> And bandwidth / CPU / pps graphs
>
> https://pasteboard.co/IIZl6XP.png
>
>
> Currently it looks like the biggest problem for 100G is cpu->mem->nic
> bandwidth, or the NIC doorbell / page cache at RX processing - because
> what I can see is that if I run iperf on this host I can TX a full
> 100G, but I can't RX 100G when I flood this host from a packet
> generator (it starts to drop packets at around 82 Gbit/s) - and this
> is not a pps problem, it is a bandwidth problem.
>
> For example, I can flood RX with 14 Mpps of 64b packets without NIC
> discards, but I can't flood it with 1000b frames at the same pps -
> because when it reaches 82 Gbit/s the NICs start to report discards.
>
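One more thing that helps to narrow the cpu->mem->nic hypothesis down is
the negotiated PCIe link of each adapter - x16 gen3 is ~126 Gbit/s raw
after encoding, and noticeably less effective once TLP/descriptor
overhead is counted, so a downtrained link would show up exactly in this
range. Something like (bus addresses guessed from the enp179s0f* names):

  # expect LnkSta: Speed 8GT/s, Width x16 on both ports
  lspci -vv -s b3:00.0 | grep LnkSta
  lspci -vv -s b3:00.1 | grep LnkSta
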
>
> Thanks
>
>
>
Forgot to add: this is a forwarding scenario - the router is routing
packets from one 100G interface to another 100G interface and vice versa
(full BGP feed x4 from 4 different upstreams) - 700k+ flows.