Message-ID: <d8d4b059c4d7dff8f05839c5042051b41fdd1a7a.camel@mellanox.com>
Date: Thu, 1 Nov 2018 20:37:40 +0000
From: Saeed Mahameed <saeedm@...lanox.com>
To: "pstaszewski@...are.pl" <pstaszewski@...are.pl>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal users
traffic
On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:
>
> W dniu 01.11.2018 o 10:50, Saeed Mahameed pisze:
> > On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:
> > > Hi
> > >
> > > So maybe someone will be interested in how the Linux kernel
> > > handles normal traffic (not pktgen :) )
> > >
> > >
> > > Server HW configuration:
> > >
> > > CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> > >
> > > NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
> > >
> > >
> > > Server software:
> > >
> > > FRR - as routing daemon
> > >
> > > enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound
> > > to the local numa node)
> > >
> > > enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to
> > > the local numa node)
> > >
> > >
> > > Maximum traffic that the server can handle:
> > >
> > > Bandwidth
> > >
> > > bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> > > input: /proc/net/dev type: rate
> > >   \        iface                 Rx                 Tx              Total
> > > ==========================================================================
> > >       enp175s0f1:         28.51 Gb/s         37.24 Gb/s         65.74 Gb/s
> > >       enp175s0f0:         38.07 Gb/s         28.44 Gb/s         66.51 Gb/s
> > > --------------------------------------------------------------------------
> > >            total:         66.58 Gb/s         65.67 Gb/s        132.25 Gb/s
> > >
> > >
> > > Packets per second:
> > >
> > > bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> > > input: /proc/net/dev type: rate
> > >   -        iface                 Rx                 Tx              Total
> > > ==========================================================================
> > >       enp175s0f1:      5248589.00 P/s     3486617.75 P/s     8735207.00 P/s
> > >       enp175s0f0:      3557944.25 P/s     5232516.00 P/s     8790460.00 P/s
> > > --------------------------------------------------------------------------
> > >            total:      8806533.00 P/s     8719134.00 P/s    17525668.00 P/s
> > >
> > >
> > > After reaching those limits, the NICs on the upstream side (more
> > > RX traffic) start to drop packets.
> > >
> > >
> > > I just don't understand why the server can't handle more bandwidth
> > > (~40Gbit/s is the limit where all CPUs are at 100% util) - while
> > > pps on the RX side keep increasing.
> > >
> >
> > Where do you see 40 Gb/s? You showed that both ports on the same
> > NIC (same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) =
> > 132.25 Gb/s, which aligns with your pcie link limit. What am I
> > missing?
>
> Hmm, yes, that was my concern also - because I can't find anywhere
> information on whether that bandwidth figure is uni- or bidirectional
> - so if 126Gbit for x16 8GT is unidir, then bidir will be 126/2
> ~68Gbit - which would fit the total bw on both ports.
I think it is bidir.
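
For reference, a rough sketch of where that 126Gbit figure comes from
(assuming pcie gen3, 8 GT/s per lane with 128b/130b encoding):
16 lanes * 8 GT/s * 128/130 ~= 126 Gbit/s of raw link bandwidth,
before any TLP/DLLP protocol overhead, so the rate usable for actual
DMA payload is somewhat lower than that in either case.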
> Maybe this can also explain why the cpu load rises rapidly from
> 120Gbit/s in total to 132Gbit (the bwm-ng counters come from
> /proc/net/dev - so there can be some error in reading them when
> offloading (gro/gso/tso) is enabled on the nics).
>
> >
> > > I was thinking that maybe I had reached some pcie x16 limit - but
> > > x16 8GT is 126Gbit - and also when testing with pktgen I can
> > > reach more bw and pps (like 4x more compared to normal internet
> > > traffic).
> > >
> >
> > Are you forwarding when using pktgen as well, or are you just
> > testing the RX side pps?
>
> Yes, pktgen was tested on a single port, RX only.
> I can also check forwarding, to eliminate pcie limits.
>
So this explains why you get more RX pps: since tx is idle, the pcie
link is free to do only rx.
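
It is also worth double checking what the slot actually negotiated, in
case the link trained below x16/8GT. A quick sketch (run as root; the
pci address is read back from ethtool, af:00.0 here is only a guess
based on the enp175s0f0 name):

$ ethtool -i enp175s0f0 | grep bus-info
$ lspci -s af:00.0 -vv | grep -E 'LnkCap|LnkSta'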
[...]
> >
> > > ethtool -S enp175s0f1
> > > NIC statistics:
> > > rx_packets: 173730800927
> > > rx_bytes: 99827422751332
> > > tx_packets: 142532009512
> > > tx_bytes: 184633045911222
> > > tx_tso_packets: 25989113891
> > > tx_tso_bytes: 132933363384458
> > > tx_tso_inner_packets: 0
> > > tx_tso_inner_bytes: 0
> > > tx_added_vlan_packets: 74630239613
> > > tx_nop: 2029817748
> > > rx_lro_packets: 0
> > > rx_lro_bytes: 0
> > > rx_ecn_mark: 0
> > > rx_removed_vlan_packets: 173730800927
> > > rx_csum_unnecessary: 0
> > > rx_csum_none: 434357
> > > rx_csum_complete: 173730366570
> > > rx_csum_unnecessary_inner: 0
> > > rx_xdp_drop: 0
> > > rx_xdp_redirect: 0
> > > rx_xdp_tx_xmit: 0
> > > rx_xdp_tx_full: 0
> > > rx_xdp_tx_err: 0
> > > rx_xdp_tx_cqe: 0
> > > tx_csum_none: 38260960853
> > > tx_csum_partial: 36369278774
> > > tx_csum_partial_inner: 0
> > > tx_queue_stopped: 1
> > > tx_queue_dropped: 0
> > > tx_xmit_more: 748638099
> > > tx_recover: 0
> > > tx_cqes: 73881645031
> > > tx_queue_wake: 1
> > > tx_udp_seg_rem: 0
> > > tx_cqe_err: 0
> > > tx_xdp_xmit: 0
> > > tx_xdp_full: 0
> > > tx_xdp_err: 0
> > > tx_xdp_cqes: 0
> > > rx_wqe_err: 0
> > > rx_mpwqe_filler_cqes: 0
> > > rx_mpwqe_filler_strides: 0
> > > rx_buff_alloc_err: 0
> > > rx_cqe_compress_blks: 0
> > > rx_cqe_compress_pkts: 0
> >
> > If this is a pcie bottleneck, it might be useful to enable CQE
> > compression (to reduce PCIe completion descriptor transactions).
> > You should see the above rx_cqe_compress_pkts counter increasing
> > when it is enabled.
> >
> > $ ethtool --set-priv-flags enp175s0f1 rx_cqe_compress on
> > $ ethtool --show-priv-flags enp175s0f1
> > Private flags for p6p1:
> > rx_cqe_moder : on
> > cqe_moder : off
> > rx_cqe_compress : on
> > ...
> >
> > try this on both interfaces.
>
> Done
> ethtool --show-priv-flags enp175s0f1
> Private flags for enp175s0f1:
> rx_cqe_moder : on
> tx_cqe_moder : off
> rx_cqe_compress : on
> rx_striding_rq : off
> rx_no_csum_complete: off
>
> ethtool --show-priv-flags enp175s0f0
> Private flags for enp175s0f0:
> rx_cqe_moder : on
> tx_cqe_moder : off
> rx_cqe_compress : on
> rx_striding_rq : off
> rx_no_csum_complete: off
>
Did it help reduce the load on the pcie? Do you see more pps?
What is the ratio between rx_cqe_compress_pkts and the overall rx
packets?
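
Something like this quick awk sketch over the ethtool counters should
show that ratio (counter names taken from the -S dumps above; they are
cumulative, so sampling over a fixed interval gives a more meaningful
number):

for dev in enp175s0f0 enp175s0f1; do
    ethtool -S $dev | awk -v dev=$dev '
        /^ +rx_packets:/        { rx   = $2 }   # total rx packets
        /rx_cqe_compress_pkts:/ { comp = $2 }   # packets whose CQEs were compressed
        END { if (rx > 0) printf "%s: %.1f%% compressed\n", dev, 100 * comp / rx }'
done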
[...]
> > > ethtool -S enp175s0f0
> > > NIC statistics:
> > > rx_packets: 141574897253
> > > rx_bytes: 184445040406258
> > > tx_packets: 172569543894
> > > tx_bytes: 99486882076365
> > > tx_tso_packets: 9367664195
> > > tx_tso_bytes: 56435233992948
> > > tx_tso_inner_packets: 0
> > > tx_tso_inner_bytes: 0
> > > tx_added_vlan_packets: 141297671626
> > > tx_nop: 2102916272
> > > rx_lro_packets: 0
> > > rx_lro_bytes: 0
> > > rx_ecn_mark: 0
> > > rx_removed_vlan_packets: 141574897252
> > > rx_csum_unnecessary: 0
> > > rx_csum_none: 23135854
> > > rx_csum_complete: 141551761398
> > > rx_csum_unnecessary_inner: 0
> > > rx_xdp_drop: 0
> > > rx_xdp_redirect: 0
> > > rx_xdp_tx_xmit: 0
> > > rx_xdp_tx_full: 0
> > > rx_xdp_tx_err: 0
> > > rx_xdp_tx_cqe: 0
> > > tx_csum_none: 127934791664
> >
> > It is a good idea to look into this: tx is not requesting hw tx
> > csumming for a lot of packets, so maybe you are wasting a lot of
> > cpu on calculating csums, or maybe this is just the rx csum
> > complete..
> >
> > > tx_csum_partial: 13362879974
> > > tx_csum_partial_inner: 0
> > > tx_queue_stopped: 232561
> >
> > TX queues are stalling, which could be an indication of the pcie
> > bottleneck.
> >
> > > tx_queue_dropped: 0
> > > tx_xmit_more: 1266021946
> > > tx_recover: 0
> > > tx_cqes: 140031716469
> > > tx_queue_wake: 232561
> > > tx_udp_seg_rem: 0
> > > tx_cqe_err: 0
> > > tx_xdp_xmit: 0
> > > tx_xdp_full: 0
> > > tx_xdp_err: 0
> > > tx_xdp_cqes: 0
> > > rx_wqe_err: 0
> > > rx_mpwqe_filler_cqes: 0
> > > rx_mpwqe_filler_strides: 0
> > > rx_buff_alloc_err: 0
> > > rx_cqe_compress_blks: 0
> > > rx_cqe_compress_pkts: 0
> > > rx_page_reuse: 0
> > > rx_cache_reuse: 16625975793
> > > rx_cache_full: 54161465914
> > > rx_cache_empty: 258048
> > > rx_cache_busy: 54161472735
> > > rx_cache_waive: 0
> > > rx_congst_umr: 0
> > > rx_arfs_err: 0
> > > ch_events: 40572621887
> > > ch_poll: 40885650979
> > > ch_arm: 40429276692
> > > ch_aff_change: 0
> > > ch_eq_rearm: 0
> > > rx_out_of_buffer: 2791690
> > > rx_if_down_packets: 74
> > > rx_vport_unicast_packets: 141843476308
> > > rx_vport_unicast_bytes: 185421265403318
> > > tx_vport_unicast_packets: 172569484005
> > > tx_vport_unicast_bytes: 100019940094298
> > > rx_vport_multicast_packets: 85122935
> > > rx_vport_multicast_bytes: 5761316431
> > > tx_vport_multicast_packets: 6452
> > > tx_vport_multicast_bytes: 643540
> > > rx_vport_broadcast_packets: 22423624
> > > rx_vport_broadcast_bytes: 1390127090
> > > tx_vport_broadcast_packets: 22024
> > > tx_vport_broadcast_bytes: 1321440
> > > rx_vport_rdma_unicast_packets: 0
> > > rx_vport_rdma_unicast_bytes: 0
> > > tx_vport_rdma_unicast_packets: 0
> > > tx_vport_rdma_unicast_bytes: 0
> > > rx_vport_rdma_multicast_packets: 0
> > > rx_vport_rdma_multicast_bytes: 0
> > > tx_vport_rdma_multicast_packets: 0
> > > tx_vport_rdma_multicast_bytes: 0
> > > tx_packets_phy: 172569501577
> > > rx_packets_phy: 142871314588
> > > rx_crc_errors_phy: 0
> > > tx_bytes_phy: 100710212814151
> > > rx_bytes_phy: 187209224289564
> > > tx_multicast_phy: 6452
> > > tx_broadcast_phy: 22024
> > > rx_multicast_phy: 85122933
> > > rx_broadcast_phy: 22423623
> > > rx_in_range_len_errors_phy: 2
> > > rx_out_of_range_len_phy: 0
> > > rx_oversize_pkts_phy: 0
> > > rx_symbol_err_phy: 0
> > > tx_mac_control_phy: 0
> > > rx_mac_control_phy: 0
> > > rx_unsupported_op_phy: 0
> > > rx_pause_ctrl_phy: 0
> > > tx_pause_ctrl_phy: 0
> > > rx_discards_phy: 920161423
> >
> > Ok, this port seems to be suffering more: RX is congested, maybe
> > due to the pcie bottleneck.
>
> Yes, this side is receiving more traffic - the second port has ~10G more tx.
>
[...]
> > > Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> > > Average:      17    0.00    0.00   16.60    0.00    0.00   52.10    0.00    0.00    0.00   31.30
> > > Average:      18    0.00    0.00   13.90    0.00    0.00   61.20    0.00    0.00    0.00   24.90
> > > Average:      19    0.00    0.00    9.99    0.00    0.00   70.33    0.00    0.00    0.00   19.68
> > > Average:      20    0.00    0.00    9.00    0.00    0.00   73.00    0.00    0.00    0.00   18.00
> > > Average:      21    0.00    0.00    8.70    0.00    0.00   73.90    0.00    0.00    0.00   17.40
> > > Average:      22    0.00    0.00   15.42    0.00    0.00   58.56    0.00    0.00    0.00   26.03
> > > Average:      23    0.00    0.00   10.81    0.00    0.00   71.67    0.00    0.00    0.00   17.52
> > > Average:      24    0.00    0.00   10.00    0.00    0.00   71.80    0.00    0.00    0.00   18.20
> > > Average:      25    0.00    0.00   11.19    0.00    0.00   71.13    0.00    0.00    0.00   17.68
> > > Average:      26    0.00    0.00   11.00    0.00    0.00   70.80    0.00    0.00    0.00   18.20
> > > Average:      27    0.00    0.00   10.01    0.00    0.00   69.57    0.00    0.00    0.00   20.42
> >
> > The numa cores are not at 100% util, you have around 20% of idle on
> > each one.
>
> Yes - not 100% cpu - but the difference between 80% and 100% is like
> pushing an additional 1-2Gbit/s.
>
Yes, but it doesn't look like the bottleneck is the cpu, although it
is close to being one :)..
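
If the cpus do become the limit, it might be worth a quick profile of
the busy cores to see what dominates the ~60-70% softirq time (fib
lookup, memory allocation, checksumming, etc.). A rough sketch, with
the core list taken from the mpstat output:

$ perf record -C 17-27,42-55 -g -- sleep 10
$ perf report --stdio | head -40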
> >
> > > Average:      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      29    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      32    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      33    0.00    0.00    3.90    0.00    0.00    0.00    0.00    0.00    0.00   96.10
> > > Average:      34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      35    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      36    0.10    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.70
> > > Average:      37    0.20    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   99.50
> > > Average:      38    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      39    0.00    0.00    2.60    0.00    0.00    0.00    0.00    0.00    0.00   97.40
> > > Average:      40    0.00    0.00    0.90    0.00    0.00    0.00    0.00    0.00    0.00   99.10
> > > Average:      41    0.10    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   99.40
> > > Average:      42    0.00    0.00    9.91    0.00    0.00   70.67    0.00    0.00    0.00   19.42
> > > Average:      43    0.00    0.00   15.90    0.00    0.00   57.50    0.00    0.00    0.00   26.60
> > > Average:      44    0.00    0.00   12.20    0.00    0.00   66.20    0.00    0.00    0.00   21.60
> > > Average:      45    0.00    0.00   12.00    0.00    0.00   67.50    0.00    0.00    0.00   20.50
> > > Average:      46    0.00    0.00   12.90    0.00    0.00   65.50    0.00    0.00    0.00   21.60
> > > Average:      47    0.00    0.00   14.59    0.00    0.00   60.84    0.00    0.00    0.00   24.58
> > > Average:      48    0.00    0.00   13.59    0.00    0.00   61.74    0.00    0.00    0.00   24.68
> > > Average:      49    0.00    0.00   18.36    0.00    0.00   53.29    0.00    0.00    0.00   28.34
> > > Average:      50    0.00    0.00   15.32    0.00    0.00   58.86    0.00    0.00    0.00   25.83
> > > Average:      51    0.00    0.00   17.60    0.00    0.00   55.20    0.00    0.00    0.00   27.20
> > > Average:      52    0.00    0.00   15.92    0.00    0.00   56.06    0.00    0.00    0.00   28.03
> > > Average:      53    0.00    0.00   13.00    0.00    0.00   62.30    0.00    0.00    0.00   24.70
> > > Average:      54    0.00    0.00   13.20    0.00    0.00   61.50    0.00    0.00    0.00   25.30
> > > Average:      55    0.00    0.00   14.59    0.00    0.00   58.64    0.00    0.00    0.00   26.77
> > >
> > >
> > > ethtool -k enp175s0f0
> > > Features for enp175s0f0:
> > > rx-checksumming: on
> > > tx-checksumming: on
> > > tx-checksum-ipv4: on
> > > tx-checksum-ip-generic: off [fixed]
> > > tx-checksum-ipv6: on
> > > tx-checksum-fcoe-crc: off [fixed]
> > > tx-checksum-sctp: off [fixed]
> > > scatter-gather: on
> > > tx-scatter-gather: on
> > > tx-scatter-gather-fraglist: off [fixed]
> > > tcp-segmentation-offload: on
> > > tx-tcp-segmentation: on
> > > tx-tcp-ecn-segmentation: off [fixed]
> > > tx-tcp-mangleid-segmentation: off
> > > tx-tcp6-segmentation: on
> > > udp-fragmentation-offload: off
> > > generic-segmentation-offload: on
> > > generic-receive-offload: on
> > > large-receive-offload: off [fixed]
> > > rx-vlan-offload: on
> > > tx-vlan-offload: on
> > > ntuple-filters: off
> > > receive-hashing: on
> > > highdma: on [fixed]
> > > rx-vlan-filter: on
> > > vlan-challenged: off [fixed]
> > > tx-lockless: off [fixed]
> > > netns-local: off [fixed]
> > > tx-gso-robust: off [fixed]
> > > tx-fcoe-segmentation: off [fixed]
> > > tx-gre-segmentation: on
> > > tx-gre-csum-segmentation: on
> > > tx-ipxip4-segmentation: off [fixed]
> > > tx-ipxip6-segmentation: off [fixed]
> > > tx-udp_tnl-segmentation: on
> > > tx-udp_tnl-csum-segmentation: on
> > > tx-gso-partial: on
> > > tx-sctp-segmentation: off [fixed]
> > > tx-esp-segmentation: off [fixed]
> > > tx-udp-segmentation: on
> > > fcoe-mtu: off [fixed]
> > > tx-nocache-copy: off
> > > loopback: off [fixed]
> > > rx-fcs: off
> > > rx-all: off
> > > tx-vlan-stag-hw-insert: on
> > > rx-vlan-stag-hw-parse: off [fixed]
> > > rx-vlan-stag-filter: on [fixed]
> > > l2-fwd-offload: off [fixed]
> > > hw-tc-offload: off
> > > esp-hw-offload: off [fixed]
> > > esp-tx-csum-hw-offload: off [fixed]
> > > rx-udp_tunnel-port-offload: on
> > > tls-hw-tx-offload: off [fixed]
> > > tls-hw-rx-offload: off [fixed]
> > > rx-gro-hw: off [fixed]
> > > tls-hw-record: off [fixed]
> > >
> > > ethtool -c enp175s0f0
> > > Coalesce parameters for enp175s0f0:
> > > Adaptive RX: off TX: on
> > > stats-block-usecs: 0
> > > sample-interval: 0
> > > pkt-rate-low: 0
> > > pkt-rate-high: 0
> > > dmac: 32703
> > >
> > > rx-usecs: 256
> > > rx-frames: 128
> > > rx-usecs-irq: 0
> > > rx-frames-irq: 0
> > >
> > > tx-usecs: 8
> > > tx-frames: 128
> > > tx-usecs-irq: 0
> > > tx-frames-irq: 0
> > >
> > > rx-usecs-low: 0
> > > rx-frame-low: 0
> > > tx-usecs-low: 0
> > > tx-frame-low: 0
> > >
> > > rx-usecs-high: 0
> > > rx-frame-high: 0
> > > tx-usecs-high: 0
> > > tx-frame-high: 0
> > >
> > > ethtool -g enp175s0f0
> > > Ring parameters for enp175s0f0:
> > > Pre-set maximums:
> > > RX: 8192
> > > RX Mini: 0
> > > RX Jumbo: 0
> > > TX: 8192
> > > Current hardware settings:
> > > RX: 4096
> > > RX Mini: 0
> > > RX Jumbo: 0
> > > TX: 4096
> > >
> > >
> > >
> > >
> > >
> > >
>
> Also changed the coalesce params a little - and the best for this config are:
> ethtool -c enp175s0f0
> Coalesce parameters for enp175s0f0:
> Adaptive RX: off TX: off
> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
> dmac: 32573
>
> rx-usecs: 40
> rx-frames: 128
> rx-usecs-irq: 0
> rx-frames-irq: 0
>
> tx-usecs: 8
> tx-frames: 8
> tx-usecs-irq: 0
> tx-frames-irq: 0
>
> rx-usecs-low: 0
> rx-frame-low: 0
> tx-usecs-low: 0
> tx-frame-low: 0
>
> rx-usecs-high: 0
> rx-frame-high: 0
> tx-usecs-high: 0
> tx-frame-high: 0
>
>
> Fewer drops on the RX side - and more pps forwarded overall.
>
How much improvement? Maybe we can improve our adaptive rx coalescing
to be efficient for this workload.
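
For anyone wanting to try the same settings, they should be reachable
with something like this (a sketch only - repeat for both ports, and
compare rx_discards_phy / rx_out_of_buffer and the pps counters before
and after):

$ ethtool -C enp175s0f0 adaptive-rx off adaptive-tx off \
          rx-usecs 40 rx-frames 128 tx-usecs 8 tx-frames 8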