Message-ID: <d8d4b059c4d7dff8f05839c5042051b41fdd1a7a.camel@mellanox.com>
Date: Thu, 1 Nov 2018 20:37:40 +0000
From: Saeed Mahameed <saeedm@...lanox.com>
To: "pstaszewski@...are.pl" <pstaszewski@...are.pl>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal users
traffic
On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:
>
> W dniu 01.11.2018 o 10:50, Saeed Mahameed pisze:
> > On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:
> > > Hi
> > >
> > > So maybe someone will be interested in how the Linux kernel
> > > handles normal traffic (not pktgen :) )
> > >
> > >
> > > Server HW configuration:
> > >
> > > CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> > >
> > > NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
> > >
> > >
> > > Server software:
> > >
> > > FRR - as routing daemon
> > >
> > > enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound
> > > to the local numa node)
> > >
> > > enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to
> > > the local numa node)
> > >
> > >
> > > Maximum traffic that the server can handle:
> > >
> > > Bandwidth
> > >
> > > bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> > > input: /proc/net/dev type: rate
> > >   \        iface                 Rx                 Tx              Total
> > > ==========================================================================
> > >       enp175s0f1:         28.51 Gb/s         37.24 Gb/s         65.74 Gb/s
> > >       enp175s0f0:         38.07 Gb/s         28.44 Gb/s         66.51 Gb/s
> > > --------------------------------------------------------------------------
> > >            total:         66.58 Gb/s         65.67 Gb/s        132.25 Gb/s
> > >
> > >
> > > Packets per second:
> > >
> > > bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> > > input: /proc/net/dev type: rate
> > >   -        iface                 Rx                 Tx              Total
> > > ==========================================================================
> > >       enp175s0f1:      5248589.00 P/s     3486617.75 P/s     8735207.00 P/s
> > >       enp175s0f0:      3557944.25 P/s     5232516.00 P/s     8790460.00 P/s
> > > --------------------------------------------------------------------------
> > >            total:      8806533.00 P/s     8719134.00 P/s    17525668.00 P/s
> > >
> > >
> > > After reaching those limits, the NICs on the upstream side (more
> > > RX traffic) start to drop packets.
> > >
> > >
> > > I just don't understand why the server can't handle more bandwidth
> > > (~40Gbit/s is the limit where all CPUs are at 100% util) - while
> > > pps on the RX side keep increasing.
> > >
> >
> > Where do you see 40 Gb/s? You showed that both ports on the same
> > NIC (same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) =
> > 132.25 Gb/s, which aligns with your pcie link limit. What am I
> > missing?
>
> Hmm, yes, that was my concern also - because I can't find anywhere
> information on whether that bandwidth figure is uni- or bidirectional
> - so if 126Gbit for x16 8GT is unidir, then bidir will be 126/2
> ~68Gbit - which would fit the total bw on both ports.
I think it is bidir.
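
For reference, a rough sketch of where that 126Gbit figure comes from
(assuming pcie gen3, 8 GT/s per lane with 128b/130b encoding):
16 lanes * 8 GT/s * 128/130 ~= 126 Gbit/s of raw link bandwidth,
before any TLP/DLLP protocol overhead, so the rate usable for actual
DMA payload is somewhat lower than that in either case.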
> Maybe this can also explain why the cpu load rises rapidly from
> 120Gbit/s in total to 132Gbit (the bwm-ng counters come from
> /proc/net/dev - so there can be some error in reading them when
> offloading (gro/gso/tso) is enabled on the nics).
>
> >
> > > I was thinking that maybe I had reached some pcie x16 limit - but
> > > x16 8GT is 126Gbit - and also when testing with pktgen I can
> > > reach more bw and pps (like 4x more compared to normal internet
> > > traffic).
> > >
> >
> > Are you forwarding when using pktgen as well, or are you just
> > testing the RX side pps?
>
> Yes, pktgen was tested on a single port, RX only.
> I can also check forwarding, to eliminate pcie limits.
>
So this explains why you get more RX pps: since tx is idle, the pcie
link is free to do only rx.
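
It is also worth double checking what the slot actually negotiated, in
case the link trained below x16/8GT. A quick sketch (run as root; the
pci address is read back from ethtool, af:00.0 here is only a guess
based on the enp175s0f0 name):

$ ethtool -i enp175s0f0 | grep bus-info
$ lspci -s af:00.0 -vv | grep -E 'LnkCap|LnkSta'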
[...]
> >
> > > ethtool -S enp175s0f1
> > > NIC statistics:
> > > rx_packets: 173730800927
> > > rx_bytes: 99827422751332
> > > tx_packets: 142532009512
> > > tx_bytes: 184633045911222
> > > tx_tso_packets: 25989113891
> > > tx_tso_bytes: 132933363384458
> > > tx_tso_inner_packets: 0
> > > tx_tso_inner_bytes: 0
> > > tx_added_vlan_packets: 74630239613
> > > tx_nop: 2029817748
> > > rx_lro_packets: 0
> > > rx_lro_bytes: 0
> > > rx_ecn_mark: 0
> > > rx_removed_vlan_packets: 173730800927
> > > rx_csum_unnecessary: 0
> > > rx_csum_none: 434357
> > > rx_csum_complete: 173730366570
> > > rx_csum_unnecessary_inner: 0
> > > rx_xdp_drop: 0
> > > rx_xdp_redirect: 0
> > > rx_xdp_tx_xmit: 0
> > > rx_xdp_tx_full: 0
> > > rx_xdp_tx_err: 0
> > > rx_xdp_tx_cqe: 0
> > > tx_csum_none: 38260960853
> > > tx_csum_partial: 36369278774
> > > tx_csum_partial_inner: 0
> > > tx_queue_stopped: 1
> > > tx_queue_dropped: 0
> > > tx_xmit_more: 748638099
> > > tx_recover: 0
> > > tx_cqes: 73881645031
> > > tx_queue_wake: 1
> > > tx_udp_seg_rem: 0
> > > tx_cqe_err: 0
> > > tx_xdp_xmit: 0
> > > tx_xdp_full: 0
> > > tx_xdp_err: 0
> > > tx_xdp_cqes: 0
> > > rx_wqe_err: 0
> > > rx_mpwqe_filler_cqes: 0
> > > rx_mpwqe_filler_strides: 0
> > > rx_buff_alloc_err: 0
> > > rx_cqe_compress_blks: 0
> > > rx_cqe_compress_pkts: 0
> >
> > If this is a pcie bottleneck, it might be useful to enable CQE
> > compression (to reduce PCIe completion descriptor transactions).
> > You should see the above rx_cqe_compress_pkts counter increasing
> > when it is enabled.
> >
> > $ ethtool --set-priv-flags enp175s0f1 rx_cqe_compress on
> > $ ethtool --show-priv-flags enp175s0f1
> > Private flags for p6p1:
> > rx_cqe_moder : on
> > cqe_moder : off
> > rx_cqe_compress : on
> > ...
> >
> > try this on both interfaces.
>
> Done
> ethtool --show-priv-flags enp175s0f1
> Private flags for enp175s0f1:
> rx_cqe_moder : on
> tx_cqe_moder : off
> rx_cqe_compress : on
> rx_striding_rq : off
> rx_no_csum_complete: off
>
> ethtool --show-priv-flags enp175s0f0
> Private flags for enp175s0f0:
> rx_cqe_moder : on
> tx_cqe_moder : off
> rx_cqe_compress : on
> rx_striding_rq : off
> rx_no_csum_complete: off
>
Did it help reduce the load on the pcie? Do you see more pps?
What is the ratio between rx_cqe_compress_pkts and the overall rx
packets?
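
Something like this quick awk sketch over the ethtool counters should
show that ratio (counter names taken from the -S dumps above; they are
cumulative, so sampling over a fixed interval gives a more meaningful
number):

for dev in enp175s0f0 enp175s0f1; do
    ethtool -S $dev | awk -v dev=$dev '
        /^ +rx_packets:/        { rx   = $2 }   # total rx packets
        /rx_cqe_compress_pkts:/ { comp = $2 }   # packets whose CQEs were compressed
        END { if (rx > 0) printf "%s: %.1f%% compressed\n", dev, 100 * comp / rx }'
done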
[...]
> > > ethtool -S enp175s0f0
> > > NIC statistics:
> > > rx_packets: 141574897253
> > > rx_bytes: 184445040406258
> > > tx_packets: 172569543894
> > > tx_bytes: 99486882076365
> > > tx_tso_packets: 9367664195
> > > tx_tso_bytes: 56435233992948
> > > tx_tso_inner_packets: 0
> > > tx_tso_inner_bytes: 0
> > > tx_added_vlan_packets: 141297671626
> > > tx_nop: 2102916272
> > > rx_lro_packets: 0
> > > rx_lro_bytes: 0
> > > rx_ecn_mark: 0
> > > rx_removed_vlan_packets: 141574897252
> > > rx_csum_unnecessary: 0
> > > rx_csum_none: 23135854
> > > rx_csum_complete: 141551761398
> > > rx_csum_unnecessary_inner: 0
> > > rx_xdp_drop: 0
> > > rx_xdp_redirect: 0
> > > rx_xdp_tx_xmit: 0
> > > rx_xdp_tx_full: 0
> > > rx_xdp_tx_err: 0
> > > rx_xdp_tx_cqe: 0
> > > tx_csum_none: 127934791664
> >
> > It is a good idea to look into this: tx is not requesting hw tx
> > csumming for a lot of packets, so maybe you are wasting a lot of
> > cpu on calculating csums, or maybe this is just the rx csum
> > complete..
> >
> > > tx_csum_partial: 13362879974
> > > tx_csum_partial_inner: 0
> > > tx_queue_stopped: 232561
> >
> > TX queues are stalling, which could be an indication of the pcie
> > bottleneck.
> >
> > > tx_queue_dropped: 0
> > > tx_xmit_more: 1266021946
> > > tx_recover: 0
> > > tx_cqes: 140031716469
> > > tx_queue_wake: 232561
> > > tx_udp_seg_rem: 0
> > > tx_cqe_err: 0
> > > tx_xdp_xmit: 0
> > > tx_xdp_full: 0
> > > tx_xdp_err: 0
> > > tx_xdp_cqes: 0
> > > rx_wqe_err: 0
> > > rx_mpwqe_filler_cqes: 0
> > > rx_mpwqe_filler_strides: 0
> > > rx_buff_alloc_err: 0
> > > rx_cqe_compress_blks: 0
> > > rx_cqe_compress_pkts: 0
> > > rx_page_reuse: 0
> > > rx_cache_reuse: 16625975793
> > > rx_cache_full: 54161465914
> > > rx_cache_empty: 258048
> > > rx_cache_busy: 54161472735
> > > rx_cache_waive: 0
> > > rx_congst_umr: 0
> > > rx_arfs_err: 0
> > > ch_events: 40572621887
> > > ch_poll: 40885650979
> > > ch_arm: 40429276692
> > > ch_aff_change: 0
> > > ch_eq_rearm: 0
> > > rx_out_of_buffer: 2791690
> > > rx_if_down_packets: 74
> > > rx_vport_unicast_packets: 141843476308
> > > rx_vport_unicast_bytes: 185421265403318
> > > tx_vport_unicast_packets: 172569484005
> > > tx_vport_unicast_bytes: 100019940094298
> > > rx_vport_multicast_packets: 85122935
> > > rx_vport_multicast_bytes: 5761316431
> > > tx_vport_multicast_packets: 6452
> > > tx_vport_multicast_bytes: 643540
> > > rx_vport_broadcast_packets: 22423624
> > > rx_vport_broadcast_bytes: 1390127090
> > > tx_vport_broadcast_packets: 22024
> > > tx_vport_broadcast_bytes: 1321440
> > > rx_vport_rdma_unicast_packets: 0
> > > rx_vport_rdma_unicast_bytes: 0
> > > tx_vport_rdma_unicast_packets: 0
> > > tx_vport_rdma_unicast_bytes: 0
> > > rx_vport_rdma_multicast_packets: 0
> > > rx_vport_rdma_multicast_bytes: 0
> > > tx_vport_rdma_multicast_packets: 0
> > > tx_vport_rdma_multicast_bytes: 0
> > > tx_packets_phy: 172569501577
> > > rx_packets_phy: 142871314588
> > > rx_crc_errors_phy: 0
> > > tx_bytes_phy: 100710212814151
> > > rx_bytes_phy: 187209224289564
> > > tx_multicast_phy: 6452
> > > tx_broadcast_phy: 22024
> > > rx_multicast_phy: 85122933
> > > rx_broadcast_phy: 22423623
> > > rx_in_range_len_errors_phy: 2
> > > rx_out_of_range_len_phy: 0
> > > rx_oversize_pkts_phy: 0
> > > rx_symbol_err_phy: 0
> > > tx_mac_control_phy: 0
> > > rx_mac_control_phy: 0
> > > rx_unsupported_op_phy: 0
> > > rx_pause_ctrl_phy: 0
> > > tx_pause_ctrl_phy: 0
> > > rx_discards_phy: 920161423
> >
> > Ok, this port seems to be suffering more: RX is congested, maybe
> > due to the pcie bottleneck.
>
> Yes, this side is receiving more traffic - the second port has ~10G more tx.
>
[...]
> > > Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> > > Average:      17    0.00    0.00   16.60    0.00    0.00   52.10    0.00    0.00    0.00   31.30
> > > Average:      18    0.00    0.00   13.90    0.00    0.00   61.20    0.00    0.00    0.00   24.90
> > > Average:      19    0.00    0.00    9.99    0.00    0.00   70.33    0.00    0.00    0.00   19.68
> > > Average:      20    0.00    0.00    9.00    0.00    0.00   73.00    0.00    0.00    0.00   18.00
> > > Average:      21    0.00    0.00    8.70    0.00    0.00   73.90    0.00    0.00    0.00   17.40
> > > Average:      22    0.00    0.00   15.42    0.00    0.00   58.56    0.00    0.00    0.00   26.03
> > > Average:      23    0.00    0.00   10.81    0.00    0.00   71.67    0.00    0.00    0.00   17.52
> > > Average:      24    0.00    0.00   10.00    0.00    0.00   71.80    0.00    0.00    0.00   18.20
> > > Average:      25    0.00    0.00   11.19    0.00    0.00   71.13    0.00    0.00    0.00   17.68
> > > Average:      26    0.00    0.00   11.00    0.00    0.00   70.80    0.00    0.00    0.00   18.20
> > > Average:      27    0.00    0.00   10.01    0.00    0.00   69.57    0.00    0.00    0.00   20.42
> >
> > The numa cores are not at 100% util, you have around 20% of idle on
> > each one.
>
> Yes - not 100% cpu - but the difference between 80% and 100% is like
> pushing an additional 1-2Gbit/s.
>
Yes, but it doesn't look like the bottleneck is the cpu, although it
is close to being one :)..
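
If the cpus do become the limit, it might be worth a quick profile of
the busy cores to see what dominates the ~60-70% softirq time (fib
lookup, memory allocation, checksumming, etc.). A rough sketch, with
the core list taken from the mpstat output:

$ perf record -C 17-27,42-55 -g -- sleep 10
$ perf report --stdio | head -40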
> >
> > > Average:      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      29    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      32    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      33    0.00    0.00    3.90    0.00    0.00    0.00    0.00    0.00    0.00   96.10
> > > Average:      34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      35    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      36    0.10    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.70
> > > Average:      37    0.20    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   99.50
> > > Average:      38    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> > > Average:      39    0.00    0.00    2.60    0.00    0.00    0.00    0.00    0.00    0.00   97.40
> > > Average:      40    0.00    0.00    0.90    0.00    0.00    0.00    0.00    0.00    0.00   99.10
> > > Average:      41    0.10    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   99.40
> > > Average:      42    0.00    0.00    9.91    0.00    0.00   70.67    0.00    0.00    0.00   19.42
> > > Average:      43    0.00    0.00   15.90    0.00    0.00   57.50    0.00    0.00    0.00   26.60
> > > Average:      44    0.00    0.00   12.20    0.00    0.00   66.20    0.00    0.00    0.00   21.60
> > > Average:      45    0.00    0.00   12.00    0.00    0.00   67.50    0.00    0.00    0.00   20.50
> > > Average:      46    0.00    0.00   12.90    0.00    0.00   65.50    0.00    0.00    0.00   21.60
> > > Average:      47    0.00    0.00   14.59    0.00    0.00   60.84    0.00    0.00    0.00   24.58
> > > Average:      48    0.00    0.00   13.59    0.00    0.00   61.74    0.00    0.00    0.00   24.68
> > > Average:      49    0.00    0.00   18.36    0.00    0.00   53.29    0.00    0.00    0.00   28.34
> > > Average:      50    0.00    0.00   15.32    0.00    0.00   58.86    0.00    0.00    0.00   25.83
> > > Average:      51    0.00    0.00   17.60    0.00    0.00   55.20    0.00    0.00    0.00   27.20
> > > Average:      52    0.00    0.00   15.92    0.00    0.00   56.06    0.00    0.00    0.00   28.03
> > > Average:      53    0.00    0.00   13.00    0.00    0.00   62.30    0.00    0.00    0.00   24.70
> > > Average:      54    0.00    0.00   13.20    0.00    0.00   61.50    0.00    0.00    0.00   25.30
> > > Average:      55    0.00    0.00   14.59    0.00    0.00   58.64    0.00    0.00    0.00   26.77
> > >
> > >
> > > ethtool -k enp175s0f0
> > > Features for enp175s0f0:
> > > rx-checksumming: on
> > > tx-checksumming: on
> > > tx-checksum-ipv4: on
> > > tx-checksum-ip-generic: off [fixed]
> > > tx-checksum-ipv6: on
> > > tx-checksum-fcoe-crc: off [fixed]
> > > tx-checksum-sctp: off [fixed]
> > > scatter-gather: on
> > > tx-scatter-gather: on
> > > tx-scatter-gather-fraglist: off [fixed]
> > > tcp-segmentation-offload: on
> > > tx-tcp-segmentation: on
> > > tx-tcp-ecn-segmentation: off [fixed]
> > > tx-tcp-mangleid-segmentation: off
> > > tx-tcp6-segmentation: on
> > > udp-fragmentation-offload: off
> > > generic-segmentation-offload: on
> > > generic-receive-offload: on
> > > large-receive-offload: off [fixed]
> > > rx-vlan-offload: on
> > > tx-vlan-offload: on
> > > ntuple-filters: off
> > > receive-hashing: on
> > > highdma: on [fixed]
> > > rx-vlan-filter: on
> > > vlan-challenged: off [fixed]
> > > tx-lockless: off [fixed]
> > > netns-local: off [fixed]
> > > tx-gso-robust: off [fixed]
> > > tx-fcoe-segmentation: off [fixed]
> > > tx-gre-segmentation: on
> > > tx-gre-csum-segmentation: on
> > > tx-ipxip4-segmentation: off [fixed]
> > > tx-ipxip6-segmentation: off [fixed]
> > > tx-udp_tnl-segmentation: on
> > > tx-udp_tnl-csum-segmentation: on
> > > tx-gso-partial: on
> > > tx-sctp-segmentation: off [fixed]
> > > tx-esp-segmentation: off [fixed]
> > > tx-udp-segmentation: on
> > > fcoe-mtu: off [fixed]
> > > tx-nocache-copy: off
> > > loopback: off [fixed]
> > > rx-fcs: off
> > > rx-all: off
> > > tx-vlan-stag-hw-insert: on
> > > rx-vlan-stag-hw-parse: off [fixed]
> > > rx-vlan-stag-filter: on [fixed]
> > > l2-fwd-offload: off [fixed]
> > > hw-tc-offload: off
> > > esp-hw-offload: off [fixed]
> > > esp-tx-csum-hw-offload: off [fixed]
> > > rx-udp_tunnel-port-offload: on
> > > tls-hw-tx-offload: off [fixed]
> > > tls-hw-rx-offload: off [fixed]
> > > rx-gro-hw: off [fixed]
> > > tls-hw-record: off [fixed]
> > >
> > > ethtool -c enp175s0f0
> > > Coalesce parameters for enp175s0f0:
> > > Adaptive RX: off TX: on
> > > stats-block-usecs: 0
> > > sample-interval: 0
> > > pkt-rate-low: 0
> > > pkt-rate-high: 0
> > > dmac: 32703
> > >
> > > rx-usecs: 256
> > > rx-frames: 128
> > > rx-usecs-irq: 0
> > > rx-frames-irq: 0
> > >
> > > tx-usecs: 8
> > > tx-frames: 128
> > > tx-usecs-irq: 0
> > > tx-frames-irq: 0
> > >
> > > rx-usecs-low: 0
> > > rx-frame-low: 0
> > > tx-usecs-low: 0
> > > tx-frame-low: 0
> > >
> > > rx-usecs-high: 0
> > > rx-frame-high: 0
> > > tx-usecs-high: 0
> > > tx-frame-high: 0
> > >
> > > ethtool -g enp175s0f0
> > > Ring parameters for enp175s0f0:
> > > Pre-set maximums:
> > > RX: 8192
> > > RX Mini: 0
> > > RX Jumbo: 0
> > > TX: 8192
> > > Current hardware settings:
> > > RX: 4096
> > > RX Mini: 0
> > > RX Jumbo: 0
> > > TX: 4096
> > >
> > >
> > >
> > >
> > >
> > >
>
> Also changed the coalesce params a little - and the best for this config are:
> ethtool -c enp175s0f0
> Coalesce parameters for enp175s0f0:
> Adaptive RX: off TX: off
> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
> dmac: 32573
>
> rx-usecs: 40
> rx-frames: 128
> rx-usecs-irq: 0
> rx-frames-irq: 0
>
> tx-usecs: 8
> tx-frames: 8
> tx-usecs-irq: 0
> tx-frames-irq: 0
>
> rx-usecs-low: 0
> rx-frame-low: 0
> tx-usecs-low: 0
> tx-frame-low: 0
>
> rx-usecs-high: 0
> rx-frame-high: 0
> tx-usecs-high: 0
> tx-frame-high: 0
>
>
> Fewer drops on the RX side - and more pps forwarded overall.
>
How much improvement? Maybe we can improve our adaptive rx coalescing
to be efficient for this workload.
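
For anyone wanting to try the same settings, they should be reachable
with something like this (a sketch only - repeat for both ports, and
compare rx_discards_phy / rx_out_of_buffer and the pps counters before
and after):

$ ethtool -C enp175s0f0 adaptive-rx off adaptive-tx off \
          rx-usecs 40 rx-frames 128 tx-usecs 8 tx-frames 8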