Message-ID: <162e25c6-dae2-7e1e-75f0-9c5b22453495@itcare.pl>
Date: Thu, 8 Nov 2018 20:12:44 +0100
From: Paweł Staszewski <pstaszewski@...are.pl>
To: Saeed Mahameed <saeedm@...lanox.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal users
traffic
On 03.11.2018 at 01:18, Paweł Staszewski wrote:
>
>
> On 01.11.2018 at 21:37, Saeed Mahameed wrote:
>> On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:
>>> W dniu 01.11.2018 o 10:50, Saeed Mahameed pisze:
>>>> On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:
>>>>> Hi
>>>>>
>>>>> So maybe someone will be interested in how the Linux kernel
>>>>> handles normal traffic (not pktgen :) )
>>>>>
>>>>>
>>>>> Server HW configuration:
>>>>>
>>>>> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>>>>>
>>>>> NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
>>>>>
>>>>>
>>>>> Server software:
>>>>>
>>>>> FRR - as routing daemon
>>>>>
>>>>> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound
>>>>> to the local NUMA node)
>>>>>
>>>>> enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to
>>>>> the local NUMA node)
>>>>>
>>>>>
>>>>> Maximum traffic that server can handle:
>>>>>
>>>>> Bandwidth
>>>>>
>>>>> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>>> input: /proc/net/dev type: rate
>>>>> \ iface                 Rx          Tx        Total
>>>>> =====================================================
>>>>> enp175s0f1:     28.51 Gb/s  37.24 Gb/s   65.74 Gb/s
>>>>> enp175s0f0:     38.07 Gb/s  28.44 Gb/s   66.51 Gb/s
>>>>> -----------------------------------------------------
>>>>> total:          66.58 Gb/s  65.67 Gb/s  132.25 Gb/s
>>>>>
>>>>>
>>>>> Packets per second:
>>>>>
>>>>> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>>> input: /proc/net/dev type: rate
>>>>> - iface                     Rx               Tx             Total
>>>>> ==================================================================
>>>>> enp175s0f1:     5248589.00 P/s   3486617.75 P/s   8735207.00 P/s
>>>>> enp175s0f0:     3557944.25 P/s   5232516.00 P/s   8790460.00 P/s
>>>>> ------------------------------------------------------------------
>>>>> total:          8806533.00 P/s   8719134.00 P/s  17525668.00 P/s
>>>>>
>>>>>
>>>>> After reaching those limits, the NICs on the upstream side (more
>>>>> RX traffic) start to drop packets.
>>>>>
>>>>> I just don't understand why the server can't handle more bandwidth
>>>>> (~40 Gbit/s is the limit where all CPUs are at 100% utilization)
>>>>> while pps on the RX side keeps increasing.
>>>>>
>>>> Where do you see 40 Gb/s? You showed that both ports on the same
>>>> NIC (same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) =
>>>> 132.25 Gb/s, which aligns with your pcie link limit. What am I
>>>> missing?
>>> Hmm, yes, that was my concern also - I can't find information
>>> anywhere on whether that bandwidth is uni- or bidirectional. If
>>> 126 Gbit for x16 8GT is unidirectional, then bidirectional would be
>>> 126/2 ~ 68 Gbit, which would fit the total bw on both ports.
>> I think it is bidir.
> So yes - I think we are hitting another problem there. The PCIe max
> bw of 126 Gbit is most probably per direction, so RX can do 126 Gbit
> and at the same time TX should do 126 Gbit.
>
>
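For what it's worth, the numbers being debated above can be sketched out. PCIe 3.0 is nominally full duplex, with this budget available in each direction; the ~24 bytes of TLP/DLLP framing per 256-byte payload used below is an assumed ballpark, not a measured figure:

```shell
# PCIe Gen 3.0 x16 back-of-envelope, per direction.
awk 'BEGIN {
    raw = 8 * 16                  # 8 GT/s per lane * 16 lanes = 128 Gb/s
    enc = raw * 128 / 130         # 128b/130b line encoding -> ~126.03 Gb/s
    eff = enc * 256 / (256 + 24)  # assumed ~24B framing per 256B payload
    printf "encoded: %.2f Gb/s  est. effective: %.2f Gb/s\n", enc, eff
}'
```

With that overhead assumption the practical ceiling per direction lands around 115 Gb/s, before counting descriptor and completion traffic.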
So the single 2-port 100G ConnectX-4 card was replaced with two
separate ConnectX-5 cards placed in two different PCIe x16 gen 3.0
slots:
lspci -vvv -s af:00.0
af:00.0 Ethernet controller: Mellanox Technologies MT27800 Family
[ConnectX-5]
Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 90
NUMA node: 1
Region 0: Memory at 39bffe000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at ee600000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+
AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not
supported, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+,
LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-,
LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance-
SpeedDis-
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB,
EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+,
LinkEqualizationRequest-
Capabilities: [48] Vital Product Data
Product Name: CX515A - ConnectX-5 QSFP28
Read-only fields:
[PN] Part number: MCX515A-CCAT
[EC] Engineering changes: A6
[V2] Vendor specific: MCX515A-CCAT
[SN] Serial number: MT1831J00221
[V3] Vendor specific:
14a5c73bee92e811800098039b1ee5f0
[VA] Vendor specific:
MLX:MODL=CX515A:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0
[V0] Vendor specific: PCIeGen3 x16
[RV] Reserved: checksum good, 2 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
AERCap: First Error Pointer: 04, GenCap+ CGenEn-
ChkCap+ ChkEn-
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 0, Total VFs: 0, Number of VFs: 0,
Function Dependency Link: 00
VF offset: 1, stride: 1, Device ID: 1018
Supported Page Size: 000007ff, System Page Size: 00000001
Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [1c0 v1] #19
Kernel driver in use: mlx5_core
d8:00.0 Ethernet controller: Mellanox Technologies MT27800 Family
[ConnectX-5]
Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 159
NUMA node: 1
Region 0: Memory at 39fffe000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at fbe00000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+
AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not
supported, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+,
LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-,
LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance-
SpeedDis-
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB,
EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+,
LinkEqualizationRequest-
Capabilities: [48] Vital Product Data
Product Name: CX515A - ConnectX-5 QSFP28
Read-only fields:
[PN] Part number: MCX515A-CCAT
[EC] Engineering changes: A6
[V2] Vendor specific: MCX515A-CCAT
[SN] Serial number: MT1831J00169
[V3] Vendor specific:
c06757e6e092e811800098039b1ee520
[VA] Vendor specific:
MLX:MODL=CX515A:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0
[V0] Vendor specific: PCIeGen3 x16
[RV] Reserved: checksum good, 2 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
AERCap: First Error Pointer: 04, GenCap+ CGenEn-
ChkCap+ ChkEn-
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 0, Total VFs: 0, Number of VFs: 0,
Function Dependency Link: 00
VF offset: 1, stride: 1, Device ID: 1018
Supported Page Size: 000007ff, System Page Size: 00000001
Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [1c0 v1] #19
Kernel driver in use: mlx5_core
CPU load is lower than with the ConnectX-4 - but it looks like the
bandwidth limit is the same :)
But also, after reaching 60Gbit/60Gbit:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
- iface               Rx          Tx        Total
===================================================
enp175s0:     45.09 Gb/s  15.09 Gb/s   60.18 Gb/s
enp216s0:     15.14 Gb/s  45.19 Gb/s   60.33 Gb/s
---------------------------------------------------
total:        60.45 Gb/s  60.48 Gb/s  120.93 Gb/s
the NICs start to drop packets (discards on the NIC that receives more
RX traffic):
ethtool -S enp175s0 |grep 'disc'
rx_discards_phy: 47265611
after 20 secs:
ethtool -S enp175s0 |grep 'disc'
rx_discards_phy: 49434472
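Those two samples put the drop rate around 108 kpps; a quick sketch of the arithmetic (counter values copied from the output above, assuming exactly 20 s between readings):

```shell
# rx_discards_phy delta over the ~20 s between the two readings above.
awk 'BEGIN {
    d = 49434472 - 47265611   # packets discarded between the samples
    printf "%.0f drops/s\n", d / 20
}'
```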
Current coalescing params:
ethtool -c enp175s0
Coalesce parameters for enp175s0:
Adaptive RX: off TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32651
rx-usecs: 128
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
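For reference, parameters like these are set at runtime with `ethtool -C`; a sketch mirroring the values shown above (not a recommendation, and subject to what the mlx5 driver accepts):

```shell
# Re-apply the coalescing values printed above on enp175s0 (needs root).
ethtool -C enp175s0 adaptive-rx off rx-usecs 128 rx-frames 128
ethtool -C enp175s0 adaptive-tx on  tx-usecs 8   tx-frames 128
# Confirm the result:
ethtool -c enp175s0
```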
And perf top:
   PerfTop:   86898 irqs/sec  kernel:99.5%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
12.76% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear
8.68% [kernel] [k] mlx5e_sq_xmit
6.47% [kernel] [k] build_skb
4.78% [kernel] [k] fib_table_lookup
4.58% [kernel] [k] memcpy_erms
3.47% [kernel] [k] mlx5e_poll_rx_cq
2.59% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq
2.37% [kernel] [k] mlx5e_post_rx_mpwqes
2.33% [kernel] [k] vlan_do_receive
1.94% [kernel] [k] __dev_queue_xmit
1.89% [kernel] [k] mlx5e_poll_tx_cq
1.74% [kernel] [k] ip_finish_output2
1.67% [kernel] [k] dev_gro_receive
1.64% [kernel] [k] ipt_do_table
1.58% [kernel] [k] tcp_gro_receive
1.49% [kernel] [k] pfifo_fast_dequeue
1.28% [kernel] [k] mlx5_eq_int
1.26% [kernel] [k] inet_gro_receive
1.26% [kernel] [k] _raw_spin_lock
1.20% [kernel] [k] __netif_receive_skb_core
1.19% [kernel] [k] irq_entries_start
1.17% [kernel] [k] swiotlb_map_page
1.13% [kernel] [k] vlan_dev_hard_start_xmit
1.12% [kernel] [k] ip_route_input_rcu
0.97% [kernel] [k] __build_skb
0.84% [kernel] [k] _raw_spin_lock_irqsave
0.78% [kernel] [k] kmem_cache_alloc
0.77% [kernel] [k] mlx5e_xmit
0.77% [kernel] [k] dev_hard_start_xmit
0.76% [kernel] [k] ip_forward
0.73% [kernel] [k] netif_skb_features
0.70% [kernel] [k] tasklet_action_common.isra.21
0.58% [kernel] [k] validate_xmit_skb.isra.142
0.55% [kernel] [k] ip_rcv_core.isra.20.constprop.25
0.55% [kernel] [k] mlx5e_page_release
0.55% [kernel] [k] __qdisc_run
0.51% [kernel] [k] __memcpy
0.48% [kernel] [k] kmem_cache_free_bulk
0.48% [kernel] [k] page_frag_free
0.47% [kernel] [k] inet_lookup_ifaddr_rcu
0.47% [kernel] [k] queued_spin_lock_slowpath
0.46% [kernel] [k] pfifo_fast_enqueue
0.43% [kernel] [k] tcp4_gro_receive
0.40% [kernel] [k] skb_gro_receive
0.39% [kernel] [k] skb_release_data
0.38% [kernel] [k] find_busiest_group
0.36% [kernel] [k] _raw_spin_trylock
0.36% [kernel] [k] skb_segment
0.33% [kernel] [k] eth_type_trans
0.32% [kernel] [k] __sched_text_start
0.32% [kernel] [k] __netif_schedule
0.32% [kernel] [k] try_to_wake_up
0.31% [kernel] [k] _raw_spin_lock_irq
0.31% [kernel] [k] __local_bh_enable_ip
Also mpstat:
Average:  CPU   %usr  %nice   %sys %iowait  %irq  %soft %steal %guest %gnice  %idle
Average:  all   0.06   0.00   1.00   0.02   0.00  21.61   0.00   0.00   0.00  77.32
Average:    0   0.00   0.00   0.60   0.00   0.00   0.00   0.00   0.00   0.00  99.40
Average:    1   0.10   0.00   1.30   0.00   0.00   0.00   0.00   0.00   0.00  98.60
Average:    2   0.00   0.00   0.20   0.00   0.00   0.00   0.00   0.00   0.00  99.80
Average:    3   0.00   0.00   1.60   0.00   0.00   0.00   0.00   0.00   0.00  98.40
Average:    4   0.00   0.00   1.00   0.00   0.00   0.00   0.00   0.00   0.00  99.00
Average:    5   0.20   0.00   4.60   0.00   0.00   0.00   0.00   0.00   0.00  95.20
Average:    6   0.00   0.00   0.20   0.00   0.00   0.00   0.00   0.00   0.00  99.80
Average:    7   0.60   0.00   3.00   0.00   0.00   0.00   0.00   0.00   0.00  96.40
Average:    8   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:    9   0.70   0.00   0.30   0.00   0.00   0.00   0.00   0.00   0.00  99.00
Average:   10   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   11   0.00   0.00   2.00   0.00   0.00   0.00   0.00   0.00   0.00  98.00
Average:   12   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   13   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   14   0.00   0.00   1.00   0.00   0.00  50.40   0.00   0.00   0.00  48.60
Average:   15   0.00   0.00   1.30   0.00   0.00  47.90   0.00   0.00   0.00  50.80
Average:   16   0.00   0.00   2.00   0.00   0.00  47.80   0.00   0.00   0.00  50.20
Average:   17   0.00   0.00   1.30   0.00   0.00  50.20   0.00   0.00   0.00  48.50
Average:   18   0.10   0.00   1.10   0.00   0.00  42.40   0.00   0.00   0.00  56.40
Average:   19   0.00   0.00   1.50   0.00   0.00  44.40   0.00   0.00   0.00  54.10
Average:   20   0.00   0.00   1.40   0.00   0.00  45.90   0.00   0.00   0.00  52.70
Average:   21   0.00   0.00   0.70   0.00   0.00  44.50   0.00   0.00   0.00  54.80
Average:   22   0.10   0.00   1.40   0.00   0.00  47.00   0.00   0.00   0.00  51.50
Average:   23   0.00   0.00   0.30   0.00   0.00  45.50   0.00   0.00   0.00  54.20
Average:   24   0.00   0.00   1.60   0.00   0.00  50.00   0.00   0.00   0.00  48.40
Average:   25   0.10   0.00   0.70   0.00   0.00  47.00   0.00   0.00   0.00  52.20
Average:   26   0.00   0.00   1.80   0.00   0.00  48.70   0.00   0.00   0.00  49.50
Average:   27   0.00   0.00   1.10   0.00   0.00  44.80   0.00   0.00   0.00  54.10
Average:   28   0.30   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00  99.70
Average:   29   0.10   0.00   0.60   0.00   0.00   0.00   0.00   0.00   0.00  99.30
Average:   30   0.00   0.00   0.20   0.00   0.00   0.00   0.00   0.00   0.00  99.80
Average:   31   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   32   0.00   0.00   1.20   0.00   0.00   0.00   0.00   0.00   0.00  98.80
Average:   33   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   34   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   35   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   36   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   37   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   38   0.20   0.00   0.80   0.00   0.00   0.00   0.00   0.00   0.00  99.00
Average:   39   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   40   0.00   0.00   3.30   0.00   0.00   0.00   0.00   0.00   0.00  96.70
Average:   41   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00 100.00
Average:   42   0.00   0.00   0.80   0.00   0.00  45.00   0.00   0.00   0.00  54.20
Average:   43   0.00   0.00   1.60   0.00   0.00  48.30   0.00   0.00   0.00  50.10
Average:   44   0.00   0.00   1.60   0.00   0.00  37.90   0.00   0.00   0.00  60.50
Average:   45   0.30   0.00   1.40   0.00   0.00  32.90   0.00   0.00   0.00  65.40
Average:   46   0.00   0.00   1.50   0.90   0.00  37.60   0.00   0.00   0.00  60.00
Average:   47   0.10   0.00   0.40   0.00   0.00  41.40   0.00   0.00   0.00  58.10
Average:   48   0.20   0.00   1.70   0.00   0.00  38.20   0.00   0.00   0.00  59.90
Average:   49   0.00   0.00   1.40   0.00   0.00  37.20   0.00   0.00   0.00  61.40
Average:   50   0.00   0.00   1.30   0.00   0.00  38.10   0.00   0.00   0.00  60.60
Average:   51   0.00   0.00   0.80   0.00   0.00  39.40   0.00   0.00   0.00  59.80
Average:   52   0.00   0.00   1.70   0.00   0.00  39.50   0.00   0.00   0.00  58.80
Average:   53   0.10   0.00   0.90   0.00   0.00  38.20   0.00   0.00   0.00  60.80
Average:   54   0.00   0.00   1.30   0.00   0.00  42.10   0.00   0.00   0.00  56.60
Average:   55   0.00   0.00   1.60   0.00   0.00  37.70   0.00   0.00   0.00  60.70
So it looks like the PCIe x16 link was probably not the limiting factor
previously either.
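One way to sanity-check that: at the ~60 Gb/s plateau each x16 link sits well below its ~126 Gb/s per-direction encoded rate (a rough sketch using the enp175s0 figures from the bwm-ng snapshot above):

```shell
# Per-direction PCIe link utilization for enp175s0 at the plateau.
awk 'BEGIN {
    link = 8 * 16 * 128 / 130   # ~126.03 Gb/s per direction, Gen3 x16
    rx = 45.09; tx = 15.09      # Gb/s, from the bwm-ng snapshot above
    printf "rx: %.1f%%  tx: %.1f%% of the link\n", 100*rx/link, 100*tx/link
}'
```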
>
>
>
>>> Maybe this can also explain why cpu load rises rapidly from
>>> 120 Gbit/s in total to 132 Gbit (the bwm-ng counters come from
>>> /proc/net - so there can be some error in reading them when
>>> offloading (gro/gso/tso) is enabled on the NICs).
>>>
>>>>> I was thinking that maybe I had reached some pcie x16 limit - but
>>>>> x16 8GT is 126 Gbit - and also when testing with pktgen I can
>>>>> reach more bw and pps (like 4x more compared to normal internet
>>>>> traffic).
>>>>>
>>>> Are you forwarding when using pktgen as well, or are you just
>>>> testing the RX-side pps?
>>> Yes, pktgen was tested on a single port, RX only.
>>> I can also check forwarding, to rule out pcie limits.
>>>
>> So this explains why you have more RX pps: since tx is idle, the
>> pcie link is free to do only rx.
>>
>> [...]
>>
>>
>>>>> ethtool -S enp175s0f1
>>>>> NIC statistics:
>>>>> rx_packets: 173730800927
>>>>> rx_bytes: 99827422751332
>>>>> tx_packets: 142532009512
>>>>> tx_bytes: 184633045911222
>>>>> tx_tso_packets: 25989113891
>>>>> tx_tso_bytes: 132933363384458
>>>>> tx_tso_inner_packets: 0
>>>>> tx_tso_inner_bytes: 0
>>>>> tx_added_vlan_packets: 74630239613
>>>>> tx_nop: 2029817748
>>>>> rx_lro_packets: 0
>>>>> rx_lro_bytes: 0
>>>>> rx_ecn_mark: 0
>>>>> rx_removed_vlan_packets: 173730800927
>>>>> rx_csum_unnecessary: 0
>>>>> rx_csum_none: 434357
>>>>> rx_csum_complete: 173730366570
>>>>> rx_csum_unnecessary_inner: 0
>>>>> rx_xdp_drop: 0
>>>>> rx_xdp_redirect: 0
>>>>> rx_xdp_tx_xmit: 0
>>>>> rx_xdp_tx_full: 0
>>>>> rx_xdp_tx_err: 0
>>>>> rx_xdp_tx_cqe: 0
>>>>> tx_csum_none: 38260960853
>>>>> tx_csum_partial: 36369278774
>>>>> tx_csum_partial_inner: 0
>>>>> tx_queue_stopped: 1
>>>>> tx_queue_dropped: 0
>>>>> tx_xmit_more: 748638099
>>>>> tx_recover: 0
>>>>> tx_cqes: 73881645031
>>>>> tx_queue_wake: 1
>>>>> tx_udp_seg_rem: 0
>>>>> tx_cqe_err: 0
>>>>> tx_xdp_xmit: 0
>>>>> tx_xdp_full: 0
>>>>> tx_xdp_err: 0
>>>>> tx_xdp_cqes: 0
>>>>> rx_wqe_err: 0
>>>>> rx_mpwqe_filler_cqes: 0
>>>>> rx_mpwqe_filler_strides: 0
>>>>> rx_buff_alloc_err: 0
>>>>> rx_cqe_compress_blks: 0
>>>>> rx_cqe_compress_pkts: 0
>>>> If this is a pcie bottleneck, it might be useful to enable CQE
>>>> compression (to reduce PCIe completion descriptor transactions);
>>>> you should see rx_cqe_compress_pkts above increasing when
>>>> enabled.
>>>>
>>>> $ ethtool --set-priv-flags enp175s0f1 rx_cqe_compress on
>>>> $ ethtool --show-priv-flags enp175s0f1
>>>> Private flags for p6p1:
>>>> rx_cqe_moder : on
>>>> cqe_moder : off
>>>> rx_cqe_compress : on
>>>> ...
>>>>
>>>> try this on both interfaces.
>>> Done
>>> ethtool --show-priv-flags enp175s0f1
>>> Private flags for enp175s0f1:
>>> rx_cqe_moder : on
>>> tx_cqe_moder : off
>>> rx_cqe_compress : on
>>> rx_striding_rq : off
>>> rx_no_csum_complete: off
>>>
>>> ethtool --show-priv-flags enp175s0f0
>>> Private flags for enp175s0f0:
>>> rx_cqe_moder : on
>>> tx_cqe_moder : off
>>> rx_cqe_compress : on
>>> rx_striding_rq : off
>>> rx_no_csum_complete: off
>>>
>> Did it help reduce the load on the pcie? Do you see more pps?
>> What is the ratio between rx_cqe_compress_pkts and overall rx
>> packets?
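That ratio can be pulled straight from the counters; a sketch parsing `ethtool -S`-style output (counter names as used by mlx5 elsewhere in this thread; the demo values are made up):

```shell
# Share of received packets that arrived via compressed CQEs.
ratio() {    # reads `ethtool -S <iface>` output on stdin
    awk '/ rx_packets:/           { t = $2 }
         / rx_cqe_compress_pkts:/ { c = $2 }
         END { printf "%.1f%% of packets arrived compressed\n", 100*c/t }'
}
# Live usage would be:  ethtool -S enp175s0f1 | ratio
# Demo with made-up counter values:
printf ' rx_packets: 1000000\n rx_cqe_compress_pkts: 250000\n' | ratio
```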
>>
>> [...]
>>
>>>>> ethtool -S enp175s0f0
>>>>> NIC statistics:
>>>>> rx_packets: 141574897253
>>>>> rx_bytes: 184445040406258
>>>>> tx_packets: 172569543894
>>>>> tx_bytes: 99486882076365
>>>>> tx_tso_packets: 9367664195
>>>>> tx_tso_bytes: 56435233992948
>>>>> tx_tso_inner_packets: 0
>>>>> tx_tso_inner_bytes: 0
>>>>> tx_added_vlan_packets: 141297671626
>>>>> tx_nop: 2102916272
>>>>> rx_lro_packets: 0
>>>>> rx_lro_bytes: 0
>>>>> rx_ecn_mark: 0
>>>>> rx_removed_vlan_packets: 141574897252
>>>>> rx_csum_unnecessary: 0
>>>>> rx_csum_none: 23135854
>>>>> rx_csum_complete: 141551761398
>>>>> rx_csum_unnecessary_inner: 0
>>>>> rx_xdp_drop: 0
>>>>> rx_xdp_redirect: 0
>>>>> rx_xdp_tx_xmit: 0
>>>>> rx_xdp_tx_full: 0
>>>>> rx_xdp_tx_err: 0
>>>>> rx_xdp_tx_cqe: 0
>>>>> tx_csum_none: 127934791664
>>>> It is a good idea to look into this: tx is not requesting hw tx
>>>> csumming for a lot of packets, so maybe you are wasting a lot of
>>>> cpu on calculating csums - or maybe this is just the rx csum
>>>> complete..
>>>>
>>>>> tx_csum_partial: 13362879974
>>>>> tx_csum_partial_inner: 0
>>>>> tx_queue_stopped: 232561
>>>> TX queues are stalling, which could be an indication of the pcie
>>>> bottleneck.
>>>>
>>>>> tx_queue_dropped: 0
>>>>> tx_xmit_more: 1266021946
>>>>> tx_recover: 0
>>>>> tx_cqes: 140031716469
>>>>> tx_queue_wake: 232561
>>>>> tx_udp_seg_rem: 0
>>>>> tx_cqe_err: 0
>>>>> tx_xdp_xmit: 0
>>>>> tx_xdp_full: 0
>>>>> tx_xdp_err: 0
>>>>> tx_xdp_cqes: 0
>>>>> rx_wqe_err: 0
>>>>> rx_mpwqe_filler_cqes: 0
>>>>> rx_mpwqe_filler_strides: 0
>>>>> rx_buff_alloc_err: 0
>>>>> rx_cqe_compress_blks: 0
>>>>> rx_cqe_compress_pkts: 0
>>>>> rx_page_reuse: 0
>>>>> rx_cache_reuse: 16625975793
>>>>> rx_cache_full: 54161465914
>>>>> rx_cache_empty: 258048
>>>>> rx_cache_busy: 54161472735
>>>>> rx_cache_waive: 0
>>>>> rx_congst_umr: 0
>>>>> rx_arfs_err: 0
>>>>> ch_events: 40572621887
>>>>> ch_poll: 40885650979
>>>>> ch_arm: 40429276692
>>>>> ch_aff_change: 0
>>>>> ch_eq_rearm: 0
>>>>> rx_out_of_buffer: 2791690
>>>>> rx_if_down_packets: 74
>>>>> rx_vport_unicast_packets: 141843476308
>>>>> rx_vport_unicast_bytes: 185421265403318
>>>>> tx_vport_unicast_packets: 172569484005
>>>>> tx_vport_unicast_bytes: 100019940094298
>>>>> rx_vport_multicast_packets: 85122935
>>>>> rx_vport_multicast_bytes: 5761316431
>>>>> tx_vport_multicast_packets: 6452
>>>>> tx_vport_multicast_bytes: 643540
>>>>> rx_vport_broadcast_packets: 22423624
>>>>> rx_vport_broadcast_bytes: 1390127090
>>>>> tx_vport_broadcast_packets: 22024
>>>>> tx_vport_broadcast_bytes: 1321440
>>>>> rx_vport_rdma_unicast_packets: 0
>>>>> rx_vport_rdma_unicast_bytes: 0
>>>>> tx_vport_rdma_unicast_packets: 0
>>>>> tx_vport_rdma_unicast_bytes: 0
>>>>> rx_vport_rdma_multicast_packets: 0
>>>>> rx_vport_rdma_multicast_bytes: 0
>>>>> tx_vport_rdma_multicast_packets: 0
>>>>> tx_vport_rdma_multicast_bytes: 0
>>>>> tx_packets_phy: 172569501577
>>>>> rx_packets_phy: 142871314588
>>>>> rx_crc_errors_phy: 0
>>>>> tx_bytes_phy: 100710212814151
>>>>> rx_bytes_phy: 187209224289564
>>>>> tx_multicast_phy: 6452
>>>>> tx_broadcast_phy: 22024
>>>>> rx_multicast_phy: 85122933
>>>>> rx_broadcast_phy: 22423623
>>>>> rx_in_range_len_errors_phy: 2
>>>>> rx_out_of_range_len_phy: 0
>>>>> rx_oversize_pkts_phy: 0
>>>>> rx_symbol_err_phy: 0
>>>>> tx_mac_control_phy: 0
>>>>> rx_mac_control_phy: 0
>>>>> rx_unsupported_op_phy: 0
>>>>> rx_pause_ctrl_phy: 0
>>>>> tx_pause_ctrl_phy: 0
>>>>> rx_discards_phy: 920161423
>>>> Ok, this port seems to be suffering more; RX is congested, maybe
>>>> due to the pcie bottleneck.
>>> Yes, this side is receiving more traffic - the second port has ~10G more tx
>>>
>> [...]
>>
>>
>>>>> Average:   17   0.00  0.00  16.60  0.00  0.00  52.10  0.00  0.00  0.00  31.30
>>>>> Average:   18   0.00  0.00  13.90  0.00  0.00  61.20  0.00  0.00  0.00  24.90
>>>>> Average:   19   0.00  0.00   9.99  0.00  0.00  70.33  0.00  0.00  0.00  19.68
>>>>> Average:   20   0.00  0.00   9.00  0.00  0.00  73.00  0.00  0.00  0.00  18.00
>>>>> Average:   21   0.00  0.00   8.70  0.00  0.00  73.90  0.00  0.00  0.00  17.40
>>>>> Average:   22   0.00  0.00  15.42  0.00  0.00  58.56  0.00  0.00  0.00  26.03
>>>>> Average:   23   0.00  0.00  10.81  0.00  0.00  71.67  0.00  0.00  0.00  17.52
>>>>> Average:   24   0.00  0.00  10.00  0.00  0.00  71.80  0.00  0.00  0.00  18.20
>>>>> Average:   25   0.00  0.00  11.19  0.00  0.00  71.13  0.00  0.00  0.00  17.68
>>>>> Average:   26   0.00  0.00  11.00  0.00  0.00  70.80  0.00  0.00  0.00  18.20
>>>>> Average:   27   0.00  0.00  10.01  0.00  0.00  69.57  0.00  0.00  0.00  20.42
>>>> The NUMA-local cores are not at 100% util; you have around 20%
>>>> idle on each one.
>>> Yes - not 100% cpu - but the difference between 80% and 100% is
>>> like pushing an additional 1-2 Gbit/s
>>>
>> Yes, but it doesn't look like the bottleneck is the cpu, although
>> it is close to being one :)..
>>
>>>>> Average:   28   0.00  0.00   0.00  0.00  0.00   0.00  0.00  0.00  0.00 100.00
>>>>> Average:   29   0.00  0.00   0.00  0.00  0.00   0.00  0.00  0.00  0.00 100.00
>>>>> Average:   30   0.00  0.00   0.00  0.00  0.00   0.00  0.00  0.00  0.00 100.00
>>>>> Average:   31   0.00  0.00   0.00  0.00  0.00   0.00  0.00  0.00  0.00 100.00
>>>>> Average:   32   0.00  0.00   0.00  0.00  0.00   0.00  0.00  0.00  0.00 100.00
>>>>> Average:   33   0.00  0.00   3.90  0.00  0.00   0.00  0.00  0.00  0.00  96.10
>>>>> Average:   34   0.00  0.00   0.00  0.00  0.00   0.00  0.00  0.00  0.00 100.00
>>>>> Average:   35   0.00  0.00   0.00  0.00  0.00   0.00  0.00  0.00  0.00 100.00
>>>>> Average:   36   0.10  0.00   0.20  0.00  0.00   0.00  0.00  0.00  0.00  99.70
>>>>> Average:   37   0.20  0.00   0.30  0.00  0.00   0.00  0.00  0.00  0.00  99.50
>>>>> Average:   38   0.00  0.00   0.00  0.00  0.00   0.00  0.00  0.00  0.00 100.00
>>>>> Average:   39   0.00  0.00   2.60  0.00  0.00   0.00  0.00  0.00  0.00  97.40
>>>>> Average:   40   0.00  0.00   0.90  0.00  0.00   0.00  0.00  0.00  0.00  99.10
>>>>> Average:   41   0.10  0.00   0.50  0.00  0.00   0.00  0.00  0.00  0.00  99.40
>>>>> Average:   42   0.00  0.00   9.91  0.00  0.00  70.67  0.00  0.00  0.00  19.42
>>>>> Average:   43   0.00  0.00  15.90  0.00  0.00  57.50  0.00  0.00  0.00  26.60
>>>>> Average:   44   0.00  0.00  12.20  0.00  0.00  66.20  0.00  0.00  0.00  21.60
>>>>> Average:   45   0.00  0.00  12.00  0.00  0.00  67.50  0.00  0.00  0.00  20.50
>>>>> Average:   46   0.00  0.00  12.90  0.00  0.00  65.50  0.00  0.00  0.00  21.60
>>>>> Average:   47   0.00  0.00  14.59  0.00  0.00  60.84  0.00  0.00  0.00  24.58
>>>>> Average:   48   0.00  0.00  13.59  0.00  0.00  61.74  0.00  0.00  0.00  24.68
>>>>> Average:   49   0.00  0.00  18.36  0.00  0.00  53.29  0.00  0.00  0.00  28.34
>>>>> Average:   50   0.00  0.00  15.32  0.00  0.00  58.86  0.00  0.00  0.00  25.83
>>>>> Average:   51   0.00  0.00  17.60  0.00  0.00  55.20  0.00  0.00  0.00  27.20
>>>>> Average:   52   0.00  0.00  15.92  0.00  0.00  56.06  0.00  0.00  0.00  28.03
>>>>> Average:   53   0.00  0.00  13.00  0.00  0.00  62.30  0.00  0.00  0.00  24.70
>>>>> Average:   54   0.00  0.00  13.20  0.00  0.00  61.50  0.00  0.00  0.00  25.30
>>>>> Average:   55   0.00  0.00  14.59  0.00  0.00  58.64  0.00  0.00  0.00  26.77
>>>>>
>>>>>
>>>>> ethtool -k enp175s0f0
>>>>> Features for enp175s0f0:
>>>>> rx-checksumming: on
>>>>> tx-checksumming: on
>>>>> tx-checksum-ipv4: on
>>>>> tx-checksum-ip-generic: off [fixed]
>>>>> tx-checksum-ipv6: on
>>>>> tx-checksum-fcoe-crc: off [fixed]
>>>>> tx-checksum-sctp: off [fixed]
>>>>> scatter-gather: on
>>>>> tx-scatter-gather: on
>>>>> tx-scatter-gather-fraglist: off [fixed]
>>>>> tcp-segmentation-offload: on
>>>>> tx-tcp-segmentation: on
>>>>> tx-tcp-ecn-segmentation: off [fixed]
>>>>> tx-tcp-mangleid-segmentation: off
>>>>> tx-tcp6-segmentation: on
>>>>> udp-fragmentation-offload: off
>>>>> generic-segmentation-offload: on
>>>>> generic-receive-offload: on
>>>>> large-receive-offload: off [fixed]
>>>>> rx-vlan-offload: on
>>>>> tx-vlan-offload: on
>>>>> ntuple-filters: off
>>>>> receive-hashing: on
>>>>> highdma: on [fixed]
>>>>> rx-vlan-filter: on
>>>>> vlan-challenged: off [fixed]
>>>>> tx-lockless: off [fixed]
>>>>> netns-local: off [fixed]
>>>>> tx-gso-robust: off [fixed]
>>>>> tx-fcoe-segmentation: off [fixed]
>>>>> tx-gre-segmentation: on
>>>>> tx-gre-csum-segmentation: on
>>>>> tx-ipxip4-segmentation: off [fixed]
>>>>> tx-ipxip6-segmentation: off [fixed]
>>>>> tx-udp_tnl-segmentation: on
>>>>> tx-udp_tnl-csum-segmentation: on
>>>>> tx-gso-partial: on
>>>>> tx-sctp-segmentation: off [fixed]
>>>>> tx-esp-segmentation: off [fixed]
>>>>> tx-udp-segmentation: on
>>>>> fcoe-mtu: off [fixed]
>>>>> tx-nocache-copy: off
>>>>> loopback: off [fixed]
>>>>> rx-fcs: off
>>>>> rx-all: off
>>>>> tx-vlan-stag-hw-insert: on
>>>>> rx-vlan-stag-hw-parse: off [fixed]
>>>>> rx-vlan-stag-filter: on [fixed]
>>>>> l2-fwd-offload: off [fixed]
>>>>> hw-tc-offload: off
>>>>> esp-hw-offload: off [fixed]
>>>>> esp-tx-csum-hw-offload: off [fixed]
>>>>> rx-udp_tunnel-port-offload: on
>>>>> tls-hw-tx-offload: off [fixed]
>>>>> tls-hw-rx-offload: off [fixed]
>>>>> rx-gro-hw: off [fixed]
>>>>> tls-hw-record: off [fixed]
>>>>>
>>>>> ethtool -c enp175s0f0
>>>>> Coalesce parameters for enp175s0f0:
>>>>> Adaptive RX: off TX: on
>>>>> stats-block-usecs: 0
>>>>> sample-interval: 0
>>>>> pkt-rate-low: 0
>>>>> pkt-rate-high: 0
>>>>> dmac: 32703
>>>>>
>>>>> rx-usecs: 256
>>>>> rx-frames: 128
>>>>> rx-usecs-irq: 0
>>>>> rx-frames-irq: 0
>>>>>
>>>>> tx-usecs: 8
>>>>> tx-frames: 128
>>>>> tx-usecs-irq: 0
>>>>> tx-frames-irq: 0
>>>>>
>>>>> rx-usecs-low: 0
>>>>> rx-frame-low: 0
>>>>> tx-usecs-low: 0
>>>>> tx-frame-low: 0
>>>>>
>>>>> rx-usecs-high: 0
>>>>> rx-frame-high: 0
>>>>> tx-usecs-high: 0
>>>>> tx-frame-high: 0
>>>>>
>>>>> ethtool -g enp175s0f0
>>>>> Ring parameters for enp175s0f0:
>>>>> Pre-set maximums:
>>>>> RX: 8192
>>>>> RX Mini: 0
>>>>> RX Jumbo: 0
>>>>> TX: 8192
>>>>> Current hardware settings:
>>>>> RX: 4096
>>>>> RX Mini: 0
>>>>> RX Jumbo: 0
>>>>> TX: 4096
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>> I also tweaked the coalesce params a little - the best for this config are:
>>> ethtool -c enp175s0f0
>>> Coalesce parameters for enp175s0f0:
>>> Adaptive RX: off TX: off
>>> stats-block-usecs: 0
>>> sample-interval: 0
>>> pkt-rate-low: 0
>>> pkt-rate-high: 0
>>> dmac: 32573
>>>
>>> rx-usecs: 40
>>> rx-frames: 128
>>> rx-usecs-irq: 0
>>> rx-frames-irq: 0
>>>
>>> tx-usecs: 8
>>> tx-frames: 8
>>> tx-usecs-irq: 0
>>> tx-frames-irq: 0
>>>
>>> rx-usecs-low: 0
>>> rx-frame-low: 0
>>> tx-usecs-low: 0
>>> tx-frame-low: 0
>>>
>>> rx-usecs-high: 0
>>> rx-frame-high: 0
>>> tx-usecs-high: 0
>>> tx-frame-high: 0
>>>
>>>
>>> Fewer drops on the RX side - and more pps forwarded overall.
>>>
>> How much improvement? Maybe we can improve our adaptive rx
>> coalescing to be efficient for this workload.
>>
>>
>
>