Date: Wed, 16 Aug 2023 20:51:25 +0200
From: Jinpu Wang <jinpu.wang@...os.com>
Cc: netdev <netdev@...r.kernel.org>
Subject: [RFC] bnxt_en TX timeout detected, starting reset task, flapping link after

Hi Michael and folks on the list,

We hit this problem twice, on two servers with the same kind of
configuration. The symptoms are as follows.
The errors started with:
kern.err: Aug 15 12:21:39 ps502b-104 kernel: [325978.631877] bnxt_en
0000:45:00.0 eth0: TX timeout detected, starting reset task!
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251006] bnxt_en
0000:45:00.0 eth0: [0]: tx{fw_ring: 0 prod: 1e7 cons: 1e4}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251009] bnxt_en
0000:45:00.0 eth0: [0]: rx{fw_ring: 1 prod: 135} rx_agg{fw_ring: 9
agg_prod: 31f sw_agg_prod: 31f}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251010] bnxt_en
0000:45:00.0 eth0: [0]: cp{fw_ring: 0 raw_cons: 5a498f}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251012] bnxt_en
0000:45:00.0 eth0: [1]: tx{fw_ring: 1 prod: 190 cons: 190}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251013] bnxt_en
0000:45:00.0 eth0: [1]: rx{fw_ring: 2 prod: ae} rx_agg{fw_ring: 10
agg_prod: cb sw_agg_prod: cb}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251014] bnxt_en
0000:45:00.0 eth0: [1]: cp{fw_ring: 16 raw_cons: 644dda}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251015] bnxt_en
0000:45:00.0 eth0: [2]: tx{fw_ring: 2 prod: af cons: 9b}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251016] bnxt_en
0000:45:00.0 eth0: [2]: rx{fw_ring: 3 prod: 1b2} rx_agg{fw_ring: 11
agg_prod: 41c sw_agg_prod: 41c}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251017] bnxt_en
0000:45:00.0 eth0: [2]: cp{fw_ring: 17 raw_cons: 517b28}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251018] bnxt_en
0000:45:00.0 eth0: [3]: tx{fw_ring: 3 prod: 5e cons: 5e}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251020] bnxt_en
0000:45:00.0 eth0: [3]: rx{fw_ring: 4 prod: f8} rx_agg{fw_ring: 12
agg_prod: 19d sw_agg_prod: 19d}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251021] bnxt_en
0000:45:00.0 eth0: [3]: cp{fw_ring: 18 raw_cons: 5283a5}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251022] bnxt_en
0000:45:00.0 eth0: [4]: tx{fw_ring: 4 prod: d4 cons: d2}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251023] bnxt_en
0000:45:00.0 eth0: [4]: rx{fw_ring: 5 prod: 185} rx_agg{fw_ring: 13
agg_prod: 34f sw_agg_prod: 34f}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251024] bnxt_en
0000:45:00.0 eth0: [4]: cp{fw_ring: 19 raw_cons: 4dc622}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251024] bnxt_en
0000:45:00.0 eth0: [5]: tx{fw_ring: 5 prod: fd cons: fd}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251026] bnxt_en
0000:45:00.0 eth0: [5]: rx{fw_ring: 6 prod: 177} rx_agg{fw_ring: 14
agg_prod: 47 sw_agg_prod: 47}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251026] bnxt_en
0000:45:00.0 eth0: [5]: cp{fw_ring: 20 raw_cons: 770efd}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251027] bnxt_en
0000:45:00.0 eth0: [6]: tx{fw_ring: 6 prod: 63 cons: 120}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251028] bnxt_en
0000:45:00.0 eth0: [6]: rx{fw_ring: 7 prod: 77} rx_agg{fw_ring: 15
agg_prod: 7aa sw_agg_prod: 7aa}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251029] bnxt_en
0000:45:00.0 eth0: [6]: cp{fw_ring: 21 raw_cons: 1a42064}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251030] bnxt_en
0000:45:00.0 eth0: [7]: tx{fw_ring: 7 prod: 179 cons: 179}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251031] bnxt_en
0000:45:00.0 eth0: [7]: rx{fw_ring: 8 prod: 1f8} rx_agg{fw_ring: 16
agg_prod: 785 sw_agg_prod: 786}
kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251032] bnxt_en
0000:45:00.0 eth0: [7]: cp{fw_ring: 22 raw_cons: 9fd6e8}
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326018.695007] bnxt_en
0000:45:00.0 eth0: TX timeout detected, starting reset task!
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326019.874938] bnxt_en
0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326019.884991] bnxt_en
0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326020.749461] bnxt_en
0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326020.759434] bnxt_en
0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326021.623141] bnxt_en
0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326021.633040] bnxt_en
0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326022.495977] bnxt_en
0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326022.505635] bnxt_en
0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.368700] bnxt_en
0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.378155] bnxt_en
0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.423938] bnxt_en
0000:45:00.0 eth0: Invalid hwrm seq id 44628
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.433163] bnxt_en
0000:45:00.0 eth0: Invalid hwrm seq id 44636
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.442137] bnxt_en
0000:45:00.0 eth0: Invalid hwrm seq id 44634
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.450759] bnxt_en
0000:45:00.0 eth0: Invalid hwrm seq id 44632
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.459092] bnxt_en
0000:45:00.0 eth0: Invalid hwrm seq id 44630
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326024.323820] bnxt_en
0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326024.331783] bnxt_en
0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326025.194675] bnxt_en
0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326025.202706] bnxt_en
0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326026.066879] bnxt_en
0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326026.074397] bnxt_en
0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326026.937134] bnxt_en
0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326026.944341] bnxt_en
0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326027.806175] bnxt_en
0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326027.813006] bnxt_en
0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326028.676645] bnxt_en
0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326028.683506] bnxt_en
0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.595581] bnxt_en
0000:45:00.0 eth0: Invalid hwrm seq id 44644
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.602540] bnxt_en
0000:45:00.0 eth0: Invalid hwrm seq id 44642
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.609230] bnxt_en
0000:45:00.0 eth0: Invalid hwrm seq id 44650
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.615772] bnxt_en
0000:45:00.0 eth0: Invalid hwrm seq id 44640
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.622094] bnxt_en
0000:45:00.0 eth0: Invalid hwrm seq id 44648
kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.628320] bnxt_en
0000:45:00.0 eth0: Invalid hwrm seq id 44646
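
As a rough aid for reading the log above, here is a minimal Python sketch. It lists the TX rings whose prod and cons indices differ at the time of the dump, and decodes the rc value printed by the hwrm_ring_free failures. The function names are made up for illustration. The sketch assumes the prod/cons values are hexadecimal (values such as 1e7 suggest so) and that rc is a 32-bit two's-complement errno. It does not compute how many descriptors are pending, because the ring size is not shown in the log.

```python
import errno
import re

# Per-ring TX state line as it appears in the log above.
TX_RE = re.compile(
    r"\[(\d+)\]: tx\{fw_ring: (\d+) prod: ([0-9a-f]+) cons: ([0-9a-f]+)\}"
)
# Return code printed by the "hwrm_ring_free ... failed" lines.
RC_RE = re.compile(r"failed\. rc:([0-9a-f]{8})")


def unfinished_tx_rings(log_text):
    """Return (ring_index, prod, cons) for each TX ring whose indices differ."""
    flat = " ".join(log_text.split())  # undo the mail client's line wrapping
    rings = []
    for m in TX_RE.finditer(flat):
        prod, cons = int(m.group(3), 16), int(m.group(4), 16)
        if prod != cons:
            rings.append((int(m.group(1)), prod, cons))
    return rings


def decode_rc(hex_rc):
    """Read an 8-digit hex rc as signed 32-bit and look up the errno name."""
    value = int(hex_rc, 16)
    if value >= 0x80000000:
        value -= 0x100000000
    return value, errno.errorcode.get(-value, "unknown")


# A few lines copied from the log above (timestamps trimmed).
sample = """\
bnxt_en
0000:45:00.0 eth0: [0]: tx{fw_ring: 0 prod: 1e7 cons: 1e4}
bnxt_en
0000:45:00.0 eth0: [1]: tx{fw_ring: 1 prod: 190 cons: 190}
bnxt_en
0000:45:00.0 eth0: [6]: tx{fw_ring: 6 prod: 63 cons: 120}
bnxt_en
0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
"""

if __name__ == "__main__":
    for ring, prod, cons in unfinished_tx_rings(sample):
        print(f"ring {ring}: prod={prod:#x} cons={cons:#x}")
    for rc in RC_RE.findall(" ".join(sample.split())):
        print("rc", rc, "->", decode_rc(rc))
```

Applied to the full dump above, the same check flags TX rings 0, 2, 4 and 6, while rings 1, 3, 5 and 7 have prod equal to cons. The value fffffff0 reads as -16, which is EBUSY on Linux.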

This repeated for hours, until we hard-rebooted the machine.
In the other case, the errors came back even after a reboot, as soon as
there was traffic. We had to disable offloads with ethtool; after that the
system stabilized.

sudo ethtool -K eth0 rx off tx off
Actual changes:
tx-checksum-ipv4: off
tx-checksum-ipv6: off
tx-tcp-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]
rx-checksum: off
rx-gro-hw: off [not requested]
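
The "Actual changes" output above shows that turning off rx/tx checksumming also switched off features that were not asked for. A small Python sketch (hypothetical helper name, parsing only the text format shown above) makes that list explicit, which may help when narrowing down which offload matters:

```python
import re

# One "feature: on|off [optional note]" line from the ethtool -K output.
LINE_RE = re.compile(r"^\s*([\w-]+):\s+(on|off)(?:\s+\[(.+?)\])?\s*$")


def parse_feature_changes(text):
    """Map feature name -> (enabled, note) for each change line in the text."""
    changes = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            name, state, note = m.groups()
            changes[name] = (state == "on", note)
    return changes


# Output copied from the ethtool -K run above.
output = """\
Actual changes:
tx-checksum-ipv4: off
tx-checksum-ipv6: off
tx-tcp-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]
rx-checksum: off
rx-gro-hw: off [not requested]
"""

changes = parse_feature_changes(output)
# Features that changed only as a side effect of the requested ones.
side_effects = sorted(n for n, (_, note) in changes.items()
                      if note == "not requested")

if __name__ == "__main__":
    print("side effects:", side_effects)
```

So besides checksum offload, TCP segmentation offload (IPv4 and IPv6) and hardware GRO were disabled as well. The output alone does not say which of them is responsible for the system stabilizing.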

Our env:
sudo ethtool -i eth0
driver: bnxt_en
version: 5.10.136-pserver
firmware-version: 218.0.153.0/pkg 218.0.169.0
expansion-rom-version:
bus-info: 0000:45:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

sudo lspci  -vvv -s 45:00.0
45:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416
NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
DeviceName: Broadcom 10G Ethernet #1
Subsystem: Super Micro Computer Inc BCM57416 NetXtreme-E Dual-Media
10G RDMA Ethernet Controller
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 64
Region 0: Memory at 380d0110000 (64-bit, prefetchable) [size=64K]
Region 2: Memory at 380d0000000 (64-bit, prefetchable) [size=1M]
Region 4: Memory at 380d0122000 (64-bit, prefetchable) [size=8K]
Expansion ROM at b3b40000 [disabled] [size=256K]
Capabilities: [48] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
Address: 0000000000000000  Data: 0000
Capabilities: [a0] MSI-X: Enable+ Count=255 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00000ff0
Capabilities: [ac] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr+ NoSnoop+ FLReset-
MaxPayload 512 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x8 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+
EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+
MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 04008001 4000200f 45020000 00000000
Capabilities: [13c v1] Device Serial Number 3c-ec-ef-ff-fe-91-5c-92
Capabilities: [150 v1] Power Budgeting <?>
Capabilities: [160 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [180 v1] Vendor Specific Information: ID=0000 Rev=0 Len=020 <?>
Capabilities: [1b0 v1] Latency Tolerance Reporting
Max snoop latency: 1048576ns
Max no snoop latency: 1048576ns
Capabilities: [1b8 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [230 v1] Transaction Processing Hints
Interrupt vector mode supported
Device specific mode supported
Steering table in MSI-X table
Capabilities: [300 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [200 v1] Precision Time Measurement
PTMCap: Requester:+ Responder:- Root:-
PTMClockGranularity: Unimplemented
PTMControl: Enabled:- RootSelected:-
PTMEffectiveGranularity: Unknown
Kernel driver in use: bnxt_en
Kernel modules: bnxt_en
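
For reading the lspci dump above, a minimal Python sketch that lists which bits are set ('+') on a given status line, including wrapped continuation lines. The helper name is made up, and the parsing assumes the flattened layout shown above, where a line without a leading "Label:" continues the previous one. Whether any of these bits are related to the TX timeouts is not established here.

```python
import re

LABEL_RE = re.compile(r"^([\w-]+):\s*(.*)$")


def set_flags(lspci_text, label):
    """Return the flag names marked '+' on the line(s) belonging to label."""
    flags = []
    collecting = False
    for line in lspci_text.splitlines():
        stripped = line.strip()
        m = LABEL_RE.match(stripped)
        if m:
            # A new "Label:" line starts; collect only if it is the one wanted.
            collecting = (m.group(1) == label)
            body = m.group(2)
        else:
            # No label at the start: treat as continuation of the previous line.
            body = stripped
        if collecting:
            flags += [tok[:-1] for tok in body.split() if tok.endswith("+")]
    return flags


# Lines copied from the lspci output above.
dump = """\
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
MalfTLP- ECRC- UnsupReq+ ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
"""

if __name__ == "__main__":
    for label in ("UESta", "UEMsk", "CESta", "CEMsk"):
        print(label, set_flags(dump, label))
```

On the lines above this reports UnsupReq in both UESta and UEMsk, AdvNonFatalErr in CESta, and nothing set in CEMsk.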




I checked the git history, but couldn't find any bug fix related to this.
From what I found online, it could be a firmware bug, but I can't find
newer firmware on the Broadcom or Supermicro sites.

Can you please give me some suggestions?

Thx!
Jinpu Wang @ IONOS Cloud

View attachment "bnxt_en.txt" of type "text/plain" (315457 bytes)
