Message-ID: <20230816200120.603cc65a@kernel.org>
Date: Wed, 16 Aug 2023 20:01:20 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Jinpu Wang <jinpu.wang@...os.com>
Cc: netdev <netdev@...r.kernel.org>, Michael Chan <michael.chan@...adcom.com>
Subject: Re: [RFC] bnxt_en TX timeout detected, starting reset task, flapping link after

On Wed, 16 Aug 2023 20:51:25 +0200 Jinpu Wang wrote:
> Hi Michael, and folks on the list.

It seems you meant to CC Michael.. adding him now.
I don't recall anything like this. Could be a bad system...

> We hit two case on two server but same kind of configuration with
> following symptom,
> error started with:
> kern.err: Aug 15 12:21:39 ps502b-104 kernel: [325978.631877] bnxt_en
> 0000:45:00.0 eth0: TX timeout detected, starting reset task!
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251006] bnxt_en
> 0000:45:00.0 eth0: [0]: tx{fw_ring: 0 prod: 1e7 cons: 1e4}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251009] bnxt_en
> 0000:45:00.0 eth0: [0]: rx{fw_ring: 1 prod: 135} rx_agg{fw_ring: 9
> agg_prod: 31f sw_agg_prod: 31f}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251010] bnxt_en
> 0000:45:00.0 eth0: [0]: cp{fw_ring: 0 raw_cons: 5a498f}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251012] bnxt_en
> 0000:45:00.0 eth0: [1]: tx{fw_ring: 1 prod: 190 cons: 190}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251013] bnxt_en
> 0000:45:00.0 eth0: [1]: rx{fw_ring: 2 prod: ae} rx_agg{fw_ring: 10
> agg_prod: cb sw_agg_prod: cb}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251014] bnxt_en
> 0000:45:00.0 eth0: [1]: cp{fw_ring: 16 raw_cons: 644dda}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251015] bnxt_en
> 0000:45:00.0 eth0: [2]: tx{fw_ring: 2 prod: af cons: 9b}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251016] bnxt_en
> 0000:45:00.0 eth0: [2]: rx{fw_ring: 3 prod: 1b2} rx_agg{fw_ring: 11
> agg_prod: 41c sw_agg_prod: 41c}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251017] bnxt_en
> 0000:45:00.0 eth0: [2]: cp{fw_ring: 17 raw_cons: 517b28}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251018] bnxt_en
> 0000:45:00.0 eth0: [3]: tx{fw_ring: 3 prod: 5e cons: 5e}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251020] bnxt_en
> 0000:45:00.0 eth0: [3]: rx{fw_ring: 4 prod: f8} rx_agg{fw_ring: 12
> agg_prod: 19d sw_agg_prod: 19d}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251021] bnxt_en
> 0000:45:00.0 eth0: [3]: cp{fw_ring: 18 raw_cons: 5283a5}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251022] bnxt_en
> 0000:45:00.0 eth0: [4]: tx{fw_ring: 4 prod: d4 cons: d2}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251023] bnxt_en
> 0000:45:00.0 eth0: [4]: rx{fw_ring: 5 prod: 185} rx_agg{fw_ring: 13
> agg_prod: 34f sw_agg_prod: 34f}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251024] bnxt_en
> 0000:45:00.0 eth0: [4]: cp{fw_ring: 19 raw_cons: 4dc622}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251024] bnxt_en
> 0000:45:00.0 eth0: [5]: tx{fw_ring: 5 prod: fd cons: fd}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251026] bnxt_en
> 0000:45:00.0 eth0: [5]: rx{fw_ring: 6 prod: 177} rx_agg{fw_ring: 14
> agg_prod: 47 sw_agg_prod: 47}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251026] bnxt_en
> 0000:45:00.0 eth0: [5]: cp{fw_ring: 20 raw_cons: 770efd}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251027] bnxt_en
> 0000:45:00.0 eth0: [6]: tx{fw_ring: 6 prod: 63 cons: 120}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251028] bnxt_en
> 0000:45:00.0 eth0: [6]: rx{fw_ring: 7 prod: 77} rx_agg{fw_ring: 15
> agg_prod: 7aa sw_agg_prod: 7aa}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251029] bnxt_en
> 0000:45:00.0 eth0: [6]: cp{fw_ring: 21 raw_cons: 1a42064}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251030] bnxt_en
> 0000:45:00.0 eth0: [7]: tx{fw_ring: 7 prod: 179 cons: 179}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251031] bnxt_en
> 0000:45:00.0 eth0: [7]: rx{fw_ring: 8 prod: 1f8} rx_agg{fw_ring: 16
> agg_prod: 785 sw_agg_prod: 786}
> kern.info: Aug 15 12:22:32 ps502b-104 kernel: [326009.251032] bnxt_en
> 0000:45:00.0 eth0: [7]: cp{fw_ring: 22 raw_cons: 9fd6e8}
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326018.695007] bnxt_en
> 0000:45:00.0 eth0: TX timeout detected, starting reset task!
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326019.874938] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326019.884991] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326020.749461] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326020.759434] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326021.623141] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326021.633040] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326022.495977] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326022.505635] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.368700] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.378155] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.423938] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44628
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.433163] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44636
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.442137] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44634
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.450759] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44632
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326023.459092] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44630
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326024.323820] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326024.331783] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326025.194675] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326025.202706] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326026.066879] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326026.074397] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326026.937134] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326026.944341] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326027.806175] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326027.813006] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326028.676645] bnxt_en
> 0000:45:00.0 eth0: Resp cmpl intr err msg: 0x51
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326028.683506] bnxt_en
> 0000:45:00.0 eth0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.595581] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44644
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.602540] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44642
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.609230] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44650
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.615772] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44640
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.622094] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44648
> kern.err: Aug 15 12:23:33 ps502b-104 kernel: [326030.628320] bnxt_en
> 0000:45:00.0 eth0: Invalid hwrm seq id 44646
>
> it repeated for hours, until hard reboot the machine.
> In another cases, even after reboot once there is traffic the error
> occured again. we have to disable offload with ethtool, and then the
> system stabilizes.
>
> sudo ethtool -K eth0 rx off tx off
> Actual changes:
> tx-checksum-ipv4: off
> tx-checksum-ipv6: off
> tx-tcp-segmentation: off [not requested]
> tx-tcp6-segmentation: off [not requested]
> rx-checksum: off
> rx-gro-hw: off [not requested]
>
> Our env:
> sudo ethtool -i eth0
> driver: bnxt_en
> version: 5.10.136-pserver
> firmware-version: 218.0.153.0/pkg 218.0.169.0
> expansion-rom-version:
> bus-info: 0000:45:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: no
>
> sudo lspci -vvv -s 45:00.0
> 45:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416
> NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
> DeviceName: Broadcom 10G Ethernet #1
> Subsystem: Super Micro Computer Inc BCM57416 NetXtreme-E Dual-Media
> 10G RDMA Ethernet Controller
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin A routed to IRQ 64
> Region 0: Memory at 380d0110000 (64-bit, prefetchable) [size=64K]
> Region 2: Memory at 380d0000000 (64-bit, prefetchable) [size=1M]
> Region 4: Memory at 380d0122000 (64-bit, prefetchable) [size=8K]
> Expansion ROM at b3b40000 [disabled] [size=256K]
> Capabilities: [48] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
> Address: 0000000000000000 Data: 0000
> Capabilities: [a0] MSI-X: Enable+ Count=255 Masked-
> Vector table: BAR=4 offset=00000000
> PBA: BAR=4 offset=00000ff0
> Capabilities: [ac] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
> DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
> RlxdOrd+ ExtTag+ PhantFunc- AuxPwr+ NoSnoop+ FLReset-
> MaxPayload 512 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
> LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
> ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
> LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 8GT/s (ok), Width x8 (ok)
> TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
> 10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
> EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
> FRS- TPHComp- ExtTPHComp-
> AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
> AtomicOpsCtl: ReqEn-
> LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+
> EqualizationPhase1+
> EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
> Retimer- 2Retimers- CrosslinkRes: unsupported
> Capabilities: [100 v1] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
> MalfTLP- ECRC- UnsupReq+ ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
> MalfTLP- ECRC- UnsupReq+ ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+
> MalfTLP+ ECRC+ UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
> AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
> MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> HeaderLog: 04008001 4000200f 45020000 00000000
> Capabilities: [13c v1] Device Serial Number 3c-ec-ef-ff-fe-91-5c-92
> Capabilities: [150 v1] Power Budgeting <?>
> Capabilities: [160 v1] Virtual Channel
> Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
> Arb: Fixed- WRR32- WRR64- WRR128-
> Ctrl: ArbSelect=Fixed
> Status: InProgress-
> VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
> Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
> Status: NegoPending- InProgress-
> Capabilities: [180 v1] Vendor Specific Information: ID=0000 Rev=0 Len=020 <?>
> Capabilities: [1b0 v1] Latency Tolerance Reporting
> Max snoop latency: 1048576ns
> Max no snoop latency: 1048576ns
> Capabilities: [1b8 v1] Alternative Routing-ID Interpretation (ARI)
> ARICap: MFVC- ACS-, Next Function: 1
> ARICtl: MFVC- ACS-, Function Group: 0
> Capabilities: [230 v1] Transaction Processing Hints
> Interrupt vector mode supported
> Device specific mode supported
> Steering table in MSI-X table
> Capabilities: [300 v1] Secondary PCI Express
> LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> LaneErrStat: 0
> Capabilities: [200 v1] Precision Time Measurement
> PTMCap: Requester:+ Responder:- Root:-
> PTMClockGranularity: Unimplemented
> PTMControl: Enabled:- RootSelected:-
> PTMEffectiveGranularity: Unknown
> Kernel driver in use: bnxt_en
> Kernel modules: bnxt_en
>
>
>
>
> I checked git history, but can't find any bugfix related to it. The
> internet tells me it could be a
> firmware bug, but I can't find firmware from Broadcom site or supermicro site.
>
> Can you please give me some suggestions?
>
> Thx!
> Jinpu Wang @ IONOS Cloud
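
For anyone triaging a similar ring dump, here is a minimal sketch of how the tx{...} entries quoted above could be scanned for rings whose producer and consumer indices differ. It assumes prod/cons are printed in hex (values such as 1e7 suggest so) and that each tx{...} entry sits on a single line; the regex and the function name pending_tx_rings are illustrative and not part of the bnxt_en driver. A nonzero gap only means descriptors were still outstanding when the dump was taken, not that the ring is necessarily the cause of the timeout.

```python
import re

# Matches entries like: "[0]: tx{fw_ring: 0 prod: 1e7 cons: 1e4}"
# Assumption: prod/cons are hexadecimal, fw_ring and the ring index are decimal.
TX_RE = re.compile(
    r"\[(\d+)\]: tx\{fw_ring: (\d+) prod: ([0-9a-f]+) cons: ([0-9a-f]+)\}"
)


def pending_tx_rings(log_text):
    """Return (ring_index, prod, cons) for every TX ring where prod != cons."""
    pending = []
    for match in TX_RE.finditer(log_text):
        ring = int(match.group(1))
        prod = int(match.group(3), 16)
        cons = int(match.group(4), 16)
        if prod != cons:
            pending.append((ring, prod, cons))
    return pending


if __name__ == "__main__":
    # Two entries copied from the quoted log.
    sample = (
        "0000:45:00.0 eth0: [0]: tx{fw_ring: 0 prod: 1e7 cons: 1e4}\n"
        "0000:45:00.0 eth0: [1]: tx{fw_ring: 1 prod: 190 cons: 190}\n"
    )
    for ring, prod, cons in pending_tx_rings(sample):
        print(f"ring {ring}: prod={prod:#x} cons={cons:#x}")
```

Run against the full dump in the quoted report, this would flag rings 0, 2, 4 and 6, since those are the ones whose prod and cons values differ.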