lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Wed, 28 Sep 2022 16:33:43 -0500 From: Thinh Tran <thinhtr@...ux.vnet.ibm.com> To: Manish Chopra <manishc@...vell.com>, "kuba@...nel.org" <kuba@...nel.org> Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, Ariel Elior <aelior@...vell.com>, "davem@...emloft.net" <davem@...emloft.net>, Sudarsana Reddy Kalluru <skalluru@...vell.com>, "pabeni@...hat.com" <pabeni@...hat.com>, "edumazet@...gle.com" <edumazet@...gle.com>, Alok Prasad <palok@...vell.com> Subject: Re: [EXT] [PATCH v2] bnx2x: Fix error recovering in switch configuration Hi Manish, I think we need the extra variable, nic_stopped, or perhaps defining a different state to determine the disablement of NAPI and freed IRQs. Since device gets the error, the NIC state is set to BNX2X_STATE_CLOSING_WAIT4_HALT (0x4000) 14132 static int bnx2x_eeh_nic_unload(struct bnx2x *bp) 14133 { 14134 bp->state = BNX2X_STATE_CLOSING_WAIT4_HALT; 14135 and stays in this state in EEH handler and driver reset code paths, when both trying to disabling the NAPI. On 9/21/2022 3:11 PM, Thinh Tran wrote: > Hi Manish, > Thanks for reviewing the patch. > > > On 9/19/2022 8:53 AM, Manish Chopra wrote: >> Hello Thinh, >> >>> -----Original Message----- >>> From: Thinh Tran <thinhtr@...ux.vnet.ibm.com> >>> Sent: Saturday, September 17, 2022 1:21 AM >>> To: kuba@...nel.org >>> Cc: netdev@...r.kernel.org; Ariel Elior <aelior@...vell.com>; >>> davem@...emloft.net; Manish Chopra <manishc@...vell.com>; Sudarsana >>> Reddy Kalluru <skalluru@...vell.com>; pabeni@...hat.com; >>> edumazet@...gle.com; Thinh Tran <thinhtr@...ux.vnet.ibm.com> >>> Subject: [EXT] [PATCH v2] bnx2x: Fix error recovering in switch >>> configuration >>> >>> External Email >>> >>> ---------------------------------------------------------------------- >>> As the BCM57810 and other I/O adapters are connected through a PCIe >>> switch, the bnx2x driver causes unexpected system hang/crash while >>> handling >>> PCIe switch errors, if its error handler is called after other >>> drivers' handlers. >>> >>> In this case, after numbers of bnx2x_tx_timout(), the >>> bnx2x_nic_unload() is called, frees up resources and calls >>> bnx2x_napi_disable(). Then when EEH calls its error handler, the >>> bnx2x_io_error_detected() and >>> bnx2x_io_slot_reset() also calling bnx2x_napi_disable() and freeing the >>> resources. >>> >>> Signed-off-by: Thinh Tran <thinhtr@...ux.vnet.ibm.com> >>> >>> v2: >>> - Check the state of the NIC before calling disable nappi >>> and freeing the IRQ >>> - Prevent recurrence of TX timeout by turning off the carrier, >>> calling netif_carrier_off() in bnx2x_tx_timeout() >>> - Check and bail out early if fp->page_pool already freed >>> --- >>> drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 2 + >>> .../net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 27 +++++++++---- >>> .../net/ethernet/broadcom/bnx2x/bnx2x_cmn.h | 3 ++ >>> .../net/ethernet/broadcom/bnx2x/bnx2x_main.c | 38 +++++++++---------- >>> .../net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c | 17 +++++---- >>> 5 files changed, 53 insertions(+), 34 deletions(-) >>> >>> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h >>> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h >>> index dd5945c4bfec..11280f0eb75b 100644 >>> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h >>> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h >>> @@ -1509,6 +1509,8 @@ struct bnx2x { >>> bool cnic_loaded; >>> struct cnic_eth_dev *(*cnic_probe)(struct net_device *); >>> >>> + bool nic_stopped; >>> + >> >> There is an already 'state' variable which holds the NIC state whether >> it was opened or closed. > Perhaps we could use that easily in >> bnx2x_io_slot_reset() to prevent > multiple NAPI disablement >> instead of adding a newer one. Please see below for ex - >> > In my early debug, I did checked for bp->state variable but I was unsure > what the NIC state was safe to disable the NAPI. > Thanks for the hint. > >> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c >> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c >> index 962253db25b8..c8e9b47208ed 100644 >> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c >> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c >> @@ -14256,13 +14256,16 @@ static pci_ers_result_t >> bnx2x_io_slot_reset(struct pci_dev *pdev) >> } >> bnx2x_drain_tx_queues(bp); >> bnx2x_send_unload_req(bp, UNLOAD_RECOVERY); >> - bnx2x_netif_stop(bp, 1); >> - bnx2x_del_all_napi(bp); >> >> - if (CNIC_LOADED(bp)) >> - bnx2x_del_all_napi_cnic(bp); >> + if (bp->state == BNX2X_STATE_OPEN) { >> + bnx2x_netif_stop(bp, 1); >> + bnx2x_del_all_napi(bp); >> >> - bnx2x_free_irq(bp); >> + if (CNIC_LOADED(bp)) >> + bnx2x_del_all_napi_cnic(bp); >> + >> + bnx2x_free_irq(bp); >> + } >> >> /* Report UNLOAD_DONE to MCP */ >> bnx2x_send_unload_done(bp, true); >> Using the v2 patch, I checked the NIC state at places before and after disabling the NAPI, BNX2X_ERR("before state flag = 0x%x \n", bp->state); if (!bp->nic_stopped) { BNX2X_ERR("after state flag = 0x%x \n", bp->state); bnx2x_netif_stop(bp, 1); bnx2x_del_all_napi(bp); .... the NIC state is always BNX2X_STATE_CLOSING_WAIT4_HALT (0x4000) when device get error injection. 1- When EEH handler code path was called first, the bnx2x_io_slot_reset() does disablement the NAPI and done. [28197.100795] bnx2x: [bnx2x_io_slot_reset:14212(eth3)]IO slot reset initializing... [28197.100959] bnx2x 0201:50:00.0: enabling device (0140 -> 0142) [28197.107249] bnx2x: [bnx2x_io_slot_reset:14228(eth3)]IO slot reset --> driver unload [28199.106281] bnx2x: [bnx2x_clean_tx_queue:1205(eth3)]timeout waiting for queue[1]: txdata->tx_pkt_prod(7270) != txdata->tx_pkt_cons(7269) [28201.106281] bnx2x: [bnx2x_clean_tx_queue:1205(eth3)]timeout waiting for queue[3]: txdata->tx_pkt_prod(10977) != txdata->tx_pkt_cons(10969) [28203.106280] bnx2x: [bnx2x_clean_tx_queue:1205(eth3)]timeout waiting for queue[4]: txdata->tx_pkt_prod(9532) != txdata->tx_pkt_cons(9514) [28205.106284] bnx2x: [bnx2x_clean_tx_queue:1205(eth3)]timeout waiting for queue[6]: txdata->tx_pkt_prod(2359) != txdata->tx_pkt_cons(2331) [28205.314280] bnx2x: [bnx2x_io_slot_reset:14246(eth3)]before -- state flag = 0x4000 <---- check NIC state [28205.314282] bnx2x: [bnx2x_io_slot_reset:14248(eth3)]after -- state flag = 0x4000 <---- check NIC state and disable the NAPI [28205.405009] PCI 0201:50:00.0#500000: EEH: bnx2x driver reports: 'recovered' ........ snip ...... [28213.525014] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered' [28213.525016] EEH: Notify device driver to resume [28213.525017] EEH: Beginning: 'resume' [28213.525018] PCI 0201:50:00.0#500000: EEH: Invoking bnx2x->resume() [28225.115287] bnx2x 0201:50:00.0 eth3: using MSI-X IRQs: sp 101 fp[0] 103 ... fp[7] 110 [28225.696425] PCI 0201:50:00.0#500000: EEH: bnx2x driver reports: 'none' 2- When the driver reset code path was called first, the bnx2x_chip_cleanup() does the disablement the NAPI, but NIC is resumed after the EEH handler code path was called. [13376.774278] bnx2x: [bnx2x_state_wait:312(eth3)]timeout waiting for state 2 [13376.774280] bnx2x: [bnx2x_func_stop:9118(eth3)]FUNC_STOP ramrod failed. Running a dry transaction [13376.774339] bnx2x: [bnx2x_chip_cleanup:9467(eth3)]before state flag = 0x4000 [13376.774341] bnx2x: [bnx2x_chip_cleanup:9469(eth3)]after state flag = 0x4000 [13376.774344] bnx2x: [bnx2x_igu_int_disable:891(eth3)]BUG! Proper val not read from IGU! [13391.774280] bnx2x: [bnx2x_fw_command:3044(eth3)]FW failed to respond! ........ snip ..... [13442.118385] PCI 0201:50:00.0#500000: EEH: Invoking bnx2x->slot_reset() [13442.118388] bnx2x: [bnx2x_io_slot_reset:14212(eth3)]IO slot reset initializing... [13442.118509] bnx2x 0201:50:00.0: enabling device (0140 -> 0142) [13442.124825] bnx2x: [bnx2x_io_slot_reset:14228(eth3)]IO slot reset --> driver unload [13442.154290] bnx2x: [bnx2x_io_slot_reset:14246(eth3)]before -- state flag = 0x4000 <--- check NIC state [13442.274291] PCI 0201:50:00.0#500000: EEH: bnx2x driver reports: 'recovered' ....... snip ......... [13442.424290] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered' [13442.424291] EEH: Notify device driver to resume [13442.424292] EEH: Beginning: 'resume' [13442.424293] PCI 0201:50:00.0#500000: EEH: Invoking bnx2x->resume() [13443.015280] bnx2x 0201:50:00.0 eth3: using MSI-X IRQs: sp 61 fp[0] 63 ... fp[7] 75 [13443.616423] PCI 0201:50:00.0#500000: EEH: bnx2x driver reports: 'none' >> Thanks, >> Manish > > It requires a specific system setup, I will test with v3 patch when the > system is available. > > Thanks, > Thinh Tran Any suggestions? Thanks, Thinh Tran
Powered by blists - more mailing lists