lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Mon, 10 Oct 2022 17:01:27 -0500 From: Thinh Tran <thinhtr@...ux.vnet.ibm.com> To: Manish Chopra <manishc@...vell.com> Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, Ariel Elior <aelior@...vell.com>, "davem@...emloft.net" <davem@...emloft.net>, Sudarsana Reddy Kalluru <skalluru@...vell.com>, "pabeni@...hat.com" <pabeni@...hat.com>, "edumazet@...gle.com" <edumazet@...gle.com>, Alok Prasad <palok@...vell.com>, "kuba@...nel.org" <kuba@...nel.org> Subject: Re: [EXT] [PATCH v2] bnx2x: Fix error recovering in switch configuration Hi Manish, Do you have any other suggestion or using the extra variable, nic_stopped, in v2 patch is efficient enough to fix the issue? Thanks, Thinh Tran On 9/28/2022 4:33 PM, Thinh Tran wrote: > Hi Manish, > > I think we need the extra variable, nic_stopped, or perhaps defining a > different state to determine the disablement of NAPI and freed IRQs. > Since device gets the error, the NIC state is set to > BNX2X_STATE_CLOSING_WAIT4_HALT (0x4000) > > 14132 static int bnx2x_eeh_nic_unload(struct bnx2x *bp) > 14133 { > 14134 bp->state = BNX2X_STATE_CLOSING_WAIT4_HALT; > 14135 > > and stays in this state in EEH handler and driver reset code paths, when > both trying to disabling the NAPI. > > On 9/21/2022 3:11 PM, Thinh Tran wrote: >> Hi Manish, >> Thanks for reviewing the patch. >> >> >> On 9/19/2022 8:53 AM, Manish Chopra wrote: >>> Hello Thinh, >>> >>>> -----Original Message----- >>>> From: Thinh Tran <thinhtr@...ux.vnet.ibm.com> >>>> Sent: Saturday, September 17, 2022 1:21 AM >>>> To: kuba@...nel.org >>>> Cc: netdev@...r.kernel.org; Ariel Elior <aelior@...vell.com>; >>>> davem@...emloft.net; Manish Chopra <manishc@...vell.com>; Sudarsana >>>> Reddy Kalluru <skalluru@...vell.com>; pabeni@...hat.com; >>>> edumazet@...gle.com; Thinh Tran <thinhtr@...ux.vnet.ibm.com> >>>> Subject: [EXT] [PATCH v2] bnx2x: Fix error recovering in switch >>>> configuration >>>> >>>> External Email >>>> >>>> ---------------------------------------------------------------------- >>>> As the BCM57810 and other I/O adapters are connected through a PCIe >>>> switch, the bnx2x driver causes unexpected system hang/crash while >>>> handling >>>> PCIe switch errors, if its error handler is called after other >>>> drivers' handlers. >>>> >>>> In this case, after numbers of bnx2x_tx_timout(), the >>>> bnx2x_nic_unload() is called, frees up resources and calls >>>> bnx2x_napi_disable(). Then when EEH calls its error handler, the >>>> bnx2x_io_error_detected() and >>>> bnx2x_io_slot_reset() also calling bnx2x_napi_disable() and freeing the >>>> resources. >>>> >>>> Signed-off-by: Thinh Tran <thinhtr@...ux.vnet.ibm.com> >>>> >>>> v2: >>>> - Check the state of the NIC before calling disable nappi >>>> and freeing the IRQ >>>> - Prevent recurrence of TX timeout by turning off the carrier, >>>> calling netif_carrier_off() in bnx2x_tx_timeout() >>>> - Check and bail out early if fp->page_pool already freed >>>> --- >>>> drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 2 + >>>> .../net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 27 +++++++++---- >>>> .../net/ethernet/broadcom/bnx2x/bnx2x_cmn.h | 3 ++ >>>> .../net/ethernet/broadcom/bnx2x/bnx2x_main.c | 38 >>>> +++++++++---------- >>>> .../net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c | 17 +++++---- >>>> 5 files changed, 53 insertions(+), 34 deletions(-) >>>> >>>> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h >>>> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h >>>> index dd5945c4bfec..11280f0eb75b 100644 >>>> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h >>>> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h >>>> @@ -1509,6 +1509,8 @@ struct bnx2x { >>>> bool cnic_loaded; >>>> struct cnic_eth_dev *(*cnic_probe)(struct net_device *); >>>> >>>> + bool nic_stopped; >>>> + >>> >>> There is an already 'state' variable which holds the NIC state >>> whether it was opened or closed. > Perhaps we could use that easily >>> in bnx2x_io_slot_reset() to prevent >> multiple NAPI disablement >>> instead of adding a newer one. Please see below for ex - >>> >> In my early debug, I did checked for bp->state variable but I was >> unsure what the NIC state was safe to disable the NAPI. >> Thanks for the hint. >> >>> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c >>> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c >>> index 962253db25b8..c8e9b47208ed 100644 >>> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c >>> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c >>> @@ -14256,13 +14256,16 @@ static pci_ers_result_t >>> bnx2x_io_slot_reset(struct pci_dev *pdev) >>> } >>> bnx2x_drain_tx_queues(bp); >>> bnx2x_send_unload_req(bp, UNLOAD_RECOVERY); >>> - bnx2x_netif_stop(bp, 1); >>> - bnx2x_del_all_napi(bp); >>> >>> - if (CNIC_LOADED(bp)) >>> - bnx2x_del_all_napi_cnic(bp); >>> + if (bp->state == BNX2X_STATE_OPEN) { >>> + bnx2x_netif_stop(bp, 1); >>> + bnx2x_del_all_napi(bp); >>> >>> - bnx2x_free_irq(bp); >>> + if (CNIC_LOADED(bp)) >>> + bnx2x_del_all_napi_cnic(bp); >>> + >>> + bnx2x_free_irq(bp); >>> + } >>> >>> /* Report UNLOAD_DONE to MCP */ >>> bnx2x_send_unload_done(bp, true); >>> > Using the v2 patch, I checked the NIC state at places before and after > disabling the NAPI, > BNX2X_ERR("before state flag = 0x%x \n", bp->state); > if (!bp->nic_stopped) { > BNX2X_ERR("after state flag = 0x%x \n", bp->state); > bnx2x_netif_stop(bp, 1); > bnx2x_del_all_napi(bp); > .... > > the NIC state is always BNX2X_STATE_CLOSING_WAIT4_HALT (0x4000) when > device get error injection. > > 1- When EEH handler code path was called first, the > bnx2x_io_slot_reset() does disablement the NAPI and done. > [28197.100795] bnx2x: [bnx2x_io_slot_reset:14212(eth3)]IO slot reset > initializing... > [28197.100959] bnx2x 0201:50:00.0: enabling device (0140 -> 0142) > [28197.107249] bnx2x: [bnx2x_io_slot_reset:14228(eth3)]IO slot reset > --> driver unload > [28199.106281] bnx2x: [bnx2x_clean_tx_queue:1205(eth3)]timeout > waiting for queue[1]: txdata->tx_pkt_prod(7270) != > txdata->tx_pkt_cons(7269) > [28201.106281] bnx2x: [bnx2x_clean_tx_queue:1205(eth3)]timeout > waiting for queue[3]: txdata->tx_pkt_prod(10977) != > txdata->tx_pkt_cons(10969) > [28203.106280] bnx2x: [bnx2x_clean_tx_queue:1205(eth3)]timeout > waiting for queue[4]: txdata->tx_pkt_prod(9532) != > txdata->tx_pkt_cons(9514) > [28205.106284] bnx2x: [bnx2x_clean_tx_queue:1205(eth3)]timeout > waiting for queue[6]: txdata->tx_pkt_prod(2359) != > txdata->tx_pkt_cons(2331) > [28205.314280] bnx2x: [bnx2x_io_slot_reset:14246(eth3)]before -- > state flag = 0x4000 <---- check NIC state > [28205.314282] bnx2x: [bnx2x_io_slot_reset:14248(eth3)]after -- > state flag = 0x4000 <---- check NIC state and disable the NAPI > [28205.405009] PCI 0201:50:00.0#500000: EEH: bnx2x driver reports: > 'recovered' > ........ snip ...... > [28213.525014] EEH: Finished:'slot_reset' with aggregate recovery > state:'recovered' > [28213.525016] EEH: Notify device driver to resume > [28213.525017] EEH: Beginning: 'resume' > [28213.525018] PCI 0201:50:00.0#500000: EEH: Invoking bnx2x->resume() > [28225.115287] bnx2x 0201:50:00.0 eth3: using MSI-X IRQs: sp 101 > fp[0] 103 ... fp[7] 110 > [28225.696425] PCI 0201:50:00.0#500000: EEH: bnx2x driver reports: > 'none' > > > 2- When the driver reset code path was called first, the > bnx2x_chip_cleanup() does the disablement the NAPI, but NIC is resumed > after the EEH handler code path was called. > [13376.774278] bnx2x: [bnx2x_state_wait:312(eth3)]timeout waiting > for state 2 > [13376.774280] bnx2x: [bnx2x_func_stop:9118(eth3)]FUNC_STOP ramrod > failed. Running a dry transaction > [13376.774339] bnx2x: [bnx2x_chip_cleanup:9467(eth3)]before state > flag = 0x4000 > [13376.774341] bnx2x: [bnx2x_chip_cleanup:9469(eth3)]after state > flag = 0x4000 > [13376.774344] bnx2x: [bnx2x_igu_int_disable:891(eth3)]BUG! Proper > val not read from IGU! > [13391.774280] bnx2x: [bnx2x_fw_command:3044(eth3)]FW failed to > respond! > ........ snip ..... > [13442.118385] PCI 0201:50:00.0#500000: EEH: Invoking > bnx2x->slot_reset() > [13442.118388] bnx2x: [bnx2x_io_slot_reset:14212(eth3)]IO slot reset > initializing... > [13442.118509] bnx2x 0201:50:00.0: enabling device (0140 -> 0142) > [13442.124825] bnx2x: [bnx2x_io_slot_reset:14228(eth3)]IO slot reset > --> driver unload > [13442.154290] bnx2x: [bnx2x_io_slot_reset:14246(eth3)]before -- > state flag = 0x4000 <--- check NIC state > [13442.274291] PCI 0201:50:00.0#500000: EEH: bnx2x driver reports: > 'recovered' > ....... snip ......... > [13442.424290] EEH: Finished:'slot_reset' with aggregate recovery > state:'recovered' > [13442.424291] EEH: Notify device driver to resume > [13442.424292] EEH: Beginning: 'resume' > [13442.424293] PCI 0201:50:00.0#500000: EEH: Invoking bnx2x->resume() > [13443.015280] bnx2x 0201:50:00.0 eth3: using MSI-X IRQs: sp 61 > fp[0] 63 ... fp[7] 75 > [13443.616423] PCI 0201:50:00.0#500000: EEH: bnx2x driver reports: > 'none' > > >>> Thanks, >>> Manish >> >> It requires a specific system setup, I will test with v3 patch when >> the system is available. >> >> Thanks, >> Thinh Tran > > Any suggestions? > Thanks, > Thinh Tran
Powered by blists - more mailing lists