lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <BY3PR18MB4612A8ECAB3E238E9D9E06EBAB299@BY3PR18MB4612.namprd18.prod.outlook.com> Date: Mon, 17 Oct 2022 11:14:06 +0000 From: Manish Chopra <manishc@...vell.com> To: Thinh Tran <thinhtr@...ux.vnet.ibm.com> CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, Ariel Elior <aelior@...vell.com>, "davem@...emloft.net" <davem@...emloft.net>, Sudarsana Reddy Kalluru <skalluru@...vell.com>, "pabeni@...hat.com" <pabeni@...hat.com>, "edumazet@...gle.com" <edumazet@...gle.com>, Alok Prasad <palok@...vell.com>, "kuba@...nel.org" <kuba@...nel.org> Subject: RE: [EXT] [PATCH v2] bnx2x: Fix error recovering in switch configuration Hello Thinh, > -----Original Message----- > From: Thinh Tran <thinhtr@...ux.vnet.ibm.com> > Sent: Tuesday, October 11, 2022 3:31 AM > To: Manish Chopra <manishc@...vell.com> > Cc: netdev@...r.kernel.org; Ariel Elior <aelior@...vell.com>; > davem@...emloft.net; Sudarsana Reddy Kalluru <skalluru@...vell.com>; > pabeni@...hat.com; edumazet@...gle.com; Alok Prasad > <palok@...vell.com>; kuba@...nel.org > Subject: Re: [EXT] [PATCH v2] bnx2x: Fix error recovering in switch > configuration > > Hi Manish, > Do you have any other suggestion or using the extra variable, nic_stopped, in > v2 patch is efficient enough to fix the issue? I am fine with this if using existing 'state' var is not a suitable option here. Thanks. > > Thanks, > Thinh Tran > > On 9/28/2022 4:33 PM, Thinh Tran wrote: > > Hi Manish, > > > > I think we need the extra variable, nic_stopped, or perhaps defining a > > different state to determine the disablement of NAPI and freed IRQs. > > Since device gets the error, the NIC state is set to > > BNX2X_STATE_CLOSING_WAIT4_HALT (0x4000) > > > > 14132 static int bnx2x_eeh_nic_unload(struct bnx2x *bp) > > 14133 { > > 14134 bp->state = BNX2X_STATE_CLOSING_WAIT4_HALT; > > 14135 > > > > and stays in this state in EEH handler and driver reset code paths, > > when both trying to disabling the NAPI. > > > > On 9/21/2022 3:11 PM, Thinh Tran wrote: > >> Hi Manish, > >> Thanks for reviewing the patch. > >> > >> > >> On 9/19/2022 8:53 AM, Manish Chopra wrote: > >>> Hello Thinh, > >>> > >>>> -----Original Message----- > >>>> From: Thinh Tran <thinhtr@...ux.vnet.ibm.com> > >>>> Sent: Saturday, September 17, 2022 1:21 AM > >>>> To: kuba@...nel.org > >>>> Cc: netdev@...r.kernel.org; Ariel Elior <aelior@...vell.com>; > >>>> davem@...emloft.net; Manish Chopra <manishc@...vell.com>; > Sudarsana > >>>> Reddy Kalluru <skalluru@...vell.com>; pabeni@...hat.com; > >>>> edumazet@...gle.com; Thinh Tran <thinhtr@...ux.vnet.ibm.com> > >>>> Subject: [EXT] [PATCH v2] bnx2x: Fix error recovering in switch > >>>> configuration > >>>> > >>>> External Email > >>>> > >>>> ------------------------------------------------------------------- > >>>> --- As the BCM57810 and other I/O adapters are connected through a > >>>> PCIe switch, the bnx2x driver causes unexpected system hang/crash > >>>> while handling PCIe switch errors, if its error handler is called > >>>> after other drivers' handlers. > >>>> > >>>> In this case, after numbers of bnx2x_tx_timout(), the > >>>> bnx2x_nic_unload() is called, frees up resources and calls > >>>> bnx2x_napi_disable(). Then when EEH calls its error handler, the > >>>> bnx2x_io_error_detected() and > >>>> bnx2x_io_slot_reset() also calling bnx2x_napi_disable() and freeing > >>>> the resources. > >>>> > >>>> Signed-off-by: Thinh Tran <thinhtr@...ux.vnet.ibm.com> > >>>> > >>>> v2: > >>>> - Check the state of the NIC before calling disable nappi > >>>> and freeing the IRQ > >>>> - Prevent recurrence of TX timeout by turning off the carrier, > >>>> calling netif_carrier_off() in bnx2x_tx_timeout() > >>>> - Check and bail out early if fp->page_pool already freed > >>>> --- > >>>> drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 2 + > >>>> .../net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 27 +++++++++---- > >>>> .../net/ethernet/broadcom/bnx2x/bnx2x_cmn.h | 3 ++ > >>>> .../net/ethernet/broadcom/bnx2x/bnx2x_main.c | 38 > >>>> +++++++++---------- > >>>> .../net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c | 17 +++++---- > >>>> 5 files changed, 53 insertions(+), 34 deletions(-) > >>>> > >>>> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h > >>>> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h > >>>> index dd5945c4bfec..11280f0eb75b 100644 > >>>> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h > >>>> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h > >>>> @@ -1509,6 +1509,8 @@ struct bnx2x { > >>>> bool cnic_loaded; > >>>> struct cnic_eth_dev *(*cnic_probe)(struct net_device *); > >>>> > >>>> + bool nic_stopped; > >>>> + > >>> > >>> There is an already 'state' variable which holds the NIC state > >>> whether it was opened or closed. > Perhaps we could use that easily > >>> in bnx2x_io_slot_reset() to prevent > >> multiple NAPI disablement > >>> instead of adding a newer one. Please see below for ex - > >>> > >> In my early debug, I did checked for bp->state variable but I was > >> unsure what the NIC state was safe to disable the NAPI. > >> Thanks for the hint. > >> > >>> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c > >>> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c > >>> index 962253db25b8..c8e9b47208ed 100644 > >>> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c > >>> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c > >>> @@ -14256,13 +14256,16 @@ static pci_ers_result_t > >>> bnx2x_io_slot_reset(struct pci_dev *pdev) > >>> } > >>> bnx2x_drain_tx_queues(bp); > >>> bnx2x_send_unload_req(bp, UNLOAD_RECOVERY); > >>> - bnx2x_netif_stop(bp, 1); > >>> - bnx2x_del_all_napi(bp); > >>> > >>> - if (CNIC_LOADED(bp)) > >>> - bnx2x_del_all_napi_cnic(bp); > >>> + if (bp->state == BNX2X_STATE_OPEN) { > >>> + bnx2x_netif_stop(bp, 1); > >>> + bnx2x_del_all_napi(bp); > >>> > >>> - bnx2x_free_irq(bp); > >>> + if (CNIC_LOADED(bp)) > >>> + bnx2x_del_all_napi_cnic(bp); > >>> + > >>> + bnx2x_free_irq(bp); > >>> + } > >>> > >>> /* Report UNLOAD_DONE to MCP */ > >>> bnx2x_send_unload_done(bp, true); > >>> > > Using the v2 patch, I checked the NIC state at places before and after > > disabling the NAPI, > > BNX2X_ERR("before state flag = 0x%x \n", bp->state); > > if (!bp->nic_stopped) { > > BNX2X_ERR("after state flag = 0x%x \n", bp->state); > > bnx2x_netif_stop(bp, 1); > > bnx2x_del_all_napi(bp); > > .... > > > > the NIC state is always BNX2X_STATE_CLOSING_WAIT4_HALT (0x4000) > when > > device get error injection. > > > > 1- When EEH handler code path was called first, the > > bnx2x_io_slot_reset() does disablement the NAPI and done. > > [28197.100795] bnx2x: [bnx2x_io_slot_reset:14212(eth3)]IO slot > > reset initializing... > > [28197.100959] bnx2x 0201:50:00.0: enabling device (0140 -> 0142) > > [28197.107249] bnx2x: [bnx2x_io_slot_reset:14228(eth3)]IO slot > > reset > > --> driver unload > > [28199.106281] bnx2x: [bnx2x_clean_tx_queue:1205(eth3)]timeout > > waiting for queue[1]: txdata->tx_pkt_prod(7270) != > > txdata->tx_pkt_cons(7269) > > [28201.106281] bnx2x: [bnx2x_clean_tx_queue:1205(eth3)]timeout > > waiting for queue[3]: txdata->tx_pkt_prod(10977) != > > txdata->tx_pkt_cons(10969) > > [28203.106280] bnx2x: [bnx2x_clean_tx_queue:1205(eth3)]timeout > > waiting for queue[4]: txdata->tx_pkt_prod(9532) != > > txdata->tx_pkt_cons(9514) > > [28205.106284] bnx2x: [bnx2x_clean_tx_queue:1205(eth3)]timeout > > waiting for queue[6]: txdata->tx_pkt_prod(2359) != > > txdata->tx_pkt_cons(2331) > > [28205.314280] bnx2x: [bnx2x_io_slot_reset:14246(eth3)]before -- > > state flag = 0x4000 <---- check NIC state > > [28205.314282] bnx2x: [bnx2x_io_slot_reset:14248(eth3)]after -- > > state flag = 0x4000 <---- check NIC state and disable the NAPI > > [28205.405009] PCI 0201:50:00.0#500000: EEH: bnx2x driver reports: > > 'recovered' > > ........ snip ...... > > [28213.525014] EEH: Finished:'slot_reset' with aggregate recovery > > state:'recovered' > > [28213.525016] EEH: Notify device driver to resume > > [28213.525017] EEH: Beginning: 'resume' > > [28213.525018] PCI 0201:50:00.0#500000: EEH: Invoking > > bnx2x->resume() > > [28225.115287] bnx2x 0201:50:00.0 eth3: using MSI-X IRQs: sp 101 > > fp[0] 103 ... fp[7] 110 > > [28225.696425] PCI 0201:50:00.0#500000: EEH: bnx2x driver reports: > > 'none' > > > > > > 2- When the driver reset code path was called first, the > > bnx2x_chip_cleanup() does the disablement the NAPI, but NIC is resumed > > after the EEH handler code path was called. > > [13376.774278] bnx2x: [bnx2x_state_wait:312(eth3)]timeout waiting > > for state 2 > > [13376.774280] bnx2x: [bnx2x_func_stop:9118(eth3)]FUNC_STOP > > ramrod failed. Running a dry transaction > > [13376.774339] bnx2x: [bnx2x_chip_cleanup:9467(eth3)]before state > > flag = 0x4000 > > [13376.774341] bnx2x: [bnx2x_chip_cleanup:9469(eth3)]after state > > flag = 0x4000 > > [13376.774344] bnx2x: [bnx2x_igu_int_disable:891(eth3)]BUG! Proper > > val not read from IGU! > > [13391.774280] bnx2x: [bnx2x_fw_command:3044(eth3)]FW failed to > > respond! > > ........ snip ..... > > [13442.118385] PCI 0201:50:00.0#500000: EEH: Invoking > > bnx2x->slot_reset() > > [13442.118388] bnx2x: [bnx2x_io_slot_reset:14212(eth3)]IO slot > > reset initializing... > > [13442.118509] bnx2x 0201:50:00.0: enabling device (0140 -> 0142) > > [13442.124825] bnx2x: [bnx2x_io_slot_reset:14228(eth3)]IO slot > > reset > > --> driver unload > > [13442.154290] bnx2x: [bnx2x_io_slot_reset:14246(eth3)]before -- > > state flag = 0x4000 <--- check NIC state > > [13442.274291] PCI 0201:50:00.0#500000: EEH: bnx2x driver reports: > > 'recovered' > > ....... snip ......... > > [13442.424290] EEH: Finished:'slot_reset' with aggregate recovery > > state:'recovered' > > [13442.424291] EEH: Notify device driver to resume > > [13442.424292] EEH: Beginning: 'resume' > > [13442.424293] PCI 0201:50:00.0#500000: EEH: Invoking > > bnx2x->resume() > > [13443.015280] bnx2x 0201:50:00.0 eth3: using MSI-X IRQs: sp 61 > > fp[0] 63 ... fp[7] 75 > > [13443.616423] PCI 0201:50:00.0#500000: EEH: bnx2x driver reports: > > 'none' > > > > > >>> Thanks, > >>> Manish > >> > >> It requires a specific system setup, I will test with v3 patch when > >> the system is available. > >> > >> Thanks, > >> Thinh Tran > > > > Any suggestions? > > Thanks, > > Thinh Tran
Powered by blists - more mailing lists