[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cd2b2456b1853d71b1c84c152164732f3a39f4dc.camel@redhat.com>
Date: Tue, 01 Aug 2023 10:30:52 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: Jakub Kicinski <kuba@...nel.org>, Thinh Tran <thinhtr@...ux.vnet.ibm.com>
Cc: aelior@...vell.com, davem@...emloft.net, edumazet@...gle.com,
manishc@...vell.com, netdev@...r.kernel.org, skalluru@...vell.com,
drc@...ux.vnet.ibm.com, abdhalee@...ibm.com, simon.horman@...igine.com
Subject: Re: [PATCH v4] bnx2x: Fix error recovering in switch configuration
On Mon, 2023-07-31 at 17:47 -0700, Jakub Kicinski wrote:
> On Fri, 28 Jul 2023 16:11:33 -0500 Thinh Tran wrote:
> > As the BCM57810 and other I/O adapters are connected
> > through a PCIe switch, the bnx2x driver causes unexpected
> > system hang/crash while handling PCIe switch errors, if
> > its error handler is called after other drivers' handlers.
> >
> > In this case, after numbers of bnx2x_tx_timout(), the
> > bnx2x_nic_unload() is called, frees up resources and
> > calls bnx2x_napi_disable(). Then when EEH calls its
> > error handler, the bnx2x_io_error_detected() and
> > bnx2x_io_slot_reset() also calling bnx2x_napi_disable()
> > and freeing the resources.
> >
> >
> > Signed-off-by: Thinh Tran <thinhtr@...ux.vnet.ibm.com>
> > Reviewed-by: Manish Chopra <manishc@...vell.com>
> > Tested-by: Abdul Haleem <abdhalee@...ibm.com>
> > Tested-by: David Christensen <drc@...ux.vnet.ibm.com>
> >
> > Reviewed-by: Simon Horman <simon.horman@...igine.com>
>
> nit: no empty lines between tags
>
> There should be a "---" line between the tags and changelog.
>
> > v4:
> > - factoring common code into new function bnx2x_stop_nic()
> > that disables and releases IRQs and NAPIs
> > v3:
> > - no changes, just repatched to the latest driver level
> > - updated the reviewed-by Manish in October, 2022
> >
> > v2:
> > - Check the state of the NIC before calling disable nappi
> > and freeing the IRQ
> > - Prevent recurrence of TX timeout by turning off the carrier,
> > calling netif_carrier_off() in bnx2x_tx_timeout()
> > - Check and bail out early if fp->page_pool already freed
> >
> > ---
> > drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 2 ++
>
> > @@ -3095,14 +3097,8 @@ int bnx2x_nic_unload(struct bnx2x *bp, int unload_mode, bool keep_link)
> > if (!CHIP_IS_E1x(bp))
> > bnx2x_pf_disable(bp);
> >
> > - /* Disable HW interrupts, NAPI */
> > - bnx2x_netif_stop(bp, 1);
> > - /* Delete all NAPI objects */
> > - bnx2x_del_all_napi(bp);
> > - if (CNIC_LOADED(bp))
> > - bnx2x_del_all_napi_cnic(bp);
> > - /* Release IRQs */
> > - bnx2x_free_irq(bp);
>
> Could you split the change into two patches - one factoring out the
> code into bnx2x_stop_nic() and the other adding the nic_stopped
> variable? First one should be pure code refactoring with no functional
> changes. That'd make the reviewing process easier.
>
> > + /* Disable HW interrupts, delete NAPIs, Release IRQs */
> > + bnx2x_stop_nic(bp);
> >
> > /* Report UNLOAD_DONE to MCP */
> > bnx2x_send_unload_done(bp, false);
> > @@ -4987,6 +4983,12 @@ void bnx2x_tx_timeout(struct net_device *dev, unsigned int txqueue)
> > {
> > struct bnx2x *bp = netdev_priv(dev);
> >
> > + /* Immediately indicate link as down */
> > + bp->link_vars.link_up = 0;
> > + bp->force_link_down = true;
> > + netif_carrier_off(dev);
> > + BNX2X_ERR("Indicating link is down due to Tx-timeout\n");
>
> Is this code move to make the shutdown more immediate?
> That could also be a separate patch.
Note that the original code run under the rtnl lock and this is not
lockless, it that safe?
Cheers,
Paolo
Powered by blists - more mailing lists