netdev - Re: [PATCH v4] bnx2x: Fix error recovering in switch configuration

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <cd2b2456b1853d71b1c84c152164732f3a39f4dc.camel@redhat.com>
Date: Tue, 01 Aug 2023 10:30:52 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: Jakub Kicinski <kuba@...nel.org>, Thinh Tran <thinhtr@...ux.vnet.ibm.com>
Cc: aelior@...vell.com, davem@...emloft.net, edumazet@...gle.com, 
	manishc@...vell.com, netdev@...r.kernel.org, skalluru@...vell.com, 
	drc@...ux.vnet.ibm.com, abdhalee@...ibm.com, simon.horman@...igine.com
Subject: Re: [PATCH v4] bnx2x: Fix error recovering in switch configuration

On Mon, 2023-07-31 at 17:47 -0700, Jakub Kicinski wrote:
> On Fri, 28 Jul 2023 16:11:33 -0500 Thinh Tran wrote:
> > As the BCM57810 and other I/O adapters are connected
> > through a PCIe switch, the bnx2x driver causes unexpected
> > system hang/crash while handling PCIe switch errors, if
> > its error handler is called after other drivers' handlers.
> > 
> > In this case, after numbers of bnx2x_tx_timout(), the
> > bnx2x_nic_unload() is  called, frees up resources and
> > calls bnx2x_napi_disable(). Then when EEH calls its
> > error handler, the bnx2x_io_error_detected() and
> > bnx2x_io_slot_reset() also calling bnx2x_napi_disable()
> > and freeing the resources.
> > 
> > 
> > Signed-off-by: Thinh Tran <thinhtr@...ux.vnet.ibm.com>
> > Reviewed-by: Manish Chopra <manishc@...vell.com>
> > Tested-by: Abdul Haleem <abdhalee@...ibm.com>
> > Tested-by: David Christensen <drc@...ux.vnet.ibm.com>
> > 
> > Reviewed-by: Simon Horman <simon.horman@...igine.com>
> 
> nit: no empty lines between tags
> 
> There should be a "---" line between the tags and changelog.
> 
> >   v4:
> >    - factoring common code into new function bnx2x_stop_nic()
> >      that disables and releases IRQs and NAPIs 
> >   v3:
> >     - no changes, just repatched to the latest driver level
> >     - updated the reviewed-by Manish in October, 2022
> > 
> >   v2:
> >    - Check the state of the NIC before calling disable nappi
> >      and freeing the IRQ
> >    - Prevent recurrence of TX timeout by turning off the carrier,
> >      calling netif_carrier_off() in bnx2x_tx_timeout()
> >    - Check and bail out early if fp->page_pool already freed
> > 
> > ---
> >  drivers/net/ethernet/broadcom/bnx2x/bnx2x.h   |  2 ++
> 
> > @@ -3095,14 +3097,8 @@ int bnx2x_nic_unload(struct bnx2x *bp, int unload_mode, bool keep_link)
> >  		if (!CHIP_IS_E1x(bp))
> >  			bnx2x_pf_disable(bp);
> >  
> > -		/* Disable HW interrupts, NAPI */
> > -		bnx2x_netif_stop(bp, 1);
> > -		/* Delete all NAPI objects */
> > -		bnx2x_del_all_napi(bp);
> > -		if (CNIC_LOADED(bp))
> > -			bnx2x_del_all_napi_cnic(bp);
> > -		/* Release IRQs */
> > -		bnx2x_free_irq(bp);
> 
> Could you split the change into two patches - one factoring out the
> code into bnx2x_stop_nic() and the other adding the nic_stopped
> variable? First one should be pure code refactoring with no functional
> changes. That'd make the reviewing process easier.
> 
> > +		/* Disable HW interrupts, delete NAPIs, Release IRQs */
> > +		bnx2x_stop_nic(bp);
> >  
> >  		/* Report UNLOAD_DONE to MCP */
> >  		bnx2x_send_unload_done(bp, false);
> > @@ -4987,6 +4983,12 @@ void bnx2x_tx_timeout(struct net_device *dev, unsigned int txqueue)
> >  {
> >  	struct bnx2x *bp = netdev_priv(dev);
> >  
> > +	/* Immediately indicate link as down */
> > +	bp->link_vars.link_up = 0;
> > +	bp->force_link_down = true;
> > +	netif_carrier_off(dev);
> > +	BNX2X_ERR("Indicating link is down due to Tx-timeout\n");
> 
> Is this code move to make the shutdown more immediate?
> That could also be a separate patch.

Note that the original code run under the rtnl lock and this is not
lockless, it that safe?

Cheers,

Paolo