netdev - RE: [PATCH net] bnx2x: don't reset chip on cleanup if PCI function is offline

Message-ID: <CO2PR11MB008852D899DEF786AD7CB48C971D0@CO2PR11MB0088.namprd11.prod.outlook.com>
Date:	Wed, 10 Aug 2016 07:59:55 +0000
From:	Yuval Mintz <Yuval.Mintz@...gic.com>
To:	"Guilherme G. Piccoli" <gpiccoli@...ux.vnet.ibm.com>
CC:	netdev <netdev@...r.kernel.org>,
	Ariel Elior <Ariel.Elior@...gic.com>
Subject: RE: [PATCH net] bnx2x: don't reset chip on cleanup if PCI function is
 offline

> > Why would the published resume() from pci_error_handlers be called in this
> > scenario?
> 
> It isn't. That's why I specifically noted in the commit message: "There are two
> cases though that another path is taken on the code".
> 
> The code path reaches bnx2x_chip_cleanup() on device removal from the system,
> as seen in the call trace below:
> 
> bnx2x_chip_cleanup+0x3c0/0x910 [bnx2x]
> bnx2x_nic_unload+0x268/0xaf0 [bnx2x]
> bnx2x_close+0x34/0x50 [bnx2x]
> __dev_close_many+0xd4/0x150
> dev_close_many+0xa8/0x160
> rollback_registered_many+0x174/0x3f0
> rollback_registered+0x40/0x70
> unregister_netdevice_queue+0x98/0x110
> unregister_netdev+0x34/0x50
> __bnx2x_remove+0xa8/0x3a0 [bnx2x]
> pci_device_remove+0x70/0x110

Makes sense.

> >> Also, we avoid the MCP information dump in case of non-recoverable
> >> PCI error (when adapter is about to be removed), since it will certainly fail.
> >
> > We should probably avoid several things here; Why specifically only this?
> 
> For example, we shouldn't execute bnx2x_timer() in this scenario. But I thought
> it'd be too much to check every call of a timer function against PCI channel state
> just to avoid its execution in this scenario, so I just let it execute, since it seems
> harmless.
> 
> >> +	/* Reset the chip, unless PCI function is offline. If we reach this
> >> +	 * point following a PCI error handling, it means device is really
> >> +	 * in a bad state and we're about to remove it, so reset the chip
> >> +	 * is not a good idea.
> >> +	 */
> >> +	if (!pci_channel_offline(bp->pdev)) {
> >> +		rc = bnx2x_reset_hw(bp, reset_code);
> >> +		if (rc)
> >> +			BNX2X_ERR("HW_RESET failed\n");
> >> +	}
> >
> > Why not simply check this at the beginning of the function?
> 
> Because I wasn't sure if I could drop the entire execution of chip_cleanup(). I
> tried to keep most of this function, aiming to shut down the module in a
> gentle way, like cleaning MAC, stopping queues... but again, I'm open to
> suggestions and will gladly change this in v2 if you think it's for the best.

The problem is that I won't be able to review this more thoroughly in the next
couple of days, and other than code review I won't have a reasonable way
of testing it [I can use aer_inject, but I don't have your magical EEH
error injections, and I'm not at all certain it would suffice for good testing].

I agree that even as-is, what you're suggesting is an improvement to the
existing flow, so it's basically up to Dave, i.e., whether to take a half fix
or wait for a more thorough one.
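
To make the two options discussed above concrete (guarding only the chip reset,
as in the patch, versus checking at the beginning of the function), here is a
minimal user-space sketch. Everything in it is a hypothetical stand-in: the
mock_* types and functions are not the bnx2x or PCI core code, and the counters
merely record which steps would have run. mock_channel_offline() is modeled
loosely on the idea behind pci_channel_offline(), i.e. "the channel state is
anything other than normal". The sketch also assumes the software teardown
steps are safe on an offline function, which is exactly the point that would
need review in the real driver.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the PCI channel state; not the kernel's enum. */
enum mock_channel_state {
	MOCK_CHANNEL_NORMAL,
	MOCK_CHANNEL_FROZEN,
	MOCK_CHANNEL_PERM_FAILURE,
};

/* Hypothetical device; the counters record which cleanup steps ran. */
struct mock_dev {
	enum mock_channel_state state;
	int macs_cleaned;
	int queues_stopped;
	int hw_resets;
};

/* Offline means any state other than normal (assumption of this sketch). */
static bool mock_channel_offline(const struct mock_dev *dev)
{
	return dev->state != MOCK_CHANNEL_NORMAL;
}

/* Option A (shape of the patch): always do the gentle teardown,
 * but skip the hardware reset when the function is offline.
 */
static void mock_cleanup_guard_reset(struct mock_dev *dev)
{
	dev->macs_cleaned++;
	dev->queues_stopped++;

	if (!mock_channel_offline(dev))
		dev->hw_resets++;
}

/* Option B (the review question): bail out before doing anything
 * when the function is offline.
 */
static void mock_cleanup_early_return(struct mock_dev *dev)
{
	if (mock_channel_offline(dev))
		return;

	dev->macs_cleaned++;
	dev->queues_stopped++;
	dev->hw_resets++;
}
```

On a normal channel the two variants behave the same. They differ only when
the channel is offline: option A still runs the MAC and queue teardown, while
option B skips the whole function.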