[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230731174716.0898ff62@kernel.org>
Date: Mon, 31 Jul 2023 17:47:16 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Thinh Tran <thinhtr@...ux.vnet.ibm.com>
Cc: aelior@...vell.com, davem@...emloft.net, edumazet@...gle.com,
manishc@...vell.com, netdev@...r.kernel.org, pabeni@...hat.com,
skalluru@...vell.com, drc@...ux.vnet.ibm.com, abdhalee@...ibm.com,
simon.horman@...igine.com
Subject: Re: [PATCH v4] bnx2x: Fix error recovering in switch configuration
On Fri, 28 Jul 2023 16:11:33 -0500 Thinh Tran wrote:
> As the BCM57810 and other I/O adapters are connected
> through a PCIe switch, the bnx2x driver causes unexpected
> system hang/crash while handling PCIe switch errors, if
> its error handler is called after other drivers' handlers.
>
> In this case, after numbers of bnx2x_tx_timout(), the
> bnx2x_nic_unload() is called, frees up resources and
> calls bnx2x_napi_disable(). Then when EEH calls its
> error handler, the bnx2x_io_error_detected() and
> bnx2x_io_slot_reset() also calling bnx2x_napi_disable()
> and freeing the resources.
>
>
> Signed-off-by: Thinh Tran <thinhtr@...ux.vnet.ibm.com>
> Reviewed-by: Manish Chopra <manishc@...vell.com>
> Tested-by: Abdul Haleem <abdhalee@...ibm.com>
> Tested-by: David Christensen <drc@...ux.vnet.ibm.com>
>
> Reviewed-by: Simon Horman <simon.horman@...igine.com>
nit: no empty lines between tags
There should be a "---" line between the tags and changelog.
> v4:
> - factoring common code into new function bnx2x_stop_nic()
> that disables and releases IRQs and NAPIs
> v3:
> - no changes, just repatched to the latest driver level
> - updated the reviewed-by Manish in October, 2022
>
> v2:
> - Check the state of the NIC before calling disable nappi
> and freeing the IRQ
> - Prevent recurrence of TX timeout by turning off the carrier,
> calling netif_carrier_off() in bnx2x_tx_timeout()
> - Check and bail out early if fp->page_pool already freed
>
> ---
> drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 2 ++
> @@ -3095,14 +3097,8 @@ int bnx2x_nic_unload(struct bnx2x *bp, int unload_mode, bool keep_link)
> if (!CHIP_IS_E1x(bp))
> bnx2x_pf_disable(bp);
>
> - /* Disable HW interrupts, NAPI */
> - bnx2x_netif_stop(bp, 1);
> - /* Delete all NAPI objects */
> - bnx2x_del_all_napi(bp);
> - if (CNIC_LOADED(bp))
> - bnx2x_del_all_napi_cnic(bp);
> - /* Release IRQs */
> - bnx2x_free_irq(bp);
Could you split the change into two patches - one factoring out the
code into bnx2x_stop_nic() and the other adding the nic_stopped
variable? First one should be pure code refactoring with no functional
changes. That'd make the reviewing process easier.
> + /* Disable HW interrupts, delete NAPIs, Release IRQs */
> + bnx2x_stop_nic(bp);
>
> /* Report UNLOAD_DONE to MCP */
> bnx2x_send_unload_done(bp, false);
> @@ -4987,6 +4983,12 @@ void bnx2x_tx_timeout(struct net_device *dev, unsigned int txqueue)
> {
> struct bnx2x *bp = netdev_priv(dev);
>
> + /* Immediately indicate link as down */
> + bp->link_vars.link_up = 0;
> + bp->force_link_down = true;
> + netif_carrier_off(dev);
> + BNX2X_ERR("Indicating link is down due to Tx-timeout\n");
Is this code move to make the shutdown more immediate?
That could also be a separate patch.
> /* We want the information of the dump logged,
> * but calling bnx2x_panic() would kill all chances of recovery.
> */
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
> index d8b1824c334d..f5ecbe8d604a 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
> @@ -1015,6 +1015,9 @@ static inline void bnx2x_free_rx_sge_range(struct bnx2x *bp,
> {
> int i;
>
> + if (!fp->page_pool.page)
> + return;
> +
> if (fp->mode == TPA_MODE_DISABLED)
> return;
>
> @@ -1399,5 +1402,20 @@ void bnx2x_set_os_driver_state(struct bnx2x *bp, u32 state);
> */
> int bnx2x_nvram_read(struct bnx2x *bp, u32 offset, u8 *ret_buf,
> int buf_size);
> +static inline void bnx2x_stop_nic(struct bnx2x *bp)
can't it live in bnx2x_cmn.c ? Why make it a static inline?
> +{
> + if (!bp->nic_stopped) {
> + /* Disable HW interrupts, NAPI */
> + bnx2x_netif_stop(bp, 1);
> + /* Delete all NAPI objects */
> + bnx2x_del_all_napi(bp);
> + if (CNIC_LOADED(bp))
> + bnx2x_del_all_napi_cnic(bp);
> + /* Release IRQs */
> + bnx2x_free_irq(bp);
> + bp->nic_stopped = true;
> + }
> +}
> +
>
nit: double new line
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
> @@ -529,13 +529,8 @@ void bnx2x_vfpf_close_vf(struct bnx2x *bp)
> bnx2x_vfpf_finalize(bp, &req->first_tlv);
>
> free_irq:
> - /* Disable HW interrupts, NAPI */
> - bnx2x_netif_stop(bp, 0);
This used to say
bnx2x_netif_stop(bp, 0);
but bnx2x_stop_nic() will do:
bnx2x_netif_stop(bp, 1);
is it okay to shut down the HW here ? (whatever that entails)
> - /* Delete all NAPI objects */
> - bnx2x_del_all_napi(bp);
> -
> - /* Release IRQs */
> - bnx2x_free_irq(bp);
> + /* Disable HW interrupts, delete NAPIs, Release IRQs */
> + bnx2x_stop_nic(bp);
--
pw-bot: cr
Powered by blists - more mailing lists