Message-ID: <CACKFLimyFoPS4H9j+L+p26uiPXCuz1rY5ihVKmJ++SgCE8i4fg@mail.gmail.com>
Date: Fri, 1 Dec 2023 08:50:47 -0800
From: Michael Chan <michael.chan@...adcom.com>
To: Thinh Tran <thinhtr@...ux.vnet.ibm.com>
Cc: davem@...emloft.net, drc@...ux.vnet.ibm.com, edumazet@...gle.com,
kuba@...nel.org, mchan@...adcom.com, netdev@...r.kernel.org,
pabeni@...hat.com, pavan.chebbi@...adcom.com, prashant@...adcom.com,
siva.kallam@...adcom.com, Venkata Sai Duggi <venkata.sai.duggi@....com>
Subject: Re: [PATCH v4] net/tg3: fix race condition in tg3_reset_task()
On Thu, Nov 30, 2023 at 4:19 PM Thinh Tran <thinhtr@...ux.vnet.ibm.com> wrote:
>
> When an EEH error is encountered by a PCI adapter, the EEH driver
> modifies the PCI channel's state as shown below:
>
> enum {
> 	/* I/O channel is in normal state */
> 	pci_channel_io_normal = (__force pci_channel_state_t) 1,
>
> 	/* I/O to channel is blocked */
> 	pci_channel_io_frozen = (__force pci_channel_state_t) 2,
>
> 	/* PCI card is dead */
> 	pci_channel_io_perm_failure = (__force pci_channel_state_t) 3,
> };
>
> If the same EEH error then causes the tg3 driver's transmit timeout
> logic to execute, the tg3_tx_timeout() function schedules a reset
> task via tg3_reset_task_schedule(), which may cause a race condition
> between the tg3 and EEH drivers as both attempt to recover the HW via
> a reset action.
>
> EEH driver gets error event
> --> eeh_set_channel_state()
>     and set device to one of
>     error state above           scheduler: tg3_reset_task() get
>                                 returned error from tg3_init_hw()
>                              --> dev_close() shuts down the interface
> tg3_io_slot_reset() and
> tg3_io_resume() fail to
> reset/resume the device
>
> To resolve this issue, we avoid the race condition by checking the PCI
> channel state in the tg3_reset_task() function and skipping the tg3
> driver initiated reset when the PCI channel is not in the normal state.
> (The driver has no access to tg3 device registers at this point and
> cannot even complete the reset task successfully without external
> assistance.) We leave the reset procedure to be managed by the EEH
> driver, which calls the tg3_io_error_detected(), tg3_io_slot_reset()
> and tg3_io_resume() functions as appropriate.
>
> Add the same check to tg3_dump_state() to avoid dumping all
> device registers when the PCI channel is not in the normal state.
>
>
> Signed-off-by: Thinh Tran <thinhtr@...ux.vnet.ibm.com>
> Tested-by: Venkata Sai Duggi <venkata.sai.duggi@....com>
> Reviewed-by: David Christensen <drc@...ux.vnet.ibm.com>
>
> v4: moved the PCI error check to tg3_reset_task() and
>     tg3_dump_state()
> v3: reposted the patch
> v2: checked for PCI errors in tg3_tx_timeout()
Thanks.
Reviewed-by: Michael Chan <michael.chan@...adcom.com>
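
For illustration only, the check described in the quoted commit message can be modeled in user space roughly as follows. This is a minimal sketch under stated assumptions, not the patch itself: every sketch_* name is hypothetical, the struct is a reduced stand-in for the driver's private state, and the counters merely record whether the driver would have touched the hardware. The only thing taken from the message is the rule "skip the driver-initiated reset and the register dump unless the PCI channel is in the normal state".

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the pci_channel_state_t values quoted in the message. */
enum sketch_channel_state {
	SKETCH_IO_NORMAL       = 1,	/* I/O channel is in normal state */
	SKETCH_IO_FROZEN       = 2,	/* I/O to channel is blocked */
	SKETCH_IO_PERM_FAILURE = 3,	/* PCI card is dead */
};

/* Hypothetical, reduced view of the state the driver would consult. */
struct sketch_dev {
	enum sketch_channel_state error_state;
	int resets_attempted;	/* counts driver-initiated resets */
	int dumps_performed;	/* counts register dumps */
};

static bool sketch_channel_is_normal(const struct sketch_dev *d)
{
	return d->error_state == SKETCH_IO_NORMAL;
}

/*
 * Models the guard described for tg3_reset_task(): when the channel is
 * frozen or dead, do nothing and leave recovery to the EEH callbacks.
 * Returns true only if a driver-initiated reset was attempted.
 */
static bool sketch_reset_task(struct sketch_dev *d)
{
	if (!sketch_channel_is_normal(d))
		return false;

	d->resets_attempted++;
	return true;
}

/* Models the same guard described for tg3_dump_state(). */
static void sketch_dump_state(struct sketch_dev *d)
{
	if (!sketch_channel_is_normal(d))
		return;

	d->dumps_performed++;
}

/* Usage: a frozen channel suppresses both actions; a normal one does not. */
static void sketch_usage(void)
{
	struct sketch_dev d = { SKETCH_IO_FROZEN, 0, 0 };

	assert(!sketch_reset_task(&d));
	sketch_dump_state(&d);
	assert(d.resets_attempted == 0 && d.dumps_performed == 0);

	d.error_state = SKETCH_IO_NORMAL;
	assert(sketch_reset_task(&d));
	sketch_dump_state(&d);
	assert(d.resets_attempted == 1 && d.dumps_performed == 1);
}
```

The early return keeps the two recovery paths from overlapping: in the model, as in the description, nothing on the driver side runs while the channel is in an error state.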