[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACKFLimX4Pjm89cneeTa36B519DN3mdXXo5FXfDFi6e0SBwUSA@mail.gmail.com>
Date: Thu, 2 Nov 2023 10:27:27 -0700
From: Michael Chan <michael.chan@...adcom.com>
To: Thinh Tran <thinhtr@...ux.vnet.ibm.com>
Cc: netdev@...r.kernel.org, siva.kallam@...adcom.com, prashant@...adcom.com,
mchan@...adcom.com, pavan.chebbi@...adcom.com, drc@...ux.vnet.ibm.com,
venkata.sai.duggi@....com
Subject: Re: [PATCH v2] net/tg3: fix race condition in tg3_reset_task()
On Thu, Nov 2, 2023 at 9:16 AM Thinh Tran <thinhtr@...ux.vnet.ibm.com> wrote:
>
> When an EEH error is encountered by a PCI adapter, the EEH driver
> modifies the PCI channel's state as shown below:
>
> enum {
> /* I/O channel is in normal state */
> pci_channel_io_normal = (__force pci_channel_state_t) 1,
>
> /* I/O to channel is blocked */
> pci_channel_io_frozen = (__force pci_channel_state_t) 2,
>
> /* PCI card is dead */
> pci_channel_io_perm_failure = (__force pci_channel_state_t) 3,
> };
>
> If the same EEH error then causes the tg3 driver's transmit timeout
> logic to execute, the tg3_tx_timeout() function schedules a reset
> task via tg3_reset_task_schedule(), which may cause a race condition
> between the tg3 and EEH driver as both attempt to recover the HW via
> a reset action.
>
> EEH driver gets error event
> --> eeh_set_channel_state()
> and set device to one of
> error state above scheduler: tg3_reset_task() get
> returned error from tg3_init_hw()
> --> dev_close() shuts down the interface
>
> tg3_io_slot_reset() and
> tg3_io_resume() fail to
> reset/resume the device
>
>
> To resolve this issue, we avoid the race condition by checking the PCI
> channel state in the tg3_tx_timeout() function and skip the tg3 driver
> initiated reset when the PCI channel is not in the normal state. (The
> driver has no access to tg3 device registers at this point and cannot
> even complete the reset task successfully without external assistance.)
> We'll leave the reset procedure to be managed by the EEH driver which
> calls the tg3_io_error_detected(), tg3_io_slot_reset() and
> tg3_io_resume() functions as appropriate.
This scenario can affect other drivers too, right? Shouldn't this be
handled in a higher layer before calling ->ndo_tx_timeout() so we
don't have to add this logic to all the other drivers? Thanks.
Download attachment "smime.p7s" of type "application/pkcs7-signature" (4209 bytes)
Powered by blists - more mailing lists