Message-ID: <CACKFLimyFoPS4H9j+L+p26uiPXCuz1rY5ihVKmJ++SgCE8i4fg@mail.gmail.com>
Date: Fri, 1 Dec 2023 08:50:47 -0800
From: Michael Chan <michael.chan@...adcom.com>
To: Thinh Tran <thinhtr@...ux.vnet.ibm.com>
Cc: davem@...emloft.net, drc@...ux.vnet.ibm.com, edumazet@...gle.com, 
	kuba@...nel.org, mchan@...adcom.com, netdev@...r.kernel.org, 
	pabeni@...hat.com, pavan.chebbi@...adcom.com, prashant@...adcom.com, 
	siva.kallam@...adcom.com, Venkata Sai Duggi <venkata.sai.duggi@....com>
Subject: Re: [PATCH v4] net/tg3: fix race condition in tg3_reset_task()

On Thu, Nov 30, 2023 at 4:19 PM Thinh Tran <thinhtr@...ux.vnet.ibm.com> wrote:
>
> When an EEH error is encountered by a PCI adapter, the EEH driver
> modifies the PCI channel's state as shown below:
>
>    enum {
>       /* I/O channel is in normal state */
>       pci_channel_io_normal = (__force pci_channel_state_t) 1,
>
>       /* I/O to channel is blocked */
>       pci_channel_io_frozen = (__force pci_channel_state_t) 2,
>
>       /* PCI card is dead */
>       pci_channel_io_perm_failure = (__force pci_channel_state_t) 3,
>    };
>
> If the same EEH error then causes the tg3 driver's transmit timeout
> logic to execute, the tg3_tx_timeout() function schedules a reset
> task via tg3_reset_task_schedule(), which may cause a race condition
> between the tg3 and EEH drivers, as both attempt to recover the HW
> via a reset action.
>
> EEH driver gets an error event
> --> eeh_set_channel_state()
>     and sets the device to one of
>     the error states above      scheduler: tg3_reset_task() gets
>                                 an error from tg3_init_hw()
>                              --> dev_close() shuts down the interface
> tg3_io_slot_reset() and
> tg3_io_resume() fail to
> reset/resume the device
>
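The interleaving in the diagram can be reduced to a deterministic, single-threaded model of the ordering problem. This is a hedged sketch, not tg3 or EEH code: `struct model`, `init_hw()`, `reset_task_unguarded()` and `eeh_recover()` are hypothetical simplifications. The assumptions are that hardware init fails while the channel is frozen, that a failed reset closes the interface (the dev_close() step above), and that the EEH resume path only restores a device whose interface is still up.

```c
#include <assert.h>
#include <stdbool.h>

enum channel_state { CH_NORMAL = 1, CH_FROZEN = 2, CH_PERM_FAILURE = 3 };

/* Hypothetical, heavily simplified device model. */
struct model {
	enum channel_state channel; /* set by the EEH side */
	bool if_running;            /* netif_running()-like flag */
};

/* Assumption: hardware init only succeeds when register I/O works. */
static bool init_hw(const struct model *m)
{
	return m->channel == CH_NORMAL;
}

/* Reset task WITHOUT a channel check: a failed init closes the
 * interface, like the dev_close() step in the diagram. */
void reset_task_unguarded(struct model *m)
{
	if (!init_hw(m))
		m->if_running = false;
}

/* EEH recovery: the slot reset brings the channel back to normal, then
 * resume only restores a device whose interface is still up. */
bool eeh_recover(struct model *m)
{
	/* Assumption: recovery is only attempted for a recoverable channel. */
	assert(m->channel != CH_PERM_FAILURE);
	m->channel = CH_NORMAL;
	return m->if_running;
}
```

Freezing the channel and then letting the reset task run first leaves `if_running` false, so the later `eeh_recover()` call returns false: the interface stays down even though the slot itself was recovered.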
> To resolve this issue, we avoid the race condition by checking the PCI
> channel state in the tg3_reset_task() function and skipping the
> tg3 driver-initiated reset when the PCI channel is not in the normal
> state. (The driver has no access to tg3 device registers at this point
> and cannot even complete the reset task successfully without external
> assistance.) We leave the reset procedure to the EEH driver, which
> calls the tg3_io_error_detected(), tg3_io_slot_reset() and
> tg3_io_resume() functions as appropriate.
>
> Add the same check to tg3_dump_state() to avoid dumping all device
> registers when the PCI channel is not in the normal state.
>
>
> Signed-off-by: Thinh Tran <thinhtr@...ux.vnet.ibm.com>
> Tested-by: Venkata Sai Duggi <venkata.sai.duggi@....com>
> Reviewed-by: David Christensen <drc@...ux.vnet.ibm.com>
>
>   v4: move the PCI error checking to tg3_reset_task() and
>       tg3_dump_state()
>   v3: re-post the patch.
>   v2: check for PCI errors in tg3_tx_timeout()

Thanks.
Reviewed-by: Michael Chan <michael.chan@...adcom.com>

