lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACKFLimX4Pjm89cneeTa36B519DN3mdXXo5FXfDFi6e0SBwUSA@mail.gmail.com>
Date: Thu, 2 Nov 2023 10:27:27 -0700
From: Michael Chan <michael.chan@...adcom.com>
To: Thinh Tran <thinhtr@...ux.vnet.ibm.com>
Cc: netdev@...r.kernel.org, siva.kallam@...adcom.com, prashant@...adcom.com, 
	mchan@...adcom.com, pavan.chebbi@...adcom.com, drc@...ux.vnet.ibm.com, 
	venkata.sai.duggi@....com
Subject: Re: [PATCH v2] net/tg3: fix race condition in tg3_reset_task()

On Thu, Nov 2, 2023 at 9:16 AM Thinh Tran <thinhtr@...ux.vnet.ibm.com> wrote:
>
> When an EEH error is encountered by a PCI adapter, the EEH driver
> modifies the PCI channel's state as shown below:
>
>    enum {
>       /* I/O channel is in normal state */
>       pci_channel_io_normal = (__force pci_channel_state_t) 1,
>
>       /* I/O to channel is blocked */
>       pci_channel_io_frozen = (__force pci_channel_state_t) 2,
>
>       /* PCI card is dead */
>       pci_channel_io_perm_failure = (__force pci_channel_state_t) 3,
>    };
>
> If the same EEH error then causes the tg3 driver's transmit timeout
> logic to execute, the tg3_tx_timeout() function schedules a reset
> task via tg3_reset_task_schedule(), which may cause a race condition
> between the tg3 and EEH driver as both attempt to recover the HW via
> a reset action.
>
> EEH driver gets error event
> --> eeh_set_channel_state()
>     and set device to one of
>     error state above           scheduler: tg3_reset_task() get
>                                 returned error from tg3_init_hw()
>                              --> dev_close() shuts down the interface
>
> tg3_io_slot_reset() and
> tg3_io_resume() fail to
> reset/resume the device
>
>
> To resolve this issue, we avoid the race condition by checking the PCI
> channel state in the tg3_tx_timeout() function and skip the tg3 driver
> initiated reset when the PCI channel is not in the normal state.  (The
> driver has no access to tg3 device registers at this point and cannot
> even complete the reset task successfully without external assistance.)
> We'll leave the reset procedure to be managed by the EEH driver which
> calls the tg3_io_error_detected(), tg3_io_slot_reset() and
> tg3_io_resume() functions as appropriate.

This scenario can affect other drivers too, right?  Shouldn't this be
handled in a higher layer before calling ->ndo_tx_timeout() so we
don't have to add this logic to all the other drivers?  Thanks.

Download attachment "smime.p7s" of type "application/pkcs7-signature" (4209 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ