lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <edafdffb-c1d4-4461-abb6-595624dd7710@linux.vnet.ibm.com> Date: Thu, 30 Nov 2023 16:29:34 -0600 From: Thinh Tran <thinhtr@...ux.vnet.ibm.com> To: Michael Chan <michael.chan@...adcom.com> Cc: mchan@...adcom.com, pavan.chebbi@...adcom.com, netdev@...r.kernel.org, prashant@...adcom.com, siva.kallam@...adcom.com, drc@...ux.vnet.ibm.com, venkata.sai.duggi@....com, edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com, davem@...emloft.net Subject: Re: [PATCH v3] net/tg3: fix race condition in tg3_reset_task() On 11/17/2023 12:31 PM, Michael Chan wrote: > On Fri, Nov 17, 2023 at 8:19 AM Thinh Tran <thinhtr@...ux.vnet.ibm.com> wrote: >> >> >> On 11/16/2023 3:34 PM, Michael Chan wrote: >>> >>> I think it will be better to add these 2 checks in tg3_reset_task(). >>> We are already doing a similar check there for tp->pcierr_recovery so >>> it is better to centralize all the related checks in the same place. >>> I don't think tg3_dump_state() below will cause any problems. We'll >>> probably read 0xffffffff for all registers and it will actually >>> confirm that the registers are not readable. >>> >> >> I'm concerned that race conditions could still occur during the handling >> of Partitionable Endpoint (PE) reset by the EEH driver. The issue lies >> in the dependency on the lower-level FW reset procedure. When the driver >> executes tg3_dump_state(), and then follows it with tg3_reset_task(), >> the EEH driver possibility changes in the PCI devices' state, but the >> MMIO and DMA reset procedures may not have completed yet. Leading to a >> crash in tg3_reset_task(). This patch tries to prevent that scenario. > > It seems fragile if you are relying on such timing. TG3_TX_TIMEOUT is > 5 seconds but the actual time tg3_tx_timeout() is called varies > depending on when the TX queue is stopped. So tg3_tx_timeout() will > be called 5 seconds or more after EEH if there are uncompleted TX > packets but we don't know precisely when. > >> >> While tg3_dump_state() is helpful, it also poses issues as it takes a >> considerable amount of time, approximately 1 or 2 seconds per device. >> Given our 4-port adapter, this could extend to more than 10 seconds to >> write to the dmesg buffer. With the default size, the dmesg buffer may >> be over-written and not retain all useful information. >> > > If tg3_dump_state() is not useful and fills the dmesg log with useless > data, we can add the same check in tg3_dump_state() and skip it. > tg3_dump_state() is also called by tg3_process_error() so we can avoid > dumping useless data if we hit tg3_process_error() during EEH or AER. > > Thanks. I implemented the fix as you suggested and passed the tests. I will send the next version of patch shortly. Thanks.
Powered by blists - more mailing lists