netdev - Re: [PATCH v3] net/tg3: fix race condition in tg3_reset

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <cf143e97-5303-42e3-8a27-3226a6bf7c9b@linux.vnet.ibm.com>
Date: Fri, 17 Nov 2023 10:19:34 -0600
From: Thinh Tran <thinhtr@...ux.vnet.ibm.com>
To: Michael Chan <michael.chan@...adcom.com>
Cc: mchan@...adcom.com, pavan.chebbi@...adcom.com, netdev@...r.kernel.org,
        prashant@...adcom.com, siva.kallam@...adcom.com,
        drc@...ux.vnet.ibm.com, venkata.sai.duggi@....com, edumazet@...gle.com,
        kuba@...nel.org, pabeni@...hat.com, davem@...emloft.net
Subject: Re: [PATCH v3] net/tg3: fix race condition in tg3_reset_task()

On 11/16/2023 3:34 PM, Michael Chan wrote:
> 
> I think it will be better to add these 2 checks in tg3_reset_task().
> We are already doing a similar check there for tp->pcierr_recovery so
> it is better to centralize all the related checks in the same place.
> I don't think tg3_dump_state() below will cause any problems.  We'll
> probably read 0xffffffff for all registers and it will actually
> confirm that the registers are not readable.
> 

I'm concerned that race conditions could still occur during the handling 
of Partitionable Endpoint (PE) reset by the EEH driver. The issue lies 
in the dependency on the lower-level FW reset procedure. When the driver 
executes tg3_dump_state(), and then follows it with tg3_reset_task(), 
the EEH driver possibility changes in the PCI devices' state, but the 
MMIO and DMA reset procedures may not have completed yet. Leading to a 
crash in tg3_reset_task().  This patch tries to prevent that scenario.

While tg3_dump_state() is helpful, it also poses issues as it takes a 
considerable amount of time, approximately 1 or 2 seconds per device. 
Given our 4-port adapter, this could extend to more than 10 seconds to 
write to the dmesg buffer. With the default size, the dmesg buffer may 
be over-written and not retain all useful information.

Thanks,
Thinh Tran