lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 17 Nov 2023 10:31:59 -0800
From: Michael Chan <michael.chan@...adcom.com>
To: Thinh Tran <thinhtr@...ux.vnet.ibm.com>
Cc: mchan@...adcom.com, pavan.chebbi@...adcom.com, netdev@...r.kernel.org, 
	prashant@...adcom.com, siva.kallam@...adcom.com, drc@...ux.vnet.ibm.com, 
	venkata.sai.duggi@....com, edumazet@...gle.com, kuba@...nel.org, 
	pabeni@...hat.com, davem@...emloft.net
Subject: Re: [PATCH v3] net/tg3: fix race condition in tg3_reset_task()

On Fri, Nov 17, 2023 at 8:19 AM Thinh Tran <thinhtr@...ux.vnet.ibm.com> wrote:
>
>
> On 11/16/2023 3:34 PM, Michael Chan wrote:
> >
> > I think it will be better to add these 2 checks in tg3_reset_task().
> > We are already doing a similar check there for tp->pcierr_recovery so
> > it is better to centralize all the related checks in the same place.
> > I don't think tg3_dump_state() below will cause any problems.  We'll
> > probably read 0xffffffff for all registers and it will actually
> > confirm that the registers are not readable.
> >
>
> I'm concerned that race conditions could still occur during the handling
> of Partitionable Endpoint (PE) reset by the EEH driver. The issue lies
> in the dependency on the lower-level FW reset procedure. When the driver
> executes tg3_dump_state(), and then follows it with tg3_reset_task(),
> the EEH driver possibility changes in the PCI devices' state, but the
> MMIO and DMA reset procedures may not have completed yet. Leading to a
> crash in tg3_reset_task().  This patch tries to prevent that scenario.

It seems fragile if you are relying on such timing.  TG3_TX_TIMEOUT is
5 seconds but the actual time tg3_tx_timeout() is called varies
depending on when the TX queue is stopped.  So tg3_tx_timeout() will
be called 5 seconds or more after EEH if there are uncompleted TX
packets but we don't know precisely when.

>
> While tg3_dump_state() is helpful, it also poses issues as it takes a
> considerable amount of time, approximately 1 or 2 seconds per device.
> Given our 4-port adapter, this could extend to more than 10 seconds to
> write to the dmesg buffer. With the default size, the dmesg buffer may
> be over-written and not retain all useful information.
>

If tg3_dump_state() is not useful and fills the dmesg log with useless
data, we can add the same check in tg3_dump_state() and skip it.
tg3_dump_state() is also called by tg3_process_error() so we can avoid
dumping useless data if we hit tg3_process_error() during EEH or AER.

Thanks.

Download attachment "smime.p7s" of type "application/pkcs7-signature" (4209 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ