Message-Id: <20161003.232911.145888579502087608.davem@davemloft.net>
Date: Mon, 03 Oct 2016 23:29:11 -0400 (EDT)
From: David Miller <davem@...emloft.net>
To: jeffrey.t.kirsher@...el.com
Cc: gpiccoli@...ux.vnet.ibm.com, netdev@...r.kernel.org,
nhorman@...hat.com, sassmann@...hat.com, jogreene@...hat.com,
guru.anbalagane@...cle.com, stable@...r.kernel.org
Subject: Re: [net-next] i40e: avoid NULL pointer dereference and recursive
errors on early PCI error
From: Jeff Kirsher <jeffrey.t.kirsher@...el.com>
Date: Mon, 3 Oct 2016 00:31:12 -0700
> From: Guilherme G Piccoli <gpiccoli@...ux.vnet.ibm.com>
>
> Although rare, it's possible to hit a PCI error early in device
> probe, when some structs may be only partially initialized and
> others not initialized at all, leading to a NULL pointer
> dereference.
>
> The i40e driver currently misbehaves if the device hits such an
> early PCI error: first, the struct i40e_pf might not be attached
> to the pci_dev yet, leading to a NULL pointer dereference on
> access to pf->state.
>
> Even checking whether the struct is NULL and skipping the access in
> that case isn't enough, since the driver cannot recover from a PCI
> error that early; in our experiments we saw multiple failures in the
> kernel log, like:
>
> [549.664] i40e 0007:01:00.1: Initial pf_reset failed: -15
> [549.664] i40e: probe of 0007:01:00.1 failed with error -15
> [...]
> [871.644] i40e 0007:01:00.1: The driver for the device stopped because the
> device firmware failed to init. Try updating your NVM image.
> [871.644] i40e: probe of 0007:01:00.1 failed with error -32
> [...]
> [872.516] i40e 0007:01:00.0: ARQ: Unknown event 0x0000 ignored
>
> Between the first probe failure (error -15) and the second (error -32),
> another PCI error happened due to the first bad probe. The driver also
> started to flood the console with those ARQ event messages.
>
> This patch prevents these issues by letting the error recovery
> mechanism remove the failed device from the system instead of
> trying to recover from early PCI errors during device probe.
>
> CC: <stable@...r.kernel.org>
> Signed-off-by: Guilherme G Piccoli <gpiccoli@...ux.vnet.ibm.com>
> Acked-by: Jacob Keller <jacob.e.keller@...el.com>
> Tested-by: Andrew Bowers <andrewx.bowers@...el.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@...el.com>
Applied.
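
For readers without the diff at hand: the message above quotes only the
commit description, not the code. The following is a minimal user-space
sketch of the idea it describes, namely that an error handler which finds
no driver-private struct attached to the device reports "disconnect"
instead of attempting recovery. Every name here (mock_pci_dev, mock_pf,
mock_error_detected, the ERS_RESULT_* values, MOCK_STATE_SUSPENDED) is a
hypothetical stand-in for illustration, not the kernel's or the i40e
driver's actual API, and what the handler does on the normal path is an
assumption.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's recovery results. */
enum ers_result {
    ERS_RESULT_NEED_RESET,  /* ask the error recovery core for a reset */
    ERS_RESULT_DISCONNECT   /* give up; let the core remove the device */
};

/* Models the driver-private struct (the role struct i40e_pf plays). */
struct mock_pf {
    unsigned long state;    /* models pf->state */
};

/* Models a PCI device whose private data is attached during probe. */
struct mock_pci_dev {
    struct mock_pf *drvdata;  /* NULL until probe attaches the pf */
};

#define MOCK_STATE_SUSPENDED 0x1UL  /* illustrative flag, an assumption */

/*
 * Sketch of an "error detected" callback. If the error arrives before
 * probe has attached the private struct, touching pf->state would be a
 * NULL pointer dereference, so bail out and request removal instead.
 */
enum ers_result mock_error_detected(struct mock_pci_dev *pdev)
{
    struct mock_pf *pf = pdev->drvdata;

    if (pf == NULL) {
        fprintf(stderr, "cannot recover: error happened during probe\n");
        return ERS_RESULT_DISCONNECT;
    }

    /* pf is valid here, so accessing pf->state is safe. */
    pf->state |= MOCK_STATE_SUSPENDED;
    return ERS_RESULT_NEED_RESET;
}
```

Calling mock_error_detected() on a device whose drvdata is still NULL
returns ERS_RESULT_DISCONNECT without touching any state; on a device
with a pf attached it marks the state and returns ERS_RESULT_NEED_RESET.
Returning "disconnect" rather than merely skipping the access matches the
commit description's point that a NULL check alone is not enough, since
recovery that early fails anyway.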