linux-kernel - Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <5ae14b67-94a5-6d2f-b74d-ca32bbd079cd@kernel.org>
Date:   Mon, 12 Oct 2020 10:50:29 -0400
From:   Sinan Kaya <okaya@...nel.org>
To:     sathyanarayanan.nkuppuswamy@...il.com, bhelgaas@...gle.com
Cc:     linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
        ashok.raj@...el.com, sathyanarayanan.kuppuswamy@...ux.intel.com
Subject: Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error
 recovery handling

On 10/12/2020 1:03 AM, sathyanarayanan.nkuppuswamy@...il.com wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com>
> 
> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> merged fatal and non-fatal error recovery paths, and also made
> recovery code depend on hotplug handler for "remove affected
> device + rescan" support. But this change also complicated the
> error recovery path and which in turn led to the following
> issues.
> 
> 1. We depend on hotplug handler for removing the affected
> devices/drivers on DLLSC LINK down event (on DPC event
> trigger) and DPC handler for handling the error recovery. Since
> both handlers operate on same set of affected devices, it leads
> to race condition, which in turn leads to  NULL pointer
> exceptions or error recovery failures.You can find more details
> about this issue in following link.
> 
> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.zhao@intel.com/T/#t
> 
> 2. For non-hotplug capable devices fatal (DPC) error recovery
> is currently broken. Current fatal error recovery implementation
> relies on PCIe hotplug (pciehp) handler for detaching and
> re-enumerating the affected devices/drivers. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly. You can find more details about
> this issue in the following links.
> 
> https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@nxp.com/
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/
> 
> In order to fix the above two issues, we should stop relying on hotplug
> handler for cleaning the affected devices/drivers and let error recovery
> handler own this functionality. So this patch reverts Commit bdb5ac85777d
> ("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove
> affected device + rescan"  functionality in fatal error recovery handler.
> 
> Also holding pci_lock_rescan_remove() will prevent the race between hotplug
> and DPC handler.
> 
> Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com>
> ---
>  Documentation/PCI/pci-error-recovery.rst | 47 ++++++++++------
>  drivers/pci/pcie/err.c                   | 71 +++++++++++++++++++-----
>  2 files changed, 87 insertions(+), 31 deletions(-)

I'm not sure about locks involved but I do like the concept.
Current fatal error handling is best effort.

There is no way to recover if link is down by the time we
reach to fatal error handling routine.

This change will make the solution more reliable.

Reviewed-by: Sinan Kaya <okaya@...nel.org>