Message-ID: <df85813c9860463d85f6c302dfe07b12@ausx13mps321.AMER.DELL.COM>
Date:   Mon, 12 Nov 2018 20:05:41 +0000
From:   <Alex_Gagniuc@...lteam.com>
To:     <oohall@...il.com>, <gregkh@...uxfoundation.org>
Cc:     <keith.busch@...el.com>, <helgaas@...nel.org>,
        <mr.nuke.me@...il.com>, <linux-pci@...r.kernel.org>,
        <Austin.Bolen@...l.com>, <Shyam.Iyer@...l.com>,
        <linux-kernel@...r.kernel.org>, <jonathan.derrick@...el.com>,
        <lukas@...ner.de>, <ruscur@...sell.cc>, <sbobroff@...ux.ibm.com>,
        <linuxppc-dev@...ts.ozlabs.org>
Subject: Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is
 disconnected
On 11/11/2018 11:50 PM, Oliver O'Halloran wrote:
> On Thu, 2018-11-08 at 23:06 +0000, Alex_Gagniuc@...lteam.com wrote:
>> On 11/08/2018 04:51 PM, Greg KH wrote:
>>> On Thu, Nov 08, 2018 at 10:49:08PM +0000, Alex_Gagniuc@...lteam.com wrote:
>>>> In the case that we're trying to fix, this code executing is a result of
>>>> the device being gone, so we can guarantee race-free operation. I agree
>>>> that there is a race, in the general case. As far as checking the result
>>>> for all F's, that's not an option when firmware crashes the system as a
>>>> result of the mmio read/write. It's never pretty when firmware gets
>>>> involved.
>>>
>>> If you have firmware that crashes the system when you try to read from a
>>> PCI device that was hot-removed, that is broken firmware and needs to be
>>> fixed.  The kernel can not work around that as again, you will never win
>>> that race.
>>
>> But it's not the firmware that crashes. It's Linux, as a result of a
>> fatal error message from the firmware. And we can't fix that because FFS
>> handling requires that the system reboots [1].
> 
> Do we know the exact circumstances that result in firmware requesting a
> reboot? If it happens on any PCIe error I don't see what we can do to
> prevent that beyond masking UEs entirely (are we even allowed to do
> that on FFS systems?).

Pull a drive out at an angle, push two drives in at the same time, pull
out a drive really slowly. Whether an error is even reported to the OS
depends on PD state, and on proprietary mechanisms and logic in the HW
and FW. The OS is not supposed to mask errors (touch AER bits) on FFS.

Sadly, with FFS, behavior can and does change from BIOS version to BIOS
version. On one product, for example, we eliminated a lot of crashes by
simply not reporting some classes of PCIe errors to the OS.

Alex

>> If we're going to say that we don't want to support FFS because it's a
>> separate code path, and different flow, that's fine. I am, myself, not a
>> fan of FFS. But if we're going to continue supporting it, I think we'll
>> continue to have to resolve these sorts of unintended consequences.
>>
>> Alex
>>
>> [1] ACPI 6.2, 18.1 - Hardware Errors and Error Sources
> 
> 
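
To make the two approaches argued over in this thread concrete, here is a
minimal user-space sketch in C. It is not kernel code and not the patch
under discussion: struct fake_pci_dev, fake_read32(), fake_write32() and
both mask_msi_*() helpers are hypothetical names invented for illustration.
The first helper touches the device and then checks the result for all F's;
the second skips the access entirely once the device has been flagged as
disconnected, which is roughly the behavior the subject line asks for.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI device; NOT the kernel's struct pci_dev. */
struct fake_pci_dev {
	bool present;             /* is the hardware physically there? */
	bool disconnected;        /* set by the removal path once it knows */
	uint32_t msi_ctrl;        /* simulated MSI control register */
	unsigned int hw_accesses; /* accesses that actually reached the "bus" */
};

/* Simulated register read: a missing device reads back as all ones. */
static uint32_t fake_read32(struct fake_pci_dev *dev)
{
	dev->hw_accesses++;
	return dev->present ? dev->msi_ctrl : UINT32_MAX;
}

/* Simulated register write: silently dropped if the device is gone. */
static void fake_write32(struct fake_pci_dev *dev, uint32_t val)
{
	dev->hw_accesses++;
	if (dev->present)
		dev->msi_ctrl = val;
}

/*
 * Approach 1: touch the device, then check for all F's.
 * The read still goes out on the bus, so any platform-level reaction
 * to that access has already been triggered by the time we check.
 */
static bool mask_msi_check_after(struct fake_pci_dev *dev)
{
	uint32_t val = fake_read32(dev);

	if (val == UINT32_MAX)
		return false;  /* device looks gone; give up */
	fake_write32(dev, val | 1u);
	return true;
}

/*
 * Approach 2: skip the access when the device is already known to be
 * gone. This is only race-free if the flag was set before we got here;
 * a device removed after the check is still accessed.
 */
static bool mask_msi_check_before(struct fake_pci_dev *dev)
{
	if (dev->disconnected)
		return false;  /* never touch the hardware */
	fake_write32(dev, fake_read32(dev) | 1u);
	return true;
}
```

In this sketch the hw_accesses counter stands in for "something the
platform could react to": the first helper always increments it at least
once, even for a removed device, while the second leaves it untouched
whenever the disconnected flag is already set.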