Message-ID: <5dc58180-d3c0-a9f0-282f-4be433c94052@gmail.com>
Date: Tue, 22 May 2018 13:19:34 -0500
From: "Alex G." <mr.nuke.me@...il.com>
To: "Rafael J. Wysocki" <rafael@...nel.org>,
"Luck, Tony" <tony.luck@...el.com>
Cc: Borislav Petkov <bp@...en8.de>, alex_gagniuc@...lteam.com,
austin_bolen@...l.com, shyam_iyer@...l.com,
"Rafael J. Wysocki" <rjw@...ysocki.net>,
Len Brown <lenb@...nel.org>,
Tyler Baicar <tbaicar@...eaurora.org>,
Will Deacon <will.deacon@....com>,
James Morse <james.morse@....com>,
Shiju Jose <shiju.jose@...wei.com>,
"Jonathan (Zhixiong) Zhang" <zjzhang@...eaurora.org>,
Dongjiu Geng <gengdongjiu@...wei.com>,
ACPI Devel Maling List <linux-acpi@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()

On 05/22/2018 01:10 PM, Rafael J. Wysocki wrote:
> On Tue, May 22, 2018 at 7:57 PM, Luck, Tony <tony.luck@...el.com> wrote:
>> On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
>>> I especially don't want to have the case where a PCIe error is *really*
>>> fatal and then we noodle in some handlers debating about the severity
>>> because it got marked as recoverable intermittently and end up causing
>>> data corruption on the storage device. Here's a real no-no for ya.
>>
>> All that we have is a message from the BIOS that this is a "fatal"
>> error. When did we start trusting the BIOS to give us accurate
>> information?
>
> Some time ago, actually.
>
> This is about changing the existing behavior which has been to treat
> "fatal" errors reported by the BIOS as good enough reasons for a panic
> for quite a while AFAICS.

Yes, you are correct. I'd actually like to go deeper, and remove the
policy to panic() on fatal errors. Now whether we blacklist or whitelist
which errors can go through is up for debate, but the current policy
seems broken.

>> PCIe fatal means that the link or the device is broken.
>
> And that may really mean that the component in question is on fire.
> We just don't know.

Should there be a physical fire, we have much bigger issues. At the same
time, we could retrain the link, call the driver, and release freon gas
to put out the fire. That's something we don't currently have the option
to do.

>> But that seems a poor reason to take down a large server that may have
>> dozens of devices (some of them set up specifically to handle
>> errors ... e.g. mirrored disks on separate controllers, or NIC
>> devices that have been "bonded" together).
>>
>> So, as long as the action for a "fatal" error is to mark a link
>> down and offline the device, that seems a pretty reasonable course
>> of action.
>>
>> The argument gets a lot more marginal if you simply reset the
>> link and re-enable the device to "fix" it. That might be enough,
>> but I don't think the OS has enough data to make the call.
>
> Again, that's about changing the existing behavior or the existing policy even.
>
> What exactly has changed to make us consider this now?

Firmware started passing "fatal" GHES headers with the explicit intent
of crashing an OS. At the same time, we've learnt how to handle these
errors in a number of cases. With DPC (coming soon to firmware-first)
the error is contained, and a non-issue.

As specs and hardware implementations evolve, we have to adapt. I'm here
until November, and one of my goals is to involve linux upstream in the
development of these features so that when the hardware hits the market,
we're ready. That does mean we have to drop some of the silly things
we're doing.

Alex

> Thanks,
> Rafael
>