Message-ID: <20180522175742.GA3543@agluck-desk>
Date:   Tue, 22 May 2018 10:57:42 -0700
From:   "Luck, Tony" <tony.luck@...el.com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     "Alex G." <mr.nuke.me@...il.com>,
        "Rafael J. Wysocki" <rafael@...nel.org>, alex_gagniuc@...lteam.com,
        austin_bolen@...l.com, shyam_iyer@...l.com,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Len Brown <lenb@...nel.org>,
        Tyler Baicar <tbaicar@...eaurora.org>,
        Will Deacon <will.deacon@....com>,
        James Morse <james.morse@....com>,
        Shiju Jose <shiju.jose@...wei.com>,
        "Jonathan (Zhixiong) Zhang" <zjzhang@...eaurora.org>,
        Dongjiu Geng <gengdongjiu@...wei.com>,
        ACPI Devel Maling List <linux-acpi@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to
 ghes_cper_severity()

On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
> I especially don't want to have the case where a PCIe error is *really*
> fatal and then we noodle in some handlers debating about the severity
> because it got marked as recoverable intermittently and end up causing
> data corruption on the storage device. Here's a real no-no for ya.

All that we have is a message from the BIOS that this is a "fatal"
error.  When did we start trusting the BIOS to give us accurate
information?

PCIe fatal means that the link or the device is broken. But that
seems a poor reason to take down a large server that may have
dozens of devices (some of them set up specifically to handle
errors ... e.g. mirrored disks on separate controllers, or NIC
devices that have been "bonded" together).

So, as long as the action for a "fatal" error is to mark a link
down and offline the device, that seems a pretty reasonable course
of action.

The argument gets a lot more marginal if you simply reset the
link and re-enable the device to "fix" it. That might be enough,
but I don't think the OS has enough data to make the call.
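
To make that concrete, here is a rough sketch of the policy I have in
mind. Every name in it is made up for illustration and none of it is
existing kernel API. The handling of the non-fatal cases is my own
assumption; the only part argued for above is that "fatal" maps to
taking the device offline rather than taking down the whole machine.

```c
/* Hypothetical sketch only: these types and names are not kernel APIs. */

/* Severity as reported by firmware (assumed three-level scale). */
enum fw_severity {
	FW_SEV_CORRECTED,
	FW_SEV_RECOVERABLE,
	FW_SEV_FATAL,
};

/* What the OS decides to do about the reporting device. */
enum os_action {
	ACT_LOG_ONLY,		/* record the event, keep using the device */
	ACT_ATTEMPT_RECOVERY,	/* assumption: try normal error recovery */
	ACT_OFFLINE_DEVICE,	/* mark the link down, stop using the device */
};

/*
 * Pick an action for a PCIe error reported by firmware.
 *
 * A "fatal" report is read as "the link or the device is broken", so
 * the device goes offline and the rest of the machine keeps running.
 * It deliberately does not map to "reset the link and re-enable".
 */
static enum os_action pcie_error_action(enum fw_severity sev)
{
	switch (sev) {
	case FW_SEV_CORRECTED:
		return ACT_LOG_ONLY;
	case FW_SEV_RECOVERABLE:
		return ACT_ATTEMPT_RECOVERY;
	case FW_SEV_FATAL:
		return ACT_OFFLINE_DEVICE;
	}

	/* Unknown severity: be conservative and stop using the device. */
	return ACT_OFFLINE_DEVICE;
}
```

The point of the sketch is that no branch brings down the server: the
worst case is losing one device, which mirrored disks or bonded NICs
are set up to survive.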

-Tony

P.S. I deliberately put "fatal" in quotes above because to
quote "The Princess Bride" -- "that word, I do not think it
means what you think it means". :-)
