Message-Id: <20180403170830.29282-1-mr.nuke.me@gmail.com>
Date:   Tue,  3 Apr 2018 12:08:26 -0500
From:   Alexandru Gagniuc <mr.nuke.me@...il.com>
To:     linux-acpi@...r.kernel.org
Cc:     rjw@...ysocki.net, lenb@...nel.org, tony.luck@...el.com,
        bp@...en8.de, tbaicar@...eaurora.org, will.deacon@....com,
        james.morse@....com, shiju.jose@...wei.com, zjzhang@...eaurora.org,
        gengdongjiu@...wei.com, linux-kernel@...r.kernel.org,
        alex_gagniuc@...lteam.com, austin_bolen@...l.com,
        shyam_iyer@...l.com, Alexandru Gagniuc <mr.nuke.me@...il.com>
Subject: [RFC PATCH 0/4]  acpi: apei: Improve error handling with firmware-first

Hi,

I'm helping Dell work through the issues related to PCIe and NVMe
hotplug. Although hotplug generally works, there are corner cases such as
pin bounce, drives failing, and surprise removal that are not 100% worked out.
Because of this, NVMe is not yet at feature parity with SCSI and SAS.

One of the interesting issues is that most server vendors like to use
firmware-first (FFS), for various reasons that I won't go into. The side
effect of that is that we oftentimes don't even get a stab at correcting the
problem.

This is especially troublesome for NVMe, which needs PCIe hotplug to work
correctly. When we do get a stab, it's after FFS can't handle a fatal error,
and we're told of it through ACPI tables. On x86, this happens through an
NMI, and as soon as we see a "fatal" error, we panic().

One problem with this FFS approach is that AER never even gets notified of
the issue. Yet even if a PCIe drive were to stop responding, the nvme or
higher-level block drivers would notice something is wrong even without AER.
Unless there is a physical defect or silicon bug, AER can recover the link.

Another issue we're seeing with FFS is that BIOSes assume that an OS will crash
on a fatal error reported through ACPI. Sometimes they will leave hardware in
a "kind of" working state, or will fail to re-arm the errors. From what I've
observed, this happens on hardware with silicon bugs. For example, PCIe root
ports are unaffected, but certain PCIe switches may stop issuing hotplug
interrupts. It's just another headache with FFS.

While I don't expect server vendors to drop FFS in favor of native AER control,
I do think we can harden Linux against the idiosyncrasies of such systems. The
scope of these patches is to protect against poorly designed firmware and to
perform proper error handling when possible. It is not to make FFS a
first-class citizen in error handling.

Alexandru Gagniuc (4):
  acpi: apei: Return severity of GHES messages after handling
  acpi: apei: Swap ghes_print_queued_estatus and ghes_proc_in_irq
  acpi: apei: Do not panic() in NMI because of GHES messages
  acpi: apei: Warn when GHES marks correctable errors as "fatal"

 drivers/acpi/apei/ghes.c | 100 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 64 insertions(+), 36 deletions(-)

--
2.14.3
