linux-kernel - RE: Questions: Should kernel panic when PCIe fatal error occurs?

Message-ID: <2e5870e416f84e8fad8340061ec303e2@AcuMS.aculab.com>
Date:   Thu, 21 Sep 2023 13:20:58 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Shuai Xue' <xueshuai@...ux.alibaba.com>,
        Bjorn Helgaas <helgaas@...nel.org>
CC:     "Rafael J. Wysocki" <rafael@...nel.org>,
        "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
        Linux PCI <linux-pci@...r.kernel.org>,
        "mahesh@...ux.ibm.com" <mahesh@...ux.ibm.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>,
        Baolin Wang <baolin.wang@...ux.alibaba.com>,
        Jonathan Cameron <Jonathan.Cameron@...wei.com>,
        "bhelgaas@...gle.com" <bhelgaas@...gle.com>,
        "james.morse@....com" <james.morse@....com>,
        "linuxppc-dev@...ts.ozlabs.org" <linuxppc-dev@...ts.ozlabs.org>,
        "lenb@...nel.org" <lenb@...nel.org>
Subject: RE: Questions: Should kernel panic when PCIe fatal error occurs?

...
I've got a target that generates AER errors by issuing read cycles
that are inside the address range that the bridge forwards but
outside of any BAR, which is possible because there are 2 different
sized BARs.
(Pretty easy to set up.)
On the system I was using they didn't get propagated all the way
to the root bridge - but were visible in the lower bridge.
It would be nice for a driver to be able to detect/clear such
a flag if it gets an unexpected ~0u read value.
(I'm not sure an error callback helps.)
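
To make the detect/clear idea concrete, here is a rough userspace sketch
in C. It is only a model and not kernel code: mock_bridge, mock_dev,
mock_read32() and checked_read32() are all hypothetical names, and the
single error_flag stands in for whatever status bit the lower bridge
actually latches. The point it illustrates is that ~0u on its own is
ambiguous (it can be valid data), so the flag is what decides.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the bridge above the device: one sticky flag. */
struct mock_bridge {
    bool error_flag;            /* stands in for a latched error status bit */
};

/* Hypothetical device with a single small BAR behind that bridge. */
struct mock_dev {
    struct mock_bridge *bridge;
    uint32_t bar_size;          /* bytes decoded by the BAR */
    uint32_t regs[4];           /* register contents inside the BAR */
};

enum read_status { READ_OK, READ_FAILED };

/*
 * Mock of a 32-bit read (offset assumed 4-byte aligned). A read outside
 * the BAR is claimed by nobody, so it returns all-ones and latches the
 * flag in the bridge.
 */
static uint32_t mock_read32(struct mock_dev *dev, uint32_t offset)
{
    if (offset + sizeof(uint32_t) > dev->bar_size) {
        dev->bridge->error_flag = true;
        return ~0u;
    }
    return dev->regs[offset / 4];
}

/*
 * Read a register and, only if the value is ~0u, consult the bridge flag
 * to decide whether the read really failed. The flag is cleared so that
 * a later error can be detected again.
 */
static enum read_status checked_read32(struct mock_dev *dev, uint32_t offset,
                                       uint32_t *val)
{
    *val = mock_read32(dev, offset);
    if (*val != ~0u)
        return READ_OK;

    if (dev->bridge->error_flag) {
        dev->bridge->error_flag = false;
        return READ_FAILED;
    }
    return READ_OK;             /* register genuinely holds all-ones */
}
```

With a 16-byte BAR, checked_read32(&dev, 8, &v) on a register that really
contains ~0u gives READ_OK, while checked_read32(&dev, 64, &v) gives
READ_FAILED and leaves the flag clear. In this model a flag left set by
some earlier, unrelated access would be blamed on the next ~0u read, so a
real driver would need to own or serialise access to the flag.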

OTOH a 'nebs compliant' server routed any kind of PCIe link error
through to some 'system management' logic that then raised an NMI.
I'm not sure who thought an NMI was a good idea - they are pretty
impossible to handle in the kernel and too late to be of use to
the code performing the access.

In any case we were getting an NMI after doing 'echo 1 >xxx/remove'
and then taking the PCIe link down by reprogramming the FPGA.
So the link going down was entirely expected, but there seemed
to be nothing we could do to stop the kernel crashing.

I'm sure 'nebs compliant' ought to contain some requirements for
resilience to hardware failures!

	David
