lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 21 Sep 2023 13:20:58 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Shuai Xue' <xueshuai@...ux.alibaba.com>,
        Bjorn Helgaas <helgaas@...nel.org>
CC:     "Rafael J. Wysocki" <rafael@...nel.org>,
        "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
        Linux PCI <linux-pci@...r.kernel.org>,
        "mahesh@...ux.ibm.com" <mahesh@...ux.ibm.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>,
        Baolin Wang <baolin.wang@...ux.alibaba.com>,
        Jonathan Cameron <Jonathan.Cameron@...wei.com>,
        "bhelgaas@...gle.com" <bhelgaas@...gle.com>,
        "james.morse@....com" <james.morse@....com>,
        "linuxppc-dev@...ts.ozlabs.org" <linuxppc-dev@...ts.ozlabs.org>,
        "lenb@...nel.org" <lenb@...nel.org>
Subject: RE: Questions: Should kernel panic when PCIe fatal error occurs?

...
I've got a target to generate AER errors by generating read cycles
that are inside the address range that the bridge forwards but
outside of any BAR because there are 2 different sized BARs.
(Pretty easy to setup.)
On the system I was using they didn't get propagated all the way
to the root bridge - but were visible in the lower bridge.
It would be nice for a driver to be able to detect/clear such
a flag if it gets an unexpected ~0u read value.
(I'm not sure an error callback helps.)

OTOH a 'nebs compliant' server routed any kind of PCIe link error
through to some 'system management' logic that then raised an NMI.
I'm not sure who thought an NMI was a good idea - they are pretty
impossible to handle in the kernel and too late to be of use to
the code performing the access.

In any case we were getting one after 'echo 1 >xxx/remove' and
then taking the PCIe link down by reprogramming the fpga.
So the link going down was entirely expected, but there seemed
to be nothing we could do to stop the kernel crashing.

I'm sure 'nebs compliant' ought to contain some requirements for
resilience to hardware failures!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ