Message-ID: <95bcbc2d-0f8c-e51a-f0fc-08ea8c5fca26@gmail.com>
Date:   Fri, 11 May 2018 12:01:52 -0500
From:   "Alex G." <mr.nuke.me@...il.com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     alex_gagniuc@...lteam.com, austin_bolen@...l.com,
        shyam_iyer@...l.com, "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Len Brown <lenb@...nel.org>, Tony Luck <tony.luck@...el.com>,
        Mauro Carvalho Chehab <mchehab@...nel.org>,
        Robert Moore <robert.moore@...el.com>,
        Erik Schmauss <erik.schmauss@...el.com>,
        Tyler Baicar <tbaicar@...eaurora.org>,
        Will Deacon <will.deacon@....com>,
        James Morse <james.morse@....com>,
        Shiju Jose <shiju.jose@...wei.com>,
        "Jonathan (Zhixiong) Zhang" <zjzhang@...eaurora.org>,
        Dongjiu Geng <gengdongjiu@...wei.com>,
        linux-acpi@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-edac@...r.kernel.org, devel@...ica.org
Subject: Re: [RFC PATCH v4 3/3] acpi: apei: Do not panic() on PCIe errors
 reported through GHES

On 05/11/2018 11:29 AM, Borislav Petkov wrote:
> On Fri, May 11, 2018 at 11:12:25AM -0500, Alex G. wrote:
>>> I think *you* didn't get it: IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER) is not
>>> enough of a check to confirm that there actually *is* an AER driver to
>>> handle the errors. If you really want to make sure the driver is loaded
>>> and functioning, then you need an explicit registering mechanism or some
>>> other way of checking it really is there and handling errors.
>>
>> config ACPI_APEI_PCIEAER
>> 	bool "APEI PCIe AER logging/recovering support"
>> 	depends on ACPI_APEI && PCIEAER
>> 	help
>> 	  PCIe AER errors may be reported via APEI firmware first mode.
>> 	  Turn on this option to enable the corresponding support.
>>
>> PCIEAER is not modularizable. QED
> 
> QED my ass.
> 
> Read the f*ck my email again: the presence of the *code* is
> not enough of a check to confirm the error has been handled.
> aer_recover_work_func() can fail as that kfifo_put() in
> aer_recover_queue() can too.
> 
> You need an *actual* confirmation that the error has been handled
> properly and *only* *then* not panic the system. Otherwise you are
> potentially leaving those errors unhandled.


"How is PCIe error severity dependent on whether the AER error reporting
 driver is enabled (and possibly not even loaded) on the system?"

Little was said about confirming that the error was handled, either in
your **** email or in previous versions of this series. And quite
frankly, it's beyond the scope of this patch.

The scope is to enable SURPRISE!!! removal of NVMe drives and PCIe
devices. For that purpose, we don't need confirmation that the error was
handled. Such a confirmation requires a rework of GHES handling, or at
least the interaction between GHES and AER, both of which I find to be
mostly satisfactory.

You can't at this point know if the error is going to be handled.
There's code further downstream to handle this. You also didn't like it
when I wanted to handle things downstream.
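
To illustrate the point under discussion, here is a minimal user-space
sketch of a bounded error queue. It is not the kernel code: err_entry,
err_ring, ring_put(), ring_get() and RING_SIZE are hypothetical names
made up for this example. It only mirrors the general shape of a
kfifo-style put, which reports failure when the buffer is full. At
enqueue time the caller learns whether the entry was queued, not
whether the downstream handler will later recover the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4 /* hypothetical capacity, chosen only for the example */

/* Hypothetical record of one reported error. */
struct err_entry {
	unsigned int bus, devfn;
	int severity;
};

struct err_ring {
	struct err_entry slots[RING_SIZE];
	size_t head;  /* next slot to write */
	size_t tail;  /* next slot to read */
	size_t count; /* entries currently queued */
};

/* Returns false when the ring is full: the error was NOT queued. */
static bool ring_put(struct err_ring *r, struct err_entry e)
{
	if (r->count == RING_SIZE)
		return false;
	r->slots[r->head] = e;
	r->head = (r->head + 1) % RING_SIZE;
	r->count++;
	return true;
}

/* Returns false when the ring is empty. */
static bool ring_get(struct err_ring *r, struct err_entry *out)
{
	if (r->count == 0)
		return false;
	*out = r->slots[r->tail];
	r->tail = (r->tail + 1) % RING_SIZE;
	r->count--;
	return true;
}

/* Small self-check: the fifth put into a 4-slot ring is rejected. */
static void ring_selfcheck(void)
{
	struct err_ring r = {0};
	struct err_entry e = { .bus = 1, .devfn = 0, .severity = 2 };

	for (int i = 0; i < RING_SIZE; i++)
		assert(ring_put(&r, e));
	assert(!ring_put(&r, e));
}
```

A successful ring_put() here says nothing about what the consumer does
with the entry later. That is the gap between "queued" and "handled".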

I understand your concern with unhandled AER errors evolving into MCEs.
That's extremely rare, but when it happens you still panic due to the
MCE. To give you an idea of the rarity, in several months of testing, I
was only able to reproduce MCEs once, and that was with a very defective
drive, and a very idiotic test case.

If you find this solution unacceptable, that's fine. We can fix it in
firmware. We can hide all the events from the OS, contain the downstream
ports, and simulate hot-remove interrupts. All in firmware, all the time.

Alex
