linux-kernel - Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

Message-ID: <20180418175415.GJ4795@pd.tnic>
Date:   Wed, 18 Apr 2018 19:54:15 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     Alexandru Gagniuc <mr.nuke.me@...il.com>
Cc:     linux-acpi@...r.kernel.org, linux-edac@...r.kernel.org,
        rjw@...ysocki.net, lenb@...nel.org, tony.luck@...el.com,
        tbaicar@...eaurora.org, will.deacon@....com, james.morse@....com,
        shiju.jose@...wei.com, zjzhang@...eaurora.org,
        gengdongjiu@...wei.com, linux-kernel@...r.kernel.org,
        alex_gagniuc@...lteam.com, austin_bolen@...l.com,
        shyam_iyer@...l.com, devel@...ica.org, mchehab@...nel.org,
        robert.moore@...el.com, erik.schmauss@...el.com
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable
 errors are marked as fatal.

On Mon, Apr 16, 2018 at 04:59:02PM -0500, Alexandru Gagniuc wrote:
> Firmware is evil:
>  - ACPI was created to "try and make the 'ACPI' extensions somehow
>  Windows specific" in order to "work well with NT and not the others
>  even if they are open"
>  - EFI was created to hide "secret" registers from the OS.
>  - UEFI was created to allow compromising an otherwise secure OS.
> 
> Never has firmware been created to solve a problem or simplify an
> otherwise cumbersome process. It is of no surprise then, that
> firmware nowadays intentionally crashes an OS.

I can't believe I'm saying this, but get rid of that rant. Even though I
agree, it doesn't belong in a commit message.

> 
> One simple way to do that is to mark GHES errors as fatal. Firmware
> knows and even expects that an OS will crash in this case. And most
> OSes do.
> 
> PCIe errors are notorious for having different definitions of "fatal".
> In ACPI, and other firmware standards, 'fatal' means the machine is
> about to explode and needs to be reset. In PCIe, on the other hand,
> fatal means that the link to a device has died. In the hotplug world
> of PCIe, this is akin to a USB disconnect. From that view, the "fatal"
> loss of a link is a normal event. To allow a machine to crash in this
> case is downright idiotic.
> 
> To solve this, implement an IRQ safe handler for AER. This makes sure
> we have enough information to invoke the full AER handler later down
> the road, and tells ghes_notify_nmi that "It's all cool".
> ghes_notify_nmi() then gets calmed down a little, and doesn't panic().
> 
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@...il.com>
> ---
>  drivers/acpi/apei/ghes.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 2119c51b4a9e..e0528da4e8f8 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -481,12 +481,26 @@ static int ghes_handle_aer(struct acpi_hest_generic_data *gdata, int sev)
>  	return ghes_severity(gdata->error_severity);
>  }
>  
> +static int ghes_handle_aer_irqsafe(struct acpi_hest_generic_data *gdata,
> +				   int sev)
> +{
> +	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> +	/* The system can always recover from AER errors. */
> +	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> +		pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO)
> +		return CPER_SEV_RECOVERABLE;
> +
> +	return ghes_severity(gdata->error_severity);
> +}

Well, Tyler touched that AER error severity handling recently and we had
it all nicely documented in the comment above ghes_handle_aer().

Your ghes_handle_aer_irqsafe() graft basically bypasses
ghes_handle_aer() instead of being incorporated into it.

If all you wanna say is, the severity computation should go through all
the sections and look at each error's severity before making a decision,
then add that to ghes_severity() instead of doing that "deferrable"
severity dance.
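
To make that concrete, here is a rough userspace sketch of the idea, not
kernel code: walk every section, compute a per-section severity, and keep
the worst one. All names in it (sketch_sev, sketch_section,
sketch_section_severity(), sketch_block_severity()) are hypothetical and
only model the policy from the patch above, where a PCIe AER section
carrying both a valid device ID and valid AER info is treated as
recoverable even if firmware reported it as fatal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical severity levels, ordered so that a larger value is worse. */
enum sketch_sev {
	SKETCH_SEV_NO = 0,
	SKETCH_SEV_CORRECTED,
	SKETCH_SEV_RECOVERABLE,
	SKETCH_SEV_PANIC,
};

/* Hypothetical stand-in for one section of an error status block. */
struct sketch_section {
	bool is_pcie_aer;		/* section describes a PCIe AER error */
	bool has_device_id;		/* device ID field is marked valid */
	bool has_aer_info;		/* AER info field is marked valid */
	enum sketch_sev reported;	/* severity as reported by firmware */
};

/*
 * Per-section policy (assumption, modeled on the patch): an AER section
 * with enough information to run the full AER handler later is never
 * worse than recoverable. Everything else keeps its reported severity.
 */
enum sketch_sev sketch_section_severity(const struct sketch_section *s)
{
	if (s->is_pcie_aer && s->has_device_id && s->has_aer_info &&
	    s->reported == SKETCH_SEV_PANIC)
		return SKETCH_SEV_RECOVERABLE;

	return s->reported;
}

/* Look at every section and return the worst per-section severity. */
enum sketch_sev sketch_block_severity(const struct sketch_section *secs,
				      size_t n)
{
	enum sketch_sev worst = SKETCH_SEV_NO;
	size_t i;

	assert(secs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		enum sketch_sev sev = sketch_section_severity(&secs[i]);

		if (sev > worst)
			worst = sev;
	}

	return worst;
}
```

With something like this, a block holding only a well-described AER
section comes out as recoverable, while a block that also holds a fatal
non-AER section still ends up at the panic level, because the decision is
made only after every section has been examined.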

And add the changes to the policy to the comment above
ghes_handle_aer(). I don't want any changes from people coming and going
and leaving us scratching heads why we did it this way.

And no need for those handlers and so on - make it simple first - then we
can talk about more complex handling.

-- 
Regards/Gruss,
    Boris.
