linux-kernel - Re: [RFC PATCH v2 4/4] acpi: apei: Warn when GHES marks correctable errors as "fatal"

Message-ID: <20180418175452.GK4795@pd.tnic>
Date:   Wed, 18 Apr 2018 19:54:52 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     Alexandru Gagniuc <mr.nuke.me@...il.com>
Cc:     linux-acpi@...r.kernel.org, linux-edac@...r.kernel.org,
        rjw@...ysocki.net, lenb@...nel.org, tony.luck@...el.com,
        tbaicar@...eaurora.org, will.deacon@....com, james.morse@....com,
        shiju.jose@...wei.com, zjzhang@...eaurora.org,
        gengdongjiu@...wei.com, linux-kernel@...r.kernel.org,
        alex_gagniuc@...lteam.com, austin_bolen@...l.com,
        shyam_iyer@...l.com, devel@...ica.org, mchehab@...nel.org,
        robert.moore@...el.com, erik.schmauss@...el.com
Subject: Re: [RFC PATCH v2 4/4] acpi: apei: Warn when GHES marks correctable
 errors as "fatal"

On Mon, Apr 16, 2018 at 04:59:03PM -0500, Alexandru Gagniuc wrote:
> There seems to be a culture amongst BIOS teams to want to crash the
> OS when an error can't be handled in firmware. Marking GHES errors as
> "fatal" is a very common way to do this.
> 
> However, a number of errors reported by GHES may be fatal in the sense
> a device or link is lost, but are not fatal to the system. When there
> is a disagreement with firmware about the handleability of an error,
> print a warning message.
> 
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@...il.com>
> ---
>  drivers/acpi/apei/ghes.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index e0528da4e8f8..6a117825611d 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -535,13 +535,14 @@ static const struct ghes_handler *get_handler(const guid_t *type)
>  static void ghes_do_proc(struct ghes *ghes,
>  			 const struct acpi_hest_generic_status *estatus)
>  {
> -	int sev, sec_sev;
> +	int sev, sec_sev, corrected_sev;
>  	struct acpi_hest_generic_data *gdata;
>  	const struct ghes_handler *handler;
>  	guid_t *sec_type;
>  	guid_t *fru_id = &NULL_UUID_LE;
>  	char *fru_text = "";
>  
> +	corrected_sev = GHES_SEV_NO;
>  	sev = ghes_severity(estatus->error_severity);
>  	apei_estatus_for_each_section(estatus, gdata) {
>  		sec_type = (guid_t *)gdata->section_type;
> @@ -563,6 +564,13 @@ static void ghes_do_proc(struct ghes *ghes,
>  					       sec_sev, err,
>  					       gdata->error_data_length);
>  		}
> +
> +		corrected_sev = max(corrected_sev, sec_sev);
> +	}
> +
> +	if ((sev >= GHES_SEV_PANIC) && (corrected_sev < sev)) {
> +		pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
> +		pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");

No, I don't want any of that crap showing up in dmesg, and then people
opening bugs, running around and trying to replace hardware.

Either we can handle the error and log a normal record somewhere, or we
cannot and we explode. Complaining about the FW doesn't achieve shit.

-- 
Regards/Gruss,
    Boris.
