Message-ID: <20180418175452.GK4795@pd.tnic>
Date: Wed, 18 Apr 2018 19:54:52 +0200
From: Borislav Petkov <bp@...en8.de>
To: Alexandru Gagniuc <mr.nuke.me@...il.com>
Cc: linux-acpi@...r.kernel.org, linux-edac@...r.kernel.org,
rjw@...ysocki.net, lenb@...nel.org, tony.luck@...el.com,
tbaicar@...eaurora.org, will.deacon@....com, james.morse@....com,
shiju.jose@...wei.com, zjzhang@...eaurora.org,
gengdongjiu@...wei.com, linux-kernel@...r.kernel.org,
alex_gagniuc@...lteam.com, austin_bolen@...l.com,
shyam_iyer@...l.com, devel@...ica.org, mchehab@...nel.org,
robert.moore@...el.com, erik.schmauss@...el.com
Subject: Re: [RFC PATCH v2 4/4] acpi: apei: Warn when GHES marks correctable
errors as "fatal"

On Mon, Apr 16, 2018 at 04:59:03PM -0500, Alexandru Gagniuc wrote:
> There seems to be a culture amongst BIOS teams to want to crash the
> OS when an error can't be handled in firmware. Marking GHES errors as
> "fatal" is a very common way to do this.
>
> However, a number of errors reported by GHES may be fatal in the sense
> a device or link is lost, but are not fatal to the system. When there
> is a disagreement with firmware about the handleability of an error,
> print a warning message.
>
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@...il.com>
> ---
> drivers/acpi/apei/ghes.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index e0528da4e8f8..6a117825611d 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -535,13 +535,14 @@ static const struct ghes_handler *get_handler(const guid_t *type)
> static void ghes_do_proc(struct ghes *ghes,
> const struct acpi_hest_generic_status *estatus)
> {
> - int sev, sec_sev;
> + int sev, sec_sev, corrected_sev;
> struct acpi_hest_generic_data *gdata;
> const struct ghes_handler *handler;
> guid_t *sec_type;
> guid_t *fru_id = &NULL_UUID_LE;
> char *fru_text = "";
>
> + corrected_sev = GHES_SEV_NO;
> sev = ghes_severity(estatus->error_severity);
> apei_estatus_for_each_section(estatus, gdata) {
> sec_type = (guid_t *)gdata->section_type;
> @@ -563,6 +564,13 @@ static void ghes_do_proc(struct ghes *ghes,
> sec_sev, err,
> gdata->error_data_length);
> }
> +
> + corrected_sev = max(corrected_sev, sec_sev);
> + }
> +
> + if ((sev >= GHES_SEV_PANIC) && (corrected_sev < sev)) {
> + pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
> + pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");

No, I don't want any of that crap issuing stuff in dmesg and then people
opening bugs and running around and trying to replace hardware.
We either can handle the error and log a normal record somewhere or we
cannot and explode. The complaining about the FW doesn't bring shit.
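
For reference, the condition the quoted hunk tests can be sketched in isolation roughly as below. This is a standalone illustration, not the ghes.c code: the enum is a stand-in whose names echo the GHES_SEV_* constants in the patch (the numeric values are an assumption), and max_section_sev() and fw_overstates_severity() are hypothetical helpers that do not exist in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in severity levels; names echo the patch, numeric values are assumed. */
enum sev { SEV_NO = 0, SEV_CORRECTED = 1, SEV_RECOVERABLE = 2, SEV_PANIC = 3 };

/* Hypothetical helper: highest severity found among the per-section values. */
static int max_section_sev(const int *sec_sev, size_t n)
{
	int worst = SEV_NO;

	for (size_t i = 0; i < n; i++)
		if (sec_sev[i] > worst)
			worst = sec_sev[i];
	return worst;
}

/*
 * Hypothetical helper: true when the record-level severity is fatal but
 * no individual section is as severe, which is the case the patch warns on.
 */
static bool fw_overstates_severity(int sev, const int *sec_sev, size_t n)
{
	return sev >= SEV_PANIC && max_section_sev(sec_sev, n) < sev;
}
```

With a record marked SEV_PANIC whose sections are only corrected or recoverable, fw_overstates_severity() returns true; if any section is itself SEV_PANIC, it returns false.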
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.