Message-ID: <CAJZ5v0hdgxsDiXqOmeqBQoZUQJ1RssM=3jpYpWt3qzy0n2eyaA@mail.gmail.com>
Date: Fri, 28 Oct 2022 19:08:13 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Shuai Xue <xueshuai@...ux.alibaba.com>
Cc: rafael@...nel.org, lenb@...nel.org, james.morse@....com,
tony.luck@...el.com, bp@...en8.de, dave.hansen@...ux.intel.com,
jarkko@...nel.org, naoya.horiguchi@....com, linmiaohe@...wei.com,
akpm@...ux-foundation.org, stable@...r.kernel.org,
linux-acpi@...r.kernel.org, linux-kernel@...r.kernel.org,
cuibixuan@...ux.alibaba.com, baolin.wang@...ux.alibaba.com,
zhuo.song@...ux.alibaba.com
Subject: Re: [PATCH] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on action required events

On Thu, Oct 27, 2022 at 6:25 AM Shuai Xue <xueshuai@...ux.alibaba.com> wrote:
>
> There are two major types of uncorrected error (UC):
>
> - Action Required: The error is detected and the processor has already consumed
> the corrupted memory. The OS must take action (for example, offline the failing
> page or kill the failing thread) to recover from this uncorrectable error.
>
> - Action Optional: The error is detected outside of the processor execution
> context. Some data in memory is corrupted, but it has not been consumed yet.
> The OS may optionally take action to recover from this uncorrectable error.
>
> On x86 platforms, these two types are easily distinguished based on the
> MCA bank. On arm64 platforms, however, the memory failure flags for all
> UCs whose severity is GHES_SEV_RECOVERABLE are currently set to 0,
> i.e. Action Optional.
>
> If UC is detected by a background scrubber, it is obviously an Action
> Optional error. For other errors, we should conservatively regard them
> as Action Required.
>
> cper_sec_mem_err::error_type identifies the type of error that occurred
> if CPER_MEM_VALID_ERROR_TYPE is set. So, set the memory failure flags to 0
> for Scrub Uncorrected Error (type 14). Otherwise, set the memory failure
> flags to MF_ACTION_REQUIRED.
>
> Signed-off-by: Shuai Xue <xueshuai@...ux.alibaba.com>

I need input from the APEI reviewers on this.

Thanks!

> ---
> drivers/acpi/apei/ghes.c | 10 ++++++++--
> include/linux/cper.h | 3 +++
> 2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 80ad530583c9..6c03059cbfc6 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -474,8 +474,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>  	if (sec_sev == GHES_SEV_CORRECTED &&
>  	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>  		flags = MF_SOFT_OFFLINE;
> -	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> -		flags = 0;
> +	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
> +		if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
> +			flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
> +				0 :
> +				MF_ACTION_REQUIRED;
> +		else
> +			flags = MF_ACTION_REQUIRED;
> +	}
>
>  	if (flags != -1)
>  		return ghes_do_memory_failure(mem_err->physical_addr, flags);
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index eacb7dd7b3af..b77ab7636614 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -235,6 +235,9 @@ enum {
> #define CPER_MEM_VALID_BANK_ADDRESS 0x100000
> #define CPER_MEM_VALID_CHIP_ID 0x200000
>
> +#define CPER_MEM_SCRUB_CE 13
> +#define CPER_MEM_SCRUB_UC 14
> +
> #define CPER_MEM_EXT_ROW_MASK 0x3
> #define CPER_MEM_EXT_ROW_SHIFT 16
>
> --
> 2.20.1.9.gb50a0d7
>
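
For readers following the thread, the flag selection in the quoted ghes.c hunk can be read as a small pure function: the flags are 0 only when the record positively identifies a scrub UC, and MF_ACTION_REQUIRED in every other case, including when the error type field is not marked valid. Below is a minimal userspace sketch of that decision, not the kernel code itself. All `sketch_`/`SKETCH_` names are hypothetical, the struct is a trimmed stand-in for struct cper_sec_mem_err, and the numeric values of the validity bit and of MF_ACTION_REQUIRED are assumptions made for illustration (only the value 14 comes from the patch); the kernel headers remain authoritative.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for this sketch only; check the kernel headers. */
#define SKETCH_VALID_ERROR_TYPE    0x4000u /* stands in for CPER_MEM_VALID_ERROR_TYPE */
#define SKETCH_SCRUB_UC            14      /* CPER_MEM_SCRUB_UC, as in the quoted patch */
#define SKETCH_MF_ACTION_REQUIRED  0x2     /* stands in for MF_ACTION_REQUIRED */

/* Hypothetical, trimmed-down stand-in for struct cper_sec_mem_err. */
struct sketch_mem_err {
	uint64_t validation_bits;
	uint8_t error_type;
};

/*
 * Flags for a recoverable UC, mirroring the quoted hunk:
 * 0 (Action Optional) only for a validated scrub UC,
 * Action Required for everything else.
 */
int sketch_uc_flags(const struct sketch_mem_err *err)
{
	if ((err->validation_bits & SKETCH_VALID_ERROR_TYPE) &&
	    err->error_type == SKETCH_SCRUB_UC)
		return 0;
	return SKETCH_MF_ACTION_REQUIRED;
}

/* Usage: the three cases the commit message distinguishes. */
void sketch_examples(void)
{
	struct sketch_mem_err scrub   = { SKETCH_VALID_ERROR_TYPE, SKETCH_SCRUB_UC };
	struct sketch_mem_err other   = { SKETCH_VALID_ERROR_TYPE, 2 };
	struct sketch_mem_err unknown = { 0, SKETCH_SCRUB_UC };

	/* Validated scrub UC: Action Optional. */
	assert(sketch_uc_flags(&scrub) == 0);
	/* Validated, but some other error type: Action Required. */
	assert(sketch_uc_flags(&other) == SKETCH_MF_ACTION_REQUIRED);
	/* Error type field not marked valid: conservatively Action Required. */
	assert(sketch_uc_flags(&unknown) == SKETCH_MF_ACTION_REQUIRED);
}
```

The third case is the one worth noting: when the validity bit is clear, the contents of error_type are ignored, so a stale value of 14 does not downgrade the event to Action Optional.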