Date:   Thu, 27 Oct 2022 12:24:45 +0800
From:   Shuai Xue <xueshuai@...ux.alibaba.com>
To:     rafael@...nel.org, lenb@...nel.org, james.morse@....com,
        tony.luck@...el.com, bp@...en8.de, dave.hansen@...ux.intel.com,
        jarkko@...nel.org, naoya.horiguchi@....com, linmiaohe@...wei.com,
        akpm@...ux-foundation.org
Cc:     stable@...r.kernel.org, linux-acpi@...r.kernel.org,
        linux-kernel@...r.kernel.org, cuibixuan@...ux.alibaba.com,
        baolin.wang@...ux.alibaba.com, zhuo.song@...ux.alibaba.com,
        xueshuai@...ux.alibaba.com
Subject: [PATCH] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on action required events

There are two major types of uncorrected (UC) errors:

- Action Required: The error is detected and the processor has already
  consumed the corrupted memory. The OS must take action (for example,
  offline the failing page or kill the failing thread) to recover from
  this uncorrectable error.

- Action Optional: The error is detected outside of the processor's
  execution context. Some data in memory is corrupted, but it has not
  been consumed. The OS may optionally take action to recover from this
  uncorrectable error.

On x86 platforms, these two types can easily be distinguished based on
the MCA bank. On arm64 platforms, however, the memory failure flags for
all UCs whose severity is GHES_SEV_RECOVERABLE are currently set to 0,
i.e. Action Optional.

If a UC is detected by a background scrubber, it is obviously an Action
Optional error. All other errors should conservatively be regarded as
Action Required.

cper_sec_mem_err::error_type identifies the type of error that occurred
if CPER_MEM_VALID_ERROR_TYPE is set. So, set the memory failure flags to
0 for Scrub Uncorrected Error (type 14). Otherwise, including when the
error type is not valid, set the memory failure flags to
MF_ACTION_REQUIRED.

Signed-off-by: Shuai Xue <xueshuai@...ux.alibaba.com>
---
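For illustration only, the flag selection for the recoverable case can be
sketched as a standalone userspace function. This is a minimal sketch, not
the kernel code: the sketch_* names, the struct layout, and the numeric
values of the validity bit and of MF_ACTION_REQUIRED are assumptions made
so that it compiles outside the kernel. Only the scrub UC type value (14)
is taken from this patch.

```c
#include <stdint.h>

/* Stand-in constants for this sketch; real definitions live in kernel headers. */
#define SKETCH_MEM_VALID_ERROR_TYPE	0x0002u		/* assumed validity bit */
#define SKETCH_MEM_SCRUB_UC		14		/* type value from this patch */
#define SKETCH_MF_ACTION_REQUIRED	(1 << 1)	/* assumed flag value */

/* Hypothetical, reduced stand-in for struct cper_sec_mem_err. */
struct sketch_mem_err {
	uint64_t validation_bits;
	uint8_t error_type;
};

/*
 * Return the memory failure flags for a recoverable UC error:
 * 0 (Action Optional) only when the error type is valid and reports a
 * scrub uncorrected error, SKETCH_MF_ACTION_REQUIRED in every other case.
 */
int sketch_recoverable_flags(const struct sketch_mem_err *mem_err)
{
	if ((mem_err->validation_bits & SKETCH_MEM_VALID_ERROR_TYPE) &&
	    mem_err->error_type == SKETCH_MEM_SCRUB_UC)
		return 0;

	return SKETCH_MF_ACTION_REQUIRED;
}
```

Folding the two MF_ACTION_REQUIRED paths into a single return keeps the
conservative default in one place; the behavior is intended to match the
nested conditional in the hunk below.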
 drivers/acpi/apei/ghes.c | 10 ++++++++--
 include/linux/cper.h     |  3 +++
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 80ad530583c9..6c03059cbfc6 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -474,8 +474,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 	if (sec_sev == GHES_SEV_CORRECTED &&
 	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
 		flags = MF_SOFT_OFFLINE;
-	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
-		flags = 0;
+	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
+		if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
+			flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
+					0 :
+					MF_ACTION_REQUIRED;
+		else
+			flags = MF_ACTION_REQUIRED;
+	}
 
 	if (flags != -1)
 		return ghes_do_memory_failure(mem_err->physical_addr, flags);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index eacb7dd7b3af..b77ab7636614 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -235,6 +235,9 @@ enum {
 #define CPER_MEM_VALID_BANK_ADDRESS		0x100000
 #define CPER_MEM_VALID_CHIP_ID			0x200000
 
+#define CPER_MEM_SCRUB_CE			13
+#define CPER_MEM_SCRUB_UC			14
+
 #define CPER_MEM_EXT_ROW_MASK			0x3
 #define CPER_MEM_EXT_ROW_SHIFT			16
 
-- 
2.20.1.9.gb50a0d7
