Message-ID: <20250207143028.1865-4-shiju.jose@huawei.com>
Date: Fri, 7 Feb 2025 14:30:24 +0000
From: <shiju.jose@...wei.com>
To: <linux-edac@...r.kernel.org>, <linux-cxl@...r.kernel.org>,
<mchehab@...nel.org>, <dave.jiang@...el.com>, <dan.j.williams@...el.com>,
<bp@...en8.de>, <jonathan.cameron@...wei.com>, <alison.schofield@...el.com>,
<vishal.l.verma@...el.com>, <ira.weiny@...el.com>, <dave@...olabs.net>
CC: <linux-kernel@...r.kernel.org>, <linuxarm@...wei.com>,
<tanxiaofei@...wei.com>, <prime.zeng@...ilicon.com>, <shiju.jose@...wei.com>
Subject: [PATCH 3/4] rasdaemon: cxl: Add storing memory repair needed info in the DRAM event record
From: Shiju Jose <shiju.jose@...wei.com>
Rasdaemon supports live memory repair for reported CXL DRAM errors that
have the 'maintenance needed' flag set. However, the kernel CXL driver
rejects a live memory repair request in the following situations:
1. The memory is online and the repair is disruptive.
2. The memory is online and the event record does not match.
In addition, rasdaemon does not request live memory repair if its auto
repair option is switched off.
In these unrepaired cases, the repair-needed information for a CXL DRAM
event must be stored in the CXL DRAM event record of the SQLite database.
A boot-up script can then read the repair status and repair attributes
on the next boot. If the memory has not been repaired, the script issues
the memory repair operation that the memory device requested in the
previous boot. The kernel CXL driver sends the repair command to the
device if the memory to be repaired is offline.
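
As a rough illustration of the boot-time pass described above, the sketch
below picks out the records that still need repair. It assumes the rows
have already been read from the SQLite database into a hypothetical
in-memory struct; 'stored_dram_event' and 'collect_pending_repairs' are
illustrative names and are not part of rasdaemon or of this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory copy of one stored CXL DRAM event row. */
struct stored_dram_event {
	uint64_t dpa;           /* device physical address from the record */
	bool     repair_needed; /* value of the repair_needed column */
};

/*
 * Copy the DPA of every record still marked repair_needed into out_dpa,
 * up to out_cap entries. Returns the number of addresses written.
 */
static size_t collect_pending_repairs(const struct stored_dram_event *recs,
				      size_t n, uint64_t *out_dpa,
				      size_t out_cap)
{
	size_t count = 0;

	assert(recs != NULL || n == 0);
	assert(out_dpa != NULL || out_cap == 0);

	for (size_t i = 0; i < n && count < out_cap; i++) {
		if (recs[i].repair_needed)
			out_dpa[count++] = recs[i].dpa;
	}
	return count;
}
```

A boot-up script, or a small helper written in C, could then request a
repair for each returned address while the memory is still offline.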
This change adds a 'repair_needed' field to the CXL DRAM event record of
the SQLite database and sets it whenever a sparing or PPR repair request
made by rasdaemon fails.
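
The decision logic can be modelled in isolation as follows. This is a
minimal sketch, not the handler itself: 'dram_event_model',
'record_repair_outcome' and the MODEL_DPA_NOT_REPAIRABLE value are
placeholders chosen for illustration, and the two return codes stand in
for the results of the sparing and PPR requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit value for illustration; see the rasdaemon sources
 * for the real CXL_DPA_NOT_REPAIRABLE definition. */
#define MODEL_DPA_NOT_REPAIRABLE (1u << 1)

/* Reduced, hypothetical stand-in for struct ras_cxl_dram_event. */
struct dram_event_model {
	uint8_t dpa_flags;
	bool    repair_needed;
};

/*
 * Mark the event as still needing repair if the sparing request failed,
 * or if the address is repairable and the PPR request failed. ppr_rc is
 * ignored for non-repairable addresses because no PPR request is made.
 */
static void record_repair_outcome(struct dram_event_model *ev,
				  int sparing_rc, int ppr_rc)
{
	assert(ev != NULL);

	if (sparing_rc < 0)
		ev->repair_needed = true;

	if (!(ev->dpa_flags & MODEL_DPA_NOT_REPAIRABLE)) {
		if (ppr_rc < 0)
			ev->repair_needed = true;
	}
}
```

The flag is only ever set, never cleared, so a failure in either repair
path is enough to have the event stored as needing repair.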
Signed-off-by: Shiju Jose <shiju.jose@...wei.com>
---
ras-cxl-handler.c | 9 ++++++---
ras-record.c | 2 ++
ras-record.h | 1 +
3 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c
index ae49740..25ef5c9 100644
--- a/ras-cxl-handler.c
+++ b/ras-cxl-handler.c
@@ -1609,10 +1609,13 @@ int ras_cxl_dram_event_handler(struct trace_seq *s,
if (trace_seq_printf(s, "CVME Count:%u ", ev.cvme_count) <= 0)
return -1;
- cxl_dram_sparing(&ev);
+ if (cxl_dram_sparing(&ev) < 0)
+ ev.repair_needed = true;
- if (!(ev.dpa_flags & CXL_DPA_NOT_REPAIRABLE))
- cxl_ppr(&ev.hdr, ev.dpa, ev.nibble_mask);
+ if (!(ev.dpa_flags & CXL_DPA_NOT_REPAIRABLE)) {
+ if (cxl_ppr(&ev.hdr, ev.dpa, ev.nibble_mask) < 0)
+ ev.repair_needed = true;
+ }
/* Insert data into the SGBD */
#ifdef HAVE_SQLITE3
diff --git a/ras-record.c b/ras-record.c
index eed7aca..ed745f9 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -993,6 +993,7 @@ static const struct db_fields cxl_dram_event_fields[] = {
{ .name = "sub_channel", .type = "INTEGER" },
{ .name = "cme_threshold_ev_flags", .type = "INTEGER" },
{ .name = "cvme_count", .type = "INTEGER" },
+ { .name = "repair_needed", .type = "INTEGER" },
};
static const struct db_table_descriptor cxl_dram_event_tab = {
@@ -1043,6 +1044,7 @@ int ras_store_cxl_dram_event(struct ras_events *ras, struct ras_cxl_dram_event *
sqlite3_bind_int(priv->stmt_cxl_dram_event, idx++,
ev->cme_threshold_ev_flags);
sqlite3_bind_int(priv->stmt_cxl_dram_event, idx++, ev->cvme_count);
+ sqlite3_bind_int(priv->stmt_cxl_dram_event, idx++, ev->repair_needed);
rc = sqlite3_step(priv->stmt_cxl_dram_event);
if (rc != SQLITE_OK && rc != SQLITE_DONE)
diff --git a/ras-record.h b/ras-record.h
index 5eab62c..35edae1 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -238,6 +238,7 @@ struct ras_cxl_dram_event {
uint8_t res_id[CXL_PLDM_RES_ID_LEN];
uint8_t cme_threshold_ev_flags;
uint32_t cvme_count;
+ bool repair_needed;
};
struct ras_cxl_memory_module_event {
--
2.43.0