Message-ID: <20250207143028.1865-5-shiju.jose@huawei.com>
Date: Fri, 7 Feb 2025 14:30:25 +0000
From: <shiju.jose@...wei.com>
To: <linux-edac@...r.kernel.org>, <linux-cxl@...r.kernel.org>,
	<mchehab@...nel.org>, <dave.jiang@...el.com>, <dan.j.williams@...el.com>,
	<bp@...en8.de>, <jonathan.cameron@...wei.com>, <alison.schofield@...el.com>,
	<vishal.l.verma@...el.com>, <ira.weiny@...el.com>, <dave@...olabs.net>
CC: <linux-kernel@...r.kernel.org>, <linuxarm@...wei.com>,
	<tanxiaofei@...wei.com>, <prime.zeng@...ilicon.com>, <shiju.jose@...wei.com>
Subject: [PATCH 4/4] rasdaemon: cxl: Add CXL memory repair boot-up script for unrepaired memory errors

From: Shiju Jose <shiju.jose@...wei.com>

Rasdaemon supports live memory repair for reported CXL DRAM errors that
have the 'maintenance needed' flag set. However, the kernel CXL driver
rejects a live memory repair request in the following situations:
1. Memory is online and the repair is disruptive.
2. Memory is online and the event record does not match.
In addition, live memory repair is not requested if rasdaemon's auto
repair option is switched off.

In these unrepaired cases, rasdaemon stores the repair-needed
information in the DRAM event record of the SQLite database. This allows
a boot-up script to read the repair-needed flag and the repair attributes
from the database. If the memory has not been repaired, the script
issues the memory repair operation that the CXL memory device needed
in the previous boot. The kernel CXL driver sends a repair command to
the device if the memory to be repaired is offline.

Add a boot-up script for handling the unrepaired CXL DRAM memory errors
from the previous boot.

Signed-off-by: Shiju Jose <shiju.jose@...wei.com>
---
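Note (not part of the commit): the sub-class decoding and the sysfs
lookup done by the script could be factored into two small helpers. This
is only a sketch, assuming each mem_repairN directory exposes a
repair_type attribute as the script expects. The function names
repair_type_for_subclass and find_repair_idx are hypothetical and do not
appear in the patch.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of the patch.

# Map a CXL maintenance operation sub-class to the EDAC repair_type string.
# Prints nothing and returns 1 for an unknown or empty sub-class.
repair_type_for_subclass() {
	case "$1" in
	0) echo 'cacheline-sparing' ;;
	1) echo 'row-sparing' ;;
	2) echo 'bank-sparing' ;;
	3) echo 'rank-sparing' ;;
	*) return 1 ;;
	esac
}

# Print the index N of the first <dev_dir>/mem_repairN whose repair_type
# attribute equals the wanted type; return 1 if none matches.
# $1: device directory, e.g. /sys/bus/edac/devices/cxl_<memdev>
# $2: wanted repair type string
find_repair_idx() {
	local dev_dir="$1" want="$2" idx=0 out
	# Stop at the first index that has no readable repair_type attribute.
	while [ -r "$dev_dir/mem_repair$idx/repair_type" ]
	do
		out=$(cat "$dev_dir/mem_repair$idx/repair_type")
		if [ "$out" = "$want" ]
		then
			echo "$idx"
			return 0
		fi
		idx=$((idx+1))
	done
	return 1
}
```

With helpers like these, the main loop could skip a record whenever
either call returns non-zero, instead of tracking found_repair and an
empty repair_type by hand.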
 util/cxl-mem-repair.sh | 189 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 189 insertions(+)
 create mode 100755 util/cxl-mem-repair.sh

diff --git a/util/cxl-mem-repair.sh b/util/cxl-mem-repair.sh
new file mode 100755
index 0000000..2e3d261
--- /dev/null
+++ b/util/cxl-mem-repair.sh
@@ -0,0 +1,189 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved.
+#
+# Boot-up script for CXL memory repair features.
+#
+
+CXL_MAINT_CLASS_SPARING=2
+
+CXL_MAINT_SUBCLASS_CACHELINE_SPARING=0
+CXL_MAINT_SUBCLASS_ROW_SPARING=1
+CXL_MAINT_SUBCLASS_BANK_SPARING=2
+CXL_MAINT_SUBCLASS_RANK_SPARING=3
+
+RASDAEMON_SQL_DB=/usr/local/var/lib/rasdaemon/ras-mc_event.db
+EDAC_CXL_BUS_PATH=/sys/bus/edac/devices/cxl_
+
+id=1
+idx=-1
+found_repair=-1
+repair_type=''
+
+while [ "$id" ]
+do
+	id=$(sqlite3  $RASDAEMON_SQL_DB "select id from cxl_dram_event where id=$id")
+	if [ -z "$id" ]
+	then
+		break;
+	fi
+
+	repair_needed=$(sqlite3 $RASDAEMON_SQL_DB "select repair_needed from cxl_dram_event where id=$id")
+	if [[ -z "$repair_needed" || $repair_needed -eq 0 ]]
+	then
+		id=$((id+1))
+		continue;
+	fi
+
+	maint_op_class=$(sqlite3 $RASDAEMON_SQL_DB "select hdr_maint_op_class from cxl_dram_event where id=$id")
+	if [[ -z "$maint_op_class" || $maint_op_class -ne $CXL_MAINT_CLASS_SPARING ]]
+	then
+		id=$((id+1))
+		continue;
+	fi
+
+	maint_op_sub_class=$(sqlite3 $RASDAEMON_SQL_DB "select hdr_maint_op_sub_class from cxl_dram_event where id=$id")
+	if [ -z "$maint_op_sub_class" ]
+	then
+		id=$((id+1))
+		continue;
+	fi
+
+	repair_type=''
+	if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_CACHELINE_SPARING ]
+	then
+		repair_type='cacheline-sparing'
+	fi
+	if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_ROW_SPARING ]
+	then
+		repair_type='row-sparing'
+	fi
+	if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_BANK_SPARING ]
+	then
+		repair_type='bank-sparing'
+	fi
+	if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_RANK_SPARING ]
+	then
+		repair_type='rank-sparing'
+	fi
+
+	memdev=$(sqlite3 $RASDAEMON_SQL_DB "select memdev from cxl_dram_event where id=$id")
+	if [ -z "$memdev" ]
+	then
+		id=$((id+1))
+		continue;
+	fi
+
+	# find the matching sparing type in sysfs
+	idx=0
+	found_repair=0
+	while true
+	do
+		out=$(cat "$EDAC_CXL_BUS_PATH$memdev/mem_repair$idx/repair_type" 2>/dev/null)
+		if [ -z "$out" ]
+		then
+			break;
+		fi
+
+		if [ "$repair_type" = "$out" ]
+		then
+			found_repair=1
+			break;
+		fi
+		idx=$((idx+1))
+	done
+	if [ $found_repair -eq 0 ]
+	then
+		id=$((id+1))
+		continue;
+	fi
+
+	if [[ $maint_op_sub_class == $CXL_MAINT_SUBCLASS_CACHELINE_SPARING || $maint_op_sub_class == $CXL_MAINT_SUBCLASS_ROW_SPARING || $maint_op_sub_class == $CXL_MAINT_SUBCLASS_BANK_SPARING ]]
+	then
+		bank_group=$(sqlite3 $RASDAEMON_SQL_DB "select bank_group from cxl_dram_event where id=$id")
+		if [ "$bank_group" ]
+		then
+			echo $bank_group > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/bank_group"
+		else
+			id=$((id+1))
+			continue;
+		fi
+
+		bank=$(sqlite3 $RASDAEMON_SQL_DB "select bank from cxl_dram_event where id=$id")
+		if [ "$bank" ]
+		then
+			echo $bank > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/bank"
+		else
+			id=$((id+1))
+			continue;
+		fi
+
+		if [[ $maint_op_sub_class == $CXL_MAINT_SUBCLASS_CACHELINE_SPARING || $maint_op_sub_class == $CXL_MAINT_SUBCLASS_ROW_SPARING ]]
+		then
+			row=$(sqlite3 $RASDAEMON_SQL_DB "select row from cxl_dram_event where id=$id")
+			if [ "$row" ]
+			then
+				echo $row > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/row"
+			else
+				id=$((id+1))
+				continue;
+			fi
+		fi
+
+		if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_CACHELINE_SPARING ]
+		then
+			column=$(sqlite3 $RASDAEMON_SQL_DB "select column from cxl_dram_event where id=$id")
+			if [ "$column" ]
+			then
+				echo $column > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/column"
+			else
+				id=$((id+1))
+				continue;
+			fi
+
+			sub_channel=$(sqlite3 $RASDAEMON_SQL_DB "select sub_channel from cxl_dram_event where id=$id")
+			if [ "$sub_channel" ]
+			then
+				echo $sub_channel > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/sub_channel"
+			else
+				id=$((id+1))
+				continue;
+			fi
+		fi
+	fi
+
+	channel=$(sqlite3 $RASDAEMON_SQL_DB "select channel from cxl_dram_event where id=$id")
+	if [ "$channel" ]
+	then
+		echo $channel > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/channel"
+	else
+		id=$((id+1))
+		continue;
+	fi
+
+	rank=$(sqlite3 $RASDAEMON_SQL_DB "select rank from cxl_dram_event where id=$id")
+	if [ "$rank" ]
+	then
+		echo $rank > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/rank"
+	else
+		id=$((id+1))
+		continue;
+	fi
+
+	nibble_mask=$(sqlite3 $RASDAEMON_SQL_DB "select nibble_mask from cxl_dram_event where id=$id")
+	if [ "$nibble_mask" ]
+	then
+		echo $nibble_mask > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/nibble_mask"
+	else
+		id=$((id+1))
+		continue;
+	fi
+
+	echo 1 > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/repair"
+
+	# Clear the repair_needed field of the cxl_dram_event table
+	sqlite3 "$RASDAEMON_SQL_DB" "update cxl_dram_event set repair_needed = 0 where id=$id"
+
+	id=$((id+1))
+done
-- 
2.43.0

