Message-ID: <20250207143028.1865-1-shiju.jose@huawei.com>
Date: Fri, 7 Feb 2025 14:30:21 +0000
From: <shiju.jose@...wei.com>
To: <linux-edac@...r.kernel.org>, <linux-cxl@...r.kernel.org>,
	<mchehab@...nel.org>, <dave.jiang@...el.com>, <dan.j.williams@...el.com>,
	<bp@...en8.de>, <jonathan.cameron@...wei.com>, <alison.schofield@...el.com>,
	<vishal.l.verma@...el.com>, <ira.weiny@...el.com>, <dave@...olabs.net>
CC: <linux-kernel@...r.kernel.org>, <linuxarm@...wei.com>,
	<tanxiaofei@...wei.com>, <prime.zeng@...ilicon.com>, <shiju.jose@...wei.com>
Subject: [PATCH 0/4] rasdaemon: cxl: Add support for memory repair operations

From: Shiju Jose <shiju.jose@...wei.com>

CXL devices provide error records for both corrected and uncorrectable
memory errors. These errors may reflect a one-off corruption event
(no increased likelihood of repeat) or be related to a hardware problem
(more likely to repeat). There are many factors in predicting which case
we have. This patch set focuses on one particular case, in which the
device judges whether a repeated problem is likely and suggests to the
OS that it take remedial action.

CXL spec 3.1, Section 8.2.9.2.1, Table 8-43 ("Common Event Record Format")
defines the 'Maintenance Needed' flag in the Event Record Flags field,
which indicates whether the memory device requires maintenance. The CXL
DRAM and general media event handlers export to userspace (via a
tracepoint) the attributes needed for memory sparing or PPR. These are
then available for writing back to the EDAC memory repair sysfs interface,
initiating the sparing/PPR operation in the CXL memory device.
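
As a minimal illustration of the flag check described above (not the code
in this series), the sketch below tests the 'Maintenance Needed' bit in an
event record's flags value. The bit position (bit 3) is an assumption taken
from the Linux kernel's CXL event flag definitions; the function name is
hypothetical.

```python
# Assumption: 'Maintenance Needed' is bit 3 of the Event Record Flags,
# matching the Linux kernel's CXL event flag definitions.
CXL_EVENT_RECORD_FLAG_MAINT_NEEDED = 1 << 3


def maintenance_needed(hdr_flags: int) -> bool:
    """Return True if the event record flags request maintenance.

    Hypothetical helper for illustration only.
    """
    return bool(hdr_flags & CXL_EVENT_RECORD_FLAG_MAINT_NEEDED)
```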

First, this series enables rasdaemon to close the loop and perform live
memory sparing and PPR operations.

Rasdaemon supports live memory repair for reported CXL DRAM errors that
have the 'maintenance needed' flag set. However, the kernel CXL driver
rejects a live memory repair request in the following situations:
1. Memory is online and the repair is disruptive.
2. Memory is online and event record does not match.
In addition, live memory repair is not requested if the auto repair option
is switched off in rasdaemon.
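
The decision flow described above can be sketched roughly as follows. This
is an illustration under stated assumptions, not the logic in
ras-cxl-handler.c: all names are hypothetical, and request_live_repair
stands in for whatever writes the repair attributes to the EDAC memory
repair sysfs interface and reports whether the kernel accepted the request.

```python
from enum import Enum
from typing import Callable


class RepairOutcome(Enum):
    NOT_NEEDED = "not needed"
    REPAIRED = "repaired live"
    DEFERRED = "deferred to next boot"


def handle_dram_event(
    maint_needed: bool,
    auto_repair_enabled: bool,
    request_live_repair: Callable[[], bool],
) -> RepairOutcome:
    """Decide what to do with one CXL DRAM event (hypothetical sketch).

    request_live_repair returns False when the kernel rejects the
    request, e.g. memory is online and the repair is disruptive, or
    the event record does not match.
    """
    if not maint_needed:
        return RepairOutcome.NOT_NEEDED
    if not auto_repair_enabled:
        # Live repair is not requested at all when auto repair is off.
        return RepairOutcome.DEFERRED
    if request_live_repair():
        return RepairOutcome.REPAIRED
    # Kernel rejected the request: leave it for the boot-up script.
    return RepairOutcome.DEFERRED
```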

In the unrepaired cases above, the repair-needed information for CXL DRAM
events must be stored in the CXL DRAM event record of the SQLite database.
This allows a boot-up script to read the repair status and repair
attributes on the next boot. If the memory has not been repaired, the
script issues the memory repair operation that the memory device requested
in the previous boot. The kernel CXL driver sends a repair command to the
device if the memory to be repaired is offline.
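
A rough sketch of the boot-time flow follows, assuming a simplified schema.
The table name dram_event_demo, its columns, and the issue_repair callback
are all hypothetical and do not reflect the actual rasdaemon schema or
util/cxl-mem-repair.sh; the sketch only shows reading unrepaired records
and issuing a repair for each.

```python
import sqlite3
from typing import Callable, List, Tuple


def pending_repairs(conn: sqlite3.Connection) -> List[Tuple[int, str, int]]:
    """Return (id, memdev, dpa) for records still marked repair-needed.

    Table and column names are hypothetical, for illustration only.
    """
    cur = conn.execute(
        "SELECT id, memdev, dpa FROM dram_event_demo "
        "WHERE repair_needed = 1 ORDER BY id"
    )
    return cur.fetchall()


def repair_pending(
    conn: sqlite3.Connection,
    issue_repair: Callable[[str, int], bool],
) -> List[int]:
    """Issue a repair for each pending record.

    issue_repair is a hypothetical callback that requests the repair
    and returns True if the request was accepted. Returns the ids of
    the records whose repair request was accepted.
    """
    accepted = []
    for rec_id, memdev, dpa in pending_repairs(conn):
        if issue_repair(memdev, dpa):
            accepted.append(rec_id)
    return accepted
```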

Finally, this series adds a CXL memory repair boot-up script for handling
the unrepaired CXL DRAM errors from the previous boot.

Notes:
1. This series implements the userspace code for CXL memory repairs using
   the proposed EDAC memory repair interface. [1]

2. The code is based on v2 of "rasdaemon: cxl: Update CXL event logging
   and recording to CXL spec rev 3.1". [2]

1. https://lore.kernel.org/linux-cxl/20250106121017.1620-1-shiju.jose@huawei.com/T/#maf191b2a104591f993da00249e67bd483ab67ce0
2. https://lore.kernel.org/lkml/20250110122641.1668-1-shiju.jose@huawei.com/

Shiju Jose (4):
  rasdaemon: cxl: Add support for memory sparing operation
  rasdaemon: cxl: Add support for memory soft PPR operation
  rasdaemon: cxl: Add storing memory repair needed info in the DRAM
    event record
  rasdaemon: cxl: Add CXL memory repair boot-up script for unrepaired
    memory errors

 misc/rasdaemon.env     |   4 +
 ras-cxl-handler.c      | 386 +++++++++++++++++++++++++++++++++++++++++
 ras-record.c           |   2 +
 ras-record.h           |   1 +
 util/cxl-mem-repair.sh | 189 ++++++++++++++++++++
 5 files changed, 582 insertions(+)
 create mode 100755 util/cxl-mem-repair.sh

-- 
2.43.0

