Message-ID: <20231129075034.2159223-1-muralimk@amd.com>
Date:   Wed, 29 Nov 2023 07:50:30 +0000
From:   Muralidhara M K <muralimk@....com>
To:     <linux-edac@...r.kernel.org>
CC:     <linux-kernel@...r.kernel.org>, <bp@...en8.de>,
        <mchehab@...nel.org>, Muralidhara M K <muralidhara.mk@....com>
Subject: [PATCH 0/4] Persist FRU memory poisons

From: Muralidhara M K <muralidhara.mk@....com>

This patch set is based on the patches submitted at:
https://lore.kernel.org/linux-edac/20231129073521.2127403-1-muralimk@amd.com/T/#t

MI300A has on-die HBMv3 memory embedded in the socket. Once the number of
memory errors reaches a threshold, the socket has to be replaced. Define the
criteria to identify the Field Replaceable Unit (FRU) based on the number of
poisoned pages in the socket by persisting them in non-volatile storage.

A notifier is registered to handle FRU memory poisons. The poison count is
incremented on injected MCE errors until it reaches the maximum number of
FRU poison entries.
A per-FRU sysfs entry makes it easy to look into the poison details.
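
The counting behaviour described above can be illustrated with a minimal
user-space sketch. It is not the code from this series: the names
fru_rec, fru_record_poison(), fru_needs_replacement() and the limit
FMP_MAX_ENTRIES are hypothetical, and the real limit would presumably follow
from the record size available in NV storage.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap on poison entries per FRU (assumption, not from the series). */
#define FMP_MAX_ENTRIES 8

/* Hypothetical per-FRU record: the poisoned addresses seen so far. */
struct fru_rec {
	uint64_t addrs[FMP_MAX_ENTRIES];
	size_t count;
};

/*
 * Record one poisoned address for this FRU.
 * Returns true if the entry was stored, false if the record is already full.
 */
static bool fru_record_poison(struct fru_rec *rec, uint64_t addr)
{
	if (rec->count >= FMP_MAX_ENTRIES)
		return false;

	rec->addrs[rec->count++] = addr;
	return true;
}

/* The FRU has hit the threshold once every entry slot is used. */
static bool fru_needs_replacement(const struct fru_rec *rec)
{
	return rec->count >= FMP_MAX_ENTRIES;
}
```

In this sketch a notifier callback would call fru_record_poison() for each
reported error and stop adding entries once it returns false.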

During boot, read the ERST records to identify the poison addresses and
retire all system physical addresses in that HBM row.
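
The boot-time flow can be sketched in user-space C as below. Everything
here is an assumption made for illustration: the callback types, the helper
names, COLS_PER_ROW, and in particular toy_xlate(), which pretends that a row
is a run of consecutive pages. On real hardware the mapping from an HBM row
to system physical addresses comes from the platform's address translation
and is generally not contiguous.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* 4 KiB pages, assumed */
#define COLS_PER_ROW      4	/* hypothetical number of columns per row */
#define LOG_CAPACITY      16

/* Hypothetical: map (address inside a row, column index) to a physical address. */
typedef uint64_t (*row_col_to_spa_fn)(uint64_t poison_spa, unsigned int col);
/* Hypothetical: retire (offline) one page frame. */
typedef void (*retire_pfn_fn)(uint64_t pfn, void *ctx);

/* Retire every page that shares a row with one saved poison address. */
static size_t retire_row(uint64_t poison_spa, row_col_to_spa_fn xlate,
			 retire_pfn_fn retire, void *ctx)
{
	size_t n = 0;

	for (unsigned int col = 0; col < COLS_PER_ROW; col++) {
		uint64_t spa = xlate(poison_spa, col);

		retire(spa >> SKETCH_PAGE_SHIFT, ctx);
		n++;
	}
	return n;
}

/* Walk all poison addresses restored from persistent storage. */
static size_t retire_saved_records(const uint64_t *recs, size_t nrecs,
				   row_col_to_spa_fn xlate,
				   retire_pfn_fn retire, void *ctx)
{
	size_t total = 0;

	for (size_t i = 0; i < nrecs; i++)
		total += retire_row(recs[i], xlate, retire, ctx);
	return total;
}

/* Toy mapping for the sketch only: a row is COLS_PER_ROW consecutive pages. */
static uint64_t toy_xlate(uint64_t poison_spa, unsigned int col)
{
	uint64_t row_bytes = (uint64_t)COLS_PER_ROW << SKETCH_PAGE_SHIFT;
	uint64_t row_base = poison_spa & ~(row_bytes - 1);

	return row_base + ((uint64_t)col << SKETCH_PAGE_SHIFT);
}

/* Toy retire callback: log the page frame numbers instead of offlining them. */
struct retire_log {
	uint64_t pfns[LOG_CAPACITY];
	size_t count;
};

static void log_retire(uint64_t pfn, void *ctx)
{
	struct retire_log *log = ctx;

	if (log->count < LOG_CAPACITY)
		log->pfns[log->count++] = pfn;
}
```

Passing the translation and retire steps as callbacks keeps the record walk
separate from the hardware-specific parts, which this sketch does not model.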

Patch 1:
Add an API to get the maximum CPER record size to be stored in NV storage

Patch 2:
Add FRU memory poison module

Patch 3:
Add sysfs entry to print the required error information from poison records

Patch 4:
Add documentation on FRU memory poisons.

Muralidhara M K (4):
  ACPI/APEI: Add erst_get_size() API
  RAS/fmp: Add FRU memory poison CPER support for Error persistence
  EDAC/amd64: Add sysfs entry to read FRU poison data
  RAS/fmp: Add Documentation on Persistence of FRU memory poisons

 Documentation/RAS/ras.rst        | 122 +++++++
 MAINTAINERS                      |   8 +
 drivers/acpi/apei/erst.c         |   9 +
 drivers/edac/amd64_edac.c        |  25 ++
 drivers/ras/Kconfig              |   1 +
 drivers/ras/Makefile             |   1 +
 drivers/ras/fmp/Kconfig          |  18 +
 drivers/ras/fmp/Makefile         |  10 +
 drivers/ras/fmp/fru_mem_poison.c | 595 +++++++++++++++++++++++++++++++
 include/acpi/apei.h              |   1 +
 include/linux/cper.h             |  24 ++
 include/linux/fru_mem_poison.h   |  17 +
 12 files changed, 831 insertions(+)
 create mode 100644 drivers/ras/fmp/Kconfig
 create mode 100644 drivers/ras/fmp/Makefile
 create mode 100644 drivers/ras/fmp/fru_mem_poison.c
 create mode 100644 include/linux/fru_mem_poison.h

-- 
2.25.1
