Message-ID: <df8b3c3bffd24e1e8eb05b2ec53b3c58@huawei.com>
Date: Tue, 14 Jan 2025 12:31:44 +0000
From: Shiju Jose <shiju.jose@...wei.com>
To: Mauro Carvalho Chehab <mchehab+huawei@...nel.org>
CC: "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-cxl@...r.kernel.org" <linux-cxl@...r.kernel.org>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "bp@...en8.de" <bp@...en8.de>,
"tony.luck@...el.com" <tony.luck@...el.com>, "rafael@...nel.org"
<rafael@...nel.org>, "lenb@...nel.org" <lenb@...nel.org>,
"mchehab@...nel.org" <mchehab@...nel.org>, "dan.j.williams@...el.com"
<dan.j.williams@...el.com>, "dave@...olabs.net" <dave@...olabs.net>,
"Jonathan Cameron" <jonathan.cameron@...wei.com>, "dave.jiang@...el.com"
<dave.jiang@...el.com>, "alison.schofield@...el.com"
<alison.schofield@...el.com>, "vishal.l.verma@...el.com"
<vishal.l.verma@...el.com>, "ira.weiny@...el.com" <ira.weiny@...el.com>,
"david@...hat.com" <david@...hat.com>, "Vilas.Sridharan@....com"
<Vilas.Sridharan@....com>, "leo.duran@....com" <leo.duran@....com>,
"Yazen.Ghannam@....com" <Yazen.Ghannam@....com>, "rientjes@...gle.com"
<rientjes@...gle.com>, "jiaqiyan@...gle.com" <jiaqiyan@...gle.com>,
"Jon.Grimm@....com" <Jon.Grimm@....com>, "dave.hansen@...ux.intel.com"
<dave.hansen@...ux.intel.com>, "naoya.horiguchi@....com"
<naoya.horiguchi@....com>, "james.morse@....com" <james.morse@....com>,
"jthoughton@...gle.com" <jthoughton@...gle.com>, "somasundaram.a@....com"
<somasundaram.a@....com>, "erdemaktas@...gle.com" <erdemaktas@...gle.com>,
"pgonda@...gle.com" <pgonda@...gle.com>, "duenwen@...gle.com"
<duenwen@...gle.com>, "gthelen@...gle.com" <gthelen@...gle.com>,
"wschwartz@...erecomputing.com" <wschwartz@...erecomputing.com>,
"dferguson@...erecomputing.com" <dferguson@...erecomputing.com>,
"wbs@...amperecomputing.com" <wbs@...amperecomputing.com>,
"nifan.cxl@...il.com" <nifan.cxl@...il.com>, tanxiaofei
<tanxiaofei@...wei.com>, "Zengtao (B)" <prime.zeng@...ilicon.com>, "Roberto
Sassu" <roberto.sassu@...wei.com>, "kangkang.shen@...urewei.com"
<kangkang.shen@...urewei.com>, wanghuiqiang <wanghuiqiang@...wei.com>,
Linuxarm <linuxarm@...wei.com>
Subject: RE: [PATCH v18 04/19] EDAC: Add memory repair control feature
Hi Mauro,
Thanks for the comments.
>-----Original Message-----
>From: Mauro Carvalho Chehab <mchehab+huawei@...nel.org>
>Sent: 14 January 2025 11:48
>To: Shiju Jose <shiju.jose@...wei.com>
>Cc: linux-edac@...r.kernel.org; linux-cxl@...r.kernel.org; linux-
>acpi@...r.kernel.org; linux-mm@...ck.org; linux-kernel@...r.kernel.org;
>bp@...en8.de; tony.luck@...el.com; rafael@...nel.org; lenb@...nel.org;
>mchehab@...nel.org; dan.j.williams@...el.com; dave@...olabs.net; Jonathan
>Cameron <jonathan.cameron@...wei.com>; dave.jiang@...el.com;
>alison.schofield@...el.com; vishal.l.verma@...el.com; ira.weiny@...el.com;
>david@...hat.com; Vilas.Sridharan@....com; leo.duran@....com;
>Yazen.Ghannam@....com; rientjes@...gle.com; jiaqiyan@...gle.com;
>Jon.Grimm@....com; dave.hansen@...ux.intel.com;
>naoya.horiguchi@....com; james.morse@....com; jthoughton@...gle.com;
>somasundaram.a@....com; erdemaktas@...gle.com; pgonda@...gle.com;
>duenwen@...gle.com; gthelen@...gle.com;
>wschwartz@...erecomputing.com; dferguson@...erecomputing.com;
>wbs@...amperecomputing.com; nifan.cxl@...il.com; tanxiaofei
><tanxiaofei@...wei.com>; Zengtao (B) <prime.zeng@...ilicon.com>; Roberto
>Sassu <roberto.sassu@...wei.com>; kangkang.shen@...urewei.com;
>wanghuiqiang <wanghuiqiang@...wei.com>; Linuxarm
><linuxarm@...wei.com>
>Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
>
>Em Mon, 6 Jan 2025 12:10:00 +0000
><shiju.jose@...wei.com> escreveu:
>
>> From: Shiju Jose <shiju.jose@...wei.com>
>>
>> Add a generic EDAC memory repair control driver to manage memory repairs
>> in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing
>> features.
>>
>> For example, a CXL device with DRAM components that support PPR features
>> may implement PPR maintenance operations. DRAM components may support two
>> types of PPR, hard PPR, for a permanent row repair, and soft PPR, for a
>> temporary row repair. Soft PPR is much faster than hard PPR, but the repair
>> is lost with a power cycle.
>> Similarly a CXL memory device may support soft and hard memory sparing at
>> cacheline, row, bank and rank granularities. Memory sparing is defined as
>> a repair function that replaces a portion of memory with a portion of
>> functional memory at that same granularity.
>> When a CXL device detects an error in a memory, it may report the host of
>> the need for a repair maintenance operation by using an event record where
>> the "maintenance needed" flag is set. The event records contains the device
>> physical address(DPA) and other attributes of the memory to repair (such as
>> channel, sub-channel, bank group, bank, rank, row, column etc). The kernel
>> will report the corresponding CXL general media or DRAM trace event to
>> userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
>> operation in response to the device request via the sysfs repair control.
>>
>> Device with memory repair features registers with EDAC device driver,
>> which retrieves memory repair descriptor from EDAC memory repair driver
>> and exposes the sysfs repair control attributes to userspace in
>> /sys/bus/edac/devices/<dev-name>/mem_repairX/.
>>
>> The common memory repair control interface abstracts the control of
>> arbitrary memory repair functionality into a standardized set of functions.
>> The sysfs memory repair attribute nodes are only available if the client
>> driver has implemented the corresponding attribute callback function and
>> provided operations to the EDAC device driver during registration.
>>
>> Signed-off-by: Shiju Jose <shiju.jose@...wei.com>
>> ---
>> .../ABI/testing/sysfs-edac-memory-repair | 244 +++++++++
>> Documentation/edac/features.rst | 3 +
>> Documentation/edac/index.rst | 1 +
>> Documentation/edac/memory_repair.rst | 101 ++++
>> drivers/edac/Makefile | 2 +-
>> drivers/edac/edac_device.c | 33 ++
>> drivers/edac/mem_repair.c | 492 ++++++++++++++++++
>> include/linux/edac.h | 139 +++++
>> 8 files changed, 1014 insertions(+), 1 deletion(-)
>> create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
>> create mode 100644 Documentation/edac/memory_repair.rst
>> create mode 100755 drivers/edac/mem_repair.c
>>
>> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair
>> new file mode 100644
>> index 000000000000..e9268f3780ed
>> --- /dev/null
>> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
>> @@ -0,0 +1,244 @@
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX
>> +Date: Jan 2025
>> +KernelVersion: 6.14
>> +Contact: linux-edac@...r.kernel.org
>> +Description:
>> + The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
>> + pertains to the memory media repair features control, such as
>> + PPR (Post Package Repair), memory sparing etc, where <dev-name>
>> + directory corresponds to a device registered with the EDAC
>> + device driver for the memory repair features.
>> +
>> + Post Package Repair is a maintenance operation requests the memory
>> + device to perform a repair operation on its media, in detail is a
>> + memory self-healing feature that fixes a failing memory location by
>> + replacing it with a spare row in a DRAM device. For example, a
>> + CXL memory device with DRAM components that support PPR features may
>> + implement PPR maintenance operations. DRAM components may support
>> + two types of PPR functions: hard PPR, for a permanent row repair, and
>> + soft PPR, for a temporary row repair. soft PPR is much faster than
>> + hard PPR, but the repair is lost with a power cycle.
>> +
>> + Memory sparing is a repair function that replaces a portion
>> + of memory with a portion of functional memory at that same
>> + sparing granularity. Memory sparing has cacheline/row/bank/rank
>> + sparing granularities. For example, in memory-sparing mode,
>> + one memory rank serves as a spare for other ranks on the same
>> + channel in case they fail. The spare rank is held in reserve and
>> + not used as active memory until a failure is indicated, with
>> + reserved capacity subtracted from the total available memory
>> + in the system.The DIMM installation order for memory sparing
>> + varies based on the number of processors and memory modules
>> + installed in the server. After an error threshold is surpassed
>> + in a system protected by memory sparing, the content of a failing
>> + rank of DIMMs is copied to the spare rank. The failing rank is
>> + then taken offline and the spare rank placed online for use as
>> + active memory in place of the failed rank.
>> +
>> + The sysfs attributes nodes for a repair feature are only
>> + present if the parent driver has implemented the corresponding
>> + attr callback function and provided the necessary operations
>> + to the EDAC device driver during registration.
>> +
>> + In some states of system configuration (e.g. before address
>> + decoders have been configured), memory devices (e.g. CXL)
>> + may not have an active mapping in the main host address
>> + physical address map. As such, the memory to repair must be
>> + identified by a device specific physical addressing scheme
>> + using a device physical address(DPA). The DPA and other control
>> + attributes to use will be presented in related error records.
>> +
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
>> +Date: Jan 2025
>> +KernelVersion: 6.14
>> +Contact: linux-edac@...r.kernel.org
>> +Description:
>> + (RO) Memory repair function type. For eg. post package repair,
>> + memory sparing etc.
>> + EDAC_SOFT_PPR - Soft post package repair
>> + EDAC_HARD_PPR - Hard post package repair
>> + EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
>> + EDAC_ROW_MEM_SPARING - Row memory sparing
>> + EDAC_BANK_MEM_SPARING - Bank memory sparing
>> + EDAC_RANK_MEM_SPARING - Rank memory sparing
>> + All other values are reserved.
>
>Too big strings. Why are them in upper cases? IMO:
>
> soft-ppr, hard-ppr, ... would be enough.
>
This attribute returns the repair type of the memory repair instance as a single
numeric value (such as 0, 1 or 2), not as a decoded string (e.g. "EDAC_SOFT_PPR").
The values are defined as enums (EDAC_SOFT_PPR, EDAC_HARD_PPR, etc.) for the
memory repair interface in include/linux/edac.h:
enum edac_mem_repair_function {
	EDAC_SOFT_PPR,
	EDAC_HARD_PPR,
	EDAC_CACHELINE_MEM_SPARING,
	EDAC_ROW_MEM_SPARING,
	EDAC_BANK_MEM_SPARING,
	EDAC_RANK_MEM_SPARING,
};
I documented the return value in terms of the above enums.
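To illustrate the numeric return value, here is a minimal userspace sketch (not
part of the patch) that decodes the number read from repair_function. The helper
name repair_function_name() and the human-readable labels are hypothetical; only
the enum itself is taken from include/linux/edac.h in this series.

```c
#include <stddef.h>

/* Copied from the enum above (include/linux/edac.h in this series). */
enum edac_mem_repair_function {
	EDAC_SOFT_PPR,
	EDAC_HARD_PPR,
	EDAC_CACHELINE_MEM_SPARING,
	EDAC_ROW_MEM_SPARING,
	EDAC_BANK_MEM_SPARING,
	EDAC_RANK_MEM_SPARING,
};

/*
 * Hypothetical helper: map the single numeric value read from
 * .../mem_repairX/repair_function to a label. The labels are
 * illustrative only; the ABI documents every other value as reserved.
 */
static const char *repair_function_name(long val)
{
	static const char * const names[] = {
		[EDAC_SOFT_PPR]              = "soft PPR",
		[EDAC_HARD_PPR]              = "hard PPR",
		[EDAC_CACHELINE_MEM_SPARING] = "cacheline memory sparing",
		[EDAC_ROW_MEM_SPARING]       = "row memory sparing",
		[EDAC_BANK_MEM_SPARING]      = "bank memory sparing",
		[EDAC_RANK_MEM_SPARING]      = "rank memory sparing",
	};

	if (val < 0 || (size_t)val >= sizeof(names) / sizeof(names[0]))
		return "reserved";
	return names[val];
}
```

A tool would parse the sysfs file content with strtol() and pass the result to
this helper, so the decoding stays in userspace rather than in the kernel ABI.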
>Also, Is it mandatory that all types are supported? If not, you need a
>way to report to userspace what of them are supported. One option
>would be that reading /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
>would return something like:
>
> soft-ppr [hard-ppr] row-mem-sparing
>
Same as above: the value is not returned as a decoded string.
>Also, as this will be parsed in ReST format, you need to change the
>description to use bullets, otherwise the html/pdf version of the
>document will place everything on a single line. E.g. something like:
Sure.
>
>Description:
> (RO) Memory repair function type. For eg. post package repair,
> memory sparing etc. Can be:
>
> - EDAC_SOFT_PPR - Soft post package repair
> - EDAC_HARD_PPR - Hard post package repair
> - EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
> - EDAC_ROW_MEM_SPARING - Row memory sparing
> - EDAC_BANK_MEM_SPARING - Bank memory sparing
> - EDAC_RANK_MEM_SPARING - Rank memory sparing
> - All other values are reserved.
>
>Same applies to other sysfs nodes. See for instance:
>
> Documentation/ABI/stable/sysfs-class-backlight
>
>And see how it is formatted after Sphinx processing at the Kernel
>Admin guide:
>
> https://www.kernel.org/doc/html/latest/admin-guide/abi-stable.html#symbols-under-sys-class
>
>Please fix it on all places you have a list of values.
Sure.
>
>> +
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
>> +Date: Jan 2025
>> +KernelVersion: 6.14
>> +Contact: linux-edac@...r.kernel.org
>> +Description:
>> + (RW) Read/Write the current persist repair mode set for a
>> + repair function. Persist repair modes supported in the
>> + device, based on the memory repair function is temporary
>> + or permanent and is lost with a power cycle.
>> + EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
>> + EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
>> + All other values are reserved.
>
>Same here: edac/ is already in the path. No need to place EDAC_ at the name.
>
Same as above. This attribute returns a single value, not a decoded string, but it
is documented in terms of the enums defined for the interface in include/linux/edac.h.
>> +
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/dpa_support
>> +Date: Jan 2025
>> +KernelVersion: 6.14
>> +Contact: linux-edac@...r.kernel.org
>> +Description:
>> + (RO) True if memory device required device physical
>> + address (DPA) of memory to repair.
>> + False if memory device required host specific physical
>> + address (HPA) of memory to repair.
>
>Please remove the extra spaces before "address", as otherwise conversion to
>ReST may do the wrong thing or may produce doc warnings.
Will fix.
>
>> + In some states of system configuration (e.g. before address
>> + decoders have been configured), memory devices (e.g. CXL)
>> + may not have an active mapping in the main host address
>> + physical address map. As such, the memory to repair must be
>> + identified by a device specific physical addressing scheme
>> + using a DPA. The device physical address(DPA) to use will be
>> + presented in related error records.
>> +
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
>> +Date: Jan 2025
>> +KernelVersion: 6.14
>> +Contact: linux-edac@...r.kernel.org
>> +Description:
>> + (RO) True if memory media is accessible and data is retained
>> + during the memory repair operation.
>> + The data may not be retained and memory requests may not be
>> + correctly processed during a repair operation. In such case
>> + the repair operation should not executed at runtime.
>
>Please add an extra line before "The data" to ensure that the output at
>the admin-guide won't merge the two paragraphs. Same on other places along
>this patch series: paragraphs need a blank line at the description.
>
Sure.
>> +
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
>> +Date: Jan 2025
>> +KernelVersion: 6.14
>> +Contact: linux-edac@...r.kernel.org
>> +Description:
>> + (RW) Host Physical Address (HPA) of the memory to repair.
>> + See attribute 'dpa_support' for more details.
>> + The HPA to use will be provided in related error records.
>> +
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
>> +Date: Jan 2025
>> +KernelVersion: 6.14
>> +Contact: linux-edac@...r.kernel.org
>> +Description:
>> + (RW) Device Physical Address (DPA) of the memory to repair.
>> + See attribute 'dpa_support' for more details.
>> + The specific DPA to use will be provided in related error
>> + records.
>> +
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
>> +Date: Jan 2025
>> +KernelVersion: 6.14
>> +Contact: linux-edac@...r.kernel.org
>> +Description:
>> + (RW) Read/Write Nibble mask of the memory to repair.
>> + Nibble mask identifies one or more nibbles in error on the
>> + memory bus that produced the error event. Nibble Mask bit 0
>> + shall be set if nibble 0 on the memory bus produced the
>> + event, etc. For example, CXL PPR and sparing, a nibble mask
>> + bit set to 1 indicates the request to perform repair
>> + operation in the specific device. All nibble mask bits set
>> + to 1 indicates the request to perform the operation in all
>> + devices. For CXL memory to repiar, the specific value of
>> + nibble mask to use will be provided in related error records.
>> + For more details, See nibble mask field in CXL spec ver 3.1,
>> + section 8.2.9.7.1.2 Table 8-103 soft PPR and section
>> + 8.2.9.7.1.3 Table 8-104 hard PPR, section 8.2.9.7.1.4
>> + Table 8-105 memory sparing.
>> +
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/rank
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/row
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/column
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/channel
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
>> +Date: Jan 2025
>> +KernelVersion: 6.14
>> +Contact: linux-edac@...r.kernel.org
>> +Description:
>> + (RW) The control attributes associated with memory address
>> + that is to be repaired. The specific value of attributes to
>> + use depends on the portion of memory to repair and may be
>> + reported to host in related error records and may be
>> + available to userspace in trace events, such as in CXL
>> + memory devices.
>> +
>> + channel - The channel of the memory to repair. Channel is
>> + defined as an interface that can be independently accessed
>> + for a transaction.
>> + rank - The rank of the memory to repair. Rank is defined as a
>> + set of memory devices on a channel that together execute a
>> + transaction.
>> + bank_group - The bank group of the memory to repair.
>> + bank - The bank number of the memory to repair.
>> + row - The row number of the memory to repair.
>> + column - The column number of the memory to repair.
>> + sub_channel - The sub-channel of the memory to repair.
>
>Same problem here with regards to bad ReST input. I would do:
>
> channel
> The channel of the memory to repair. Channel is
> defined as an interface that can be independently accessed
> for a transaction.
>
> rank
> The rank of the memory to repair. Rank is defined as a
> set of memory devices on a channel that together execute a
> transaction.
>
Sure. Will fix.
>as this would provide a better output at admin-guide while still being
>nicer to read as text.
>
>> +
>> + The requirement to set these attributes varies based on the
>> + repair function. The attributes in sysfs are not present
>> + unless required for a repair function.
>> + For example, CXL spec ver 3.1, Section 8.2.9.7.1.2 Table 8-103
>> + soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard PPR operations,
>> + these attributes are not required to set.
>> + For example, CXL spec ver 3.1, Section 8.2.9.7.1.4 Table 8-105
>> + memory sparing, these attributes are required to set based on
>> + memory sparing granularity as follows.
>> + Channel: Channel associated with the DPA that is to be spared
>> + and applies to all subclasses of sparing (cacheline, bank,
>> + row and rank sparing).
>> + Rank: Rank associated with the DPA that is to be spared and
>> + applies to all subclasses of sparing.
>> + Bank & Bank Group: Bank & bank group are associated with
>> + the DPA that is to be spared and applies to cacheline sparing,
>> + row sparing and bank sparing subclasses.
>> + Row: Row associated with the DPA that is to be spared and
>> + applies to cacheline sparing and row sparing subclasses.
>> + Column: column associated with the DPA that is to be spared
>> + and applies to cacheline sparing only.
>> + Sub-channel: sub-channel associated with the DPA that is to
>> + be spared and applies to cacheline sparing only.
>
>Same here: this will all be on a single paragraph which would be really
>weird.
Will fix.
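For reference while reworking this list, the per-granularity requirements quoted
above can be summarized in a small sketch. This is only an illustration of the
table as described in the documentation text; the enum and function names
(sparing_granularity, repair_attr, sparing_required_attrs) are hypothetical and
do not exist in the patch.

```c
/* Hypothetical names, for illustration only. */
enum sparing_granularity {
	SPARE_CACHELINE,
	SPARE_ROW,
	SPARE_BANK,
	SPARE_RANK,
};

enum repair_attr {
	ATTR_CHANNEL     = 1 << 0,
	ATTR_RANK        = 1 << 1,
	ATTR_BANK_GROUP  = 1 << 2,
	ATTR_BANK        = 1 << 3,
	ATTR_ROW         = 1 << 4,
	ATTR_COLUMN      = 1 << 5,
	ATTR_SUB_CHANNEL = 1 << 6,
};

/* Return the bitmask of attributes that must be set for a sparing subclass. */
static unsigned int sparing_required_attrs(enum sparing_granularity g)
{
	/* Channel and rank apply to all subclasses of sparing. */
	unsigned int attrs = ATTR_CHANNEL | ATTR_RANK;

	switch (g) {
	case SPARE_CACHELINE:
		/* Column and sub-channel apply to cacheline sparing only. */
		attrs |= ATTR_BANK_GROUP | ATTR_BANK | ATTR_ROW |
			 ATTR_COLUMN | ATTR_SUB_CHANNEL;
		break;
	case SPARE_ROW:
		/* Row applies to cacheline and row sparing. */
		attrs |= ATTR_BANK_GROUP | ATTR_BANK | ATTR_ROW;
		break;
	case SPARE_BANK:
		/* Bank and bank group apply to cacheline, row and bank sparing. */
		attrs |= ATTR_BANK_GROUP | ATTR_BANK;
		break;
	case SPARE_RANK:
		break;
	}
	return attrs;
}
```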
>
>> +
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel
>> +Date: Jan 2025
>> +KernelVersion: 6.14
>> +Contact: linux-edac@...r.kernel.org
>> +Description:
>> + (RW) The supported range of control attributes (optional)
>> + associated with a memory address that is to be repaired.
>> + The memory device may give the supported range of
>> + attributes to use and it will depend on the memory device
>> + and the portion of memory to repair.
>> + The userspace may receive the specific value of attributes
>> + to use for a repair operation from the memory device via
>> + related error records and trace events, such as in CXL
>> + memory devices.
>> +
>> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair
>> +Date: Jan 2025
>> +KernelVersion: 6.14
>> +Contact: linux-edac@...r.kernel.org
>> +Description:
>> + (WO) Issue the memory repair operation for the specified
>> + memory repair attributes. The operation may fail if resources
>> + are insufficient based on the requirements of the memory
>> + device and repair function.
>> + EDAC_DO_MEM_REPAIR - issue repair operation.
>> + All other values are reserved.
>> diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
>> index ba3ab993ee4f..bfd5533b81b7 100644
>> --- a/Documentation/edac/features.rst
>> +++ b/Documentation/edac/features.rst
>> @@ -97,3 +97,6 @@ RAS features
>> ------------
>> 1. Memory Scrub
>> Memory scrub features are documented in `Documentation/edac/scrub.rst`.
>> +
>> +2. Memory Repair
>> +Memory repair features are documented in `Documentation/edac/memory_repair.rst`.
>> diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
>> index dfb0c9fb9ab1..d6778f4562dd 100644
>> --- a/Documentation/edac/index.rst
>> +++ b/Documentation/edac/index.rst
>> @@ -8,4 +8,5 @@ EDAC Subsystem
>> :maxdepth: 1
>>
>> features
>> + memory_repair
>> scrub
>> diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst
>> new file mode 100644
>> index 000000000000..2787a8a2d6ba
>> --- /dev/null
>> +++ b/Documentation/edac/memory_repair.rst
>> @@ -0,0 +1,101 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +==========================
>> +EDAC Memory Repair Control
>> +==========================
>> +
>> +Copyright (c) 2024 HiSilicon Limited.
>> +
>> +:Author: Shiju Jose <shiju.jose@...wei.com>
>> +:License: The GNU Free Documentation License, Version 1.2
>> + (dual licensed under the GPL v2)
>> +:Original Reviewers:
>> +
>> +- Written for: 6.14
>
>See my comments with regards to license on the previous patches.
Ok.
>
>> +
>> +Introduction
>> +------------
>> +Memory devices may support repair operations to address issues in their
>> +memory media. Post Package Repair (PPR) and memory sparing are examples
>> +of such features.
>> +
>> +Post Package Repair(PPR)
>> +~~~~~~~~~~~~~~~~~~~~~~~~
>> +Post Package Repair is a maintenance operation requests the memory device
>> +to perform repair operation on its media, in detail is a memory self-healing
>> +feature that fixes a failing memory location by replacing it with a spare
>> +row in a DRAM device. For example, a CXL memory device with DRAM components
>> +that support PPR features may implement PPR maintenance operations. DRAM
>> +components may support types of PPR functions, hard PPR, for a permanent row
>> +repair, and soft PPR, for a temporary row repair. Soft PPR is much faster
>> +than hard PPR, but the repair is lost with a power cycle. The data may not
>> +be retained and memory requests may not be correctly processed during a
>> +repair operation. In such case, the repair operation should not executed
>> +at runtime.
>> +For example, CXL memory devices, soft PPR and hard PPR repair operations
>> +may be supported. See CXL spec rev 3.1 sections 8.2.9.7.1.1 PPR Maintenance
>> +Operations, 8.2.9.7.1.2 sPPR Maintenance Operation and 8.2.9.7.1.3 hPPR
>> +Maintenance Operation for more details.
>
>Paragraphs require blank lines in ReST. Also, please place a link to the
>specs.
>
>I strongly suggest looking at the output of all docs with make htmldocs
>and make pdfdocs to be sure that the paragraphs and the final document
>will be properly handled. You may use:
>
> SPHINXDIRS="<book name(s)>"
>
>to speed-up documentation builds.
>
>Please see Sphinx documentation for more details about what it is expected
>there:
>
> https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html
Thanks for the information. I will check and fix this.
I had fixed the blank-line requirements in most of the main documentation,
but I was not aware of where the output of the ABI docs ends up, so I missed them.
>
>> +
>> +Memory Sparing
>> +~~~~~~~~~~~~~~
>> +Memory sparing is a repair function that replaces a portion of memory with
>> +a portion of functional memory at that same sparing granularity. Memory
>> +sparing has cacheline/row/bank/rank sparing granularities. For example, in
>> +memory-sparing mode, one memory rank serves as a spare for other ranks on
>> +the same channel in case they fail. The spare rank is held in reserve and
>> +not used as active memory until a failure is indicated, with reserved
>> +capacity subtracted from the total available memory in the system. The DIMM
>> +installation order for memory sparing varies based on the number of processors
>> +and memory modules installed in the server. After an error threshold is
>> +surpassed in a system protected by memory sparing, the content of a failing
>> +rank of DIMMs is copied to the spare rank. The failing rank is then taken
>> +offline and the spare rank placed online for use as active memory in place
>> +of the failed rank.
>> +
>> +For example, CXL memory devices may support various subclasses for sparing
>> +operation vary in terms of the scope of the sparing being performed.
>> +Cacheline sparing subclass refers to a sparing action that can replace a
>> +full cacheline. Row sparing is provided as an alternative to PPR sparing
>> +functions and its scope is that of a single DDR row. Bank sparing allows
>> +an entire bank to be replaced. Rank sparing is defined as an operation
>> +in which an entire DDR rank is replaced. See CXL spec 3.1 section
>> +8.2.9.7.1.4 Memory Sparing Maintenance Operations for more details.
>> +
>> +Use cases of generic memory repair features control
>> +---------------------------------------------------
>> +
>> +1. The soft PPR , hard PPR and memory-sparing features share similar
>> +control attributes. Therefore, there is a need for a standardized, generic
>> +sysfs repair control that is exposed to userspace and used by
>> +administrators, scripts and tools.
>> +
>> +2. When a CXL device detects an error in a memory component, it may inform
>> +the host of the need for a repair maintenance operation by using an event
>> +record where the "maintenance needed" flag is set. The event record
>> +specifies the device physical address(DPA) and attributes of the memory that
>> +requires repair. The kernel reports the corresponding CXL general media or
>> +DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) initiate
>> +a repair maintenance operation in response to the device request using the
>> +sysfs repair control.
>> +
>> +3. Userspace tools, such as rasdaemon, may request a PPR/sparing on a memory
>> +region when an uncorrected memory error or an excess of corrected memory
>> +errors is reported on that memory.
>> +
>> +4. Multiple PPR/sparing instances may be present per memory device.
>> +
>> +The File System
>> +---------------
>> +
>> +The control attributes of a registered memory repair instance could be
>> +accessed in the
>> +
>> +/sys/bus/edac/devices/<dev-name>/mem_repairX/
>> +
>> +sysfs
>> +-----
>> +
>> +Sysfs files are documented in
>> +
>> +`Documentation/ABI/testing/sysfs-edac-memory-repair`.
>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>> index 3a49304860f0..1de9fe66ac6b 100644
>> --- a/drivers/edac/Makefile
>> +++ b/drivers/edac/Makefile
>> @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC) := edac_core.o
>>
>> edac_core-y := edac_mc.o edac_device.o edac_mc_sysfs.o
>> edac_core-y += edac_module.o edac_device_sysfs.o wq.o
>> -edac_core-y += scrub.o ecs.o
>> +edac_core-y += scrub.o ecs.o mem_repair.o
>>
>> edac_core-$(CONFIG_EDAC_DEBUG) += debugfs.o
>>
>> diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
>> index 1c1142a2e4e4..a401d81dad8a 100644
>> --- a/drivers/edac/edac_device.c
>> +++ b/drivers/edac/edac_device.c
>> @@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev)
>> {
>> struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
>>
>> + kfree(ctx->mem_repair);
>> kfree(ctx->scrub);
>> kfree(ctx->dev.groups);
>> kfree(ctx);
>> @@ -611,6 +612,7 @@ int edac_dev_register(struct device *parent, char *name,
>> const struct attribute_group **ras_attr_groups;
>> struct edac_dev_data *dev_data;
>> struct edac_dev_feat_ctx *ctx;
>> + int mem_repair_cnt = 0;
>> int attr_gcnt = 0;
>> int scrub_cnt = 0;
>> int ret, feat;
>> @@ -628,6 +630,10 @@ int edac_dev_register(struct device *parent, char *name,
>> case RAS_FEAT_ECS:
>> attr_gcnt += ras_features[feat].ecs_info.num_media_frus;
>> break;
>> + case RAS_FEAT_MEM_REPAIR:
>> + attr_gcnt++;
>> + mem_repair_cnt++;
>> + break;
>> default:
>> return -EINVAL;
>> }
>> @@ -651,8 +657,17 @@ int edac_dev_register(struct device *parent, char *name,
>> }
>> }
>>
>> + if (mem_repair_cnt) {
>> + ctx->mem_repair = kcalloc(mem_repair_cnt, sizeof(*ctx->mem_repair), GFP_KERNEL);
>> + if (!ctx->mem_repair) {
>> + ret = -ENOMEM;
>> + goto data_mem_free;
>> + }
>> + }
>> +
>> attr_gcnt = 0;
>> scrub_cnt = 0;
>> + mem_repair_cnt = 0;
>> for (feat = 0; feat < num_features; feat++, ras_features++) {
>> switch (ras_features->ft_type) {
>> case RAS_FEAT_SCRUB:
>> @@ -686,6 +701,23 @@ int edac_dev_register(struct device *parent, char *name,
>>
>> attr_gcnt += ras_features->ecs_info.num_media_frus;
>> break;
>> + case RAS_FEAT_MEM_REPAIR:
>> + if (!ras_features->mem_repair_ops ||
>> + mem_repair_cnt != ras_features->instance)
>> + goto data_mem_free;
>> +
>> + dev_data = &ctx->mem_repair[mem_repair_cnt];
>> + dev_data->instance = mem_repair_cnt;
>> + dev_data->mem_repair_ops = ras_features->mem_repair_ops;
>> + dev_data->private = ras_features->ctx;
>> + ret = edac_mem_repair_get_desc(parent, &ras_attr_groups[attr_gcnt],
>> + ras_features->instance);
>> + if (ret)
>> + goto data_mem_free;
>> +
>> + mem_repair_cnt++;
>> + attr_gcnt++;
>> + break;
>> default:
>> ret = -EINVAL;
>> goto data_mem_free;
>> @@ -712,6 +744,7 @@ int edac_dev_register(struct device *parent, char *name,
>> return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
>>
>> data_mem_free:
>> + kfree(ctx->mem_repair);
>> kfree(ctx->scrub);
>> groups_free:
>> kfree(ras_attr_groups);
>> diff --git a/drivers/edac/mem_repair.c b/drivers/edac/mem_repair.c
>> new file mode 100755
>> index 000000000000..e7439fd26c41
>> --- /dev/null
>> +++ b/drivers/edac/mem_repair.c
>> @@ -0,0 +1,492 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * The generic EDAC memory repair driver is designed to control the memory
>> + * devices with memory repair features, such as Post Package Repair (PPR),
>> + * memory sparing etc. The common sysfs memory repair interface abstracts
>> + * the control of various arbitrary memory repair functionalities into a
>> + * unified set of functions.
>> + *
>> + * Copyright (c) 2024 HiSilicon Limited.
>> + */
>> +
>> +#include <linux/edac.h>
>> +
>> +enum edac_mem_repair_attributes {
>> + MEM_REPAIR_FUNCTION,
>> + MEM_REPAIR_PERSIST_MODE,
>> + MEM_REPAIR_DPA_SUPPORT,
>> + MEM_REPAIR_SAFE_IN_USE,
>> + MEM_REPAIR_HPA,
>> + MEM_REPAIR_MIN_HPA,
>> + MEM_REPAIR_MAX_HPA,
>> + MEM_REPAIR_DPA,
>> + MEM_REPAIR_MIN_DPA,
>> + MEM_REPAIR_MAX_DPA,
>> + MEM_REPAIR_NIBBLE_MASK,
>> + MEM_REPAIR_MIN_NIBBLE_MASK,
>> + MEM_REPAIR_MAX_NIBBLE_MASK,
>> + MEM_REPAIR_BANK_GROUP,
>> + MEM_REPAIR_MIN_BANK_GROUP,
>> + MEM_REPAIR_MAX_BANK_GROUP,
>> + MEM_REPAIR_BANK,
>> + MEM_REPAIR_MIN_BANK,
>> + MEM_REPAIR_MAX_BANK,
>> + MEM_REPAIR_RANK,
>> + MEM_REPAIR_MIN_RANK,
>> + MEM_REPAIR_MAX_RANK,
>> + MEM_REPAIR_ROW,
>> + MEM_REPAIR_MIN_ROW,
>> + MEM_REPAIR_MAX_ROW,
>> + MEM_REPAIR_COLUMN,
>> + MEM_REPAIR_MIN_COLUMN,
>> + MEM_REPAIR_MAX_COLUMN,
>> + MEM_REPAIR_CHANNEL,
>> + MEM_REPAIR_MIN_CHANNEL,
>> + MEM_REPAIR_MAX_CHANNEL,
>> + MEM_REPAIR_SUB_CHANNEL,
>> + MEM_REPAIR_MIN_SUB_CHANNEL,
>> + MEM_REPAIR_MAX_SUB_CHANNEL,
>> + MEM_DO_REPAIR,
>> + MEM_REPAIR_MAX_ATTRS
>> +};
>> +
>> +struct edac_mem_repair_dev_attr {
>> + struct device_attribute dev_attr;
>> + u8 instance;
>> +};
>> +
>> +struct edac_mem_repair_context {
>> + char name[EDAC_FEAT_NAME_LEN];
>> + struct edac_mem_repair_dev_attr mem_repair_dev_attr[MEM_REPAIR_MAX_ATTRS];
>> + struct attribute *mem_repair_attrs[MEM_REPAIR_MAX_ATTRS + 1];
>> + struct attribute_group group;
>> +};
>> +
>> +#define TO_MEM_REPAIR_DEV_ATTR(_dev_attr) \
>> + container_of(_dev_attr, struct edac_mem_repair_dev_attr, dev_attr)
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_SHOW(attrib, cb, type, format) \
>> +static ssize_t attrib##_show(struct device *ras_feat_dev, \
>> + struct device_attribute *attr, char *buf) \
>> +{ \
>> + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; \
>> + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \
>> + const struct edac_mem_repair_ops *ops = \
>> + ctx->mem_repair[inst].mem_repair_ops; \
>> + type data; \
>> + int ret; \
>> + \
>> + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, \
>> + &data); \
>> + if (ret) \
>> + return ret; \
>> + \
>> + return sysfs_emit(buf, format, data); \
>> +}
>> +
>> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(persist_mode, get_persist_mode, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa_support, get_dpa_support, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_safe_when_in_use, get_repair_safe_when_in_use, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(nibble_mask, get_nibble_mask, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_nibble_mask, get_min_nibble_mask, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_nibble_mask, get_max_nibble_mask, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank_group, get_min_bank_group, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank_group, get_max_bank_group, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(bank, get_bank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank, get_min_bank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank, get_max_bank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(rank, get_rank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_rank, get_min_rank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_rank, get_max_rank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(row, get_row, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_row, get_min_row, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_row, get_max_row, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(column, get_column, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_column, get_min_column, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_column, get_max_column, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(channel, get_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_channel, get_min_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_channel, get_max_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_sub_channel, get_min_sub_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_sub_channel, get_max_sub_channel, u32, "%u\n")
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_STORE(attrib, cb, type, conv_func) \
>> +static ssize_t attrib##_store(struct device *ras_feat_dev, \
>> + struct device_attribute *attr, \
>> + const char *buf, size_t len) \
>> +{ \
>> + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; \
>> + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \
>> + const struct edac_mem_repair_ops *ops = \
>> + ctx->mem_repair[inst].mem_repair_ops; \
>> + type data; \
>> + int ret; \
>> + \
>> + ret = conv_func(buf, 0, &data); \
>> + if (ret < 0) \
>> + return ret; \
>> + \
>> + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, \
>> + data); \
>> + if (ret) \
>> + return ret; \
>> + \
>> + return len; \
>> +}
>> +
>> +EDAC_MEM_REPAIR_ATTR_STORE(persist_mode, set_persist_mode, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(nibble_mask, set_nibble_mask, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(row, set_row, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(column, set_column, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul)
>> +
>> +#define EDAC_MEM_REPAIR_DO_OP(attrib, cb) \
>> +static ssize_t attrib##_store(struct device *ras_feat_dev, \
>> + struct device_attribute *attr, \
>> + const char *buf, size_t len) \
>> +{ \
>> + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; \
>> + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \
>> + const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops; \
>> + unsigned long data; \
>> + int ret; \
>> + \
>> + ret = kstrtoul(buf, 0, &data); \
>> + if (ret < 0) \
>> + return ret; \
>> + \
>> + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, data); \
>> + if (ret) \
>> + return ret; \
>> + \
>> + return len; \
>> +}
>> +
>> +EDAC_MEM_REPAIR_DO_OP(repair, do_repair)
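
The store handlers generated by the two macros above share one flow: parse the written string as an unsigned integer, hand the value to the driver callback, and return the full write length on success or a negative errno otherwise. A rough user-space sketch of that flow follows. `parse_ulong()` only approximates kstrtoul() with strtoul(), and `store_cb`, `attr_store()` and `remember()` are hypothetical names used for illustration, not part of the patch.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback type standing in for ops->set_*() / ops->do_repair(). */
typedef int (*store_cb)(void *drv_data, unsigned long val);

/* Approximation of kstrtoul(buf, 0, &val): the whole string must parse. */
static int parse_ulong(const char *buf, unsigned long *val)
{
	char *end;

	if (*buf == '\0' || *buf == '-')
		return -EINVAL;

	errno = 0;
	*val = strtoul(buf, &end, 0);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == buf)
		return -EINVAL;
	if (*end == '\n')	/* sysfs writes usually end with a newline */
		end++;

	return *end == '\0' ? 0 : -EINVAL;
}

/* Same shape as the generated attrib##_store() bodies. */
static long attr_store(store_cb cb, void *drv_data, const char *buf, size_t len)
{
	unsigned long val;
	int ret;

	ret = parse_ulong(buf, &val);
	if (ret < 0)
		return ret;

	ret = cb(drv_data, val);
	if (ret)
		return ret;

	return (long)len;	/* the whole write was consumed */
}

/* Hypothetical driver callback: remember the last value written. */
static int remember(void *drv_data, unsigned long val)
{
	*(unsigned long *)drv_data = val;
	return 0;
}
```

Calling attr_store(remember, &seen, "0x10\n", 5) stores 16 in seen and returns 5, while a non-numeric string is rejected before the callback runs.
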
>> +
>> +static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
>> +{
>> + struct device *ras_feat_dev = kobj_to_dev(kobj);
>> + struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr);
>> + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
>> + u8 inst = TO_MEM_REPAIR_DEV_ATTR(dev_attr)->instance;
>> + const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;
>> +
>> + switch (attr_id) {
>> + case MEM_REPAIR_FUNCTION:
>> + if (ops->get_repair_function)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_PERSIST_MODE:
>> + if (ops->get_persist_mode) {
>> + if (ops->set_persist_mode)
>> + return a->mode;
>> + else
>> + return 0444;
>> + }
>> + break;
>> + case MEM_REPAIR_DPA_SUPPORT:
>> + if (ops->get_dpa_support)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_SAFE_IN_USE:
>> + if (ops->get_repair_safe_when_in_use)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_HPA:
>> + if (ops->get_hpa) {
>> + if (ops->set_hpa)
>> + return a->mode;
>> + else
>> + return 0444;
>> + }
>> + break;
>> + case MEM_REPAIR_MIN_HPA:
>> + if (ops->get_min_hpa)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_MAX_HPA:
>> + if (ops->get_max_hpa)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_DPA:
>> + if (ops->get_dpa) {
>> + if (ops->set_dpa)
>> + return a->mode;
>> + else
>> + return 0444;
>> + }
>> + break;
>> + case MEM_REPAIR_MIN_DPA:
>> + if (ops->get_min_dpa)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_MAX_DPA:
>> + if (ops->get_max_dpa)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_NIBBLE_MASK:
>> + if (ops->get_nibble_mask) {
>> + if (ops->set_nibble_mask)
>> + return a->mode;
>> + else
>> + return 0444;
>> + }
>> + break;
>> + case MEM_REPAIR_MIN_NIBBLE_MASK:
>> + if (ops->get_min_nibble_mask)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_MAX_NIBBLE_MASK:
>> + if (ops->get_max_nibble_mask)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_BANK_GROUP:
>> + if (ops->get_bank_group) {
>> + if (ops->set_bank_group)
>> + return a->mode;
>> + else
>> + return 0444;
>> + }
>> + break;
>> + case MEM_REPAIR_MIN_BANK_GROUP:
>> + if (ops->get_min_bank_group)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_MAX_BANK_GROUP:
>> + if (ops->get_max_bank_group)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_BANK:
>> + if (ops->get_bank) {
>> + if (ops->set_bank)
>> + return a->mode;
>> + else
>> + return 0444;
>> + }
>> + break;
>> + case MEM_REPAIR_MIN_BANK:
>> + if (ops->get_min_bank)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_MAX_BANK:
>> + if (ops->get_max_bank)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_RANK:
>> + if (ops->get_rank) {
>> + if (ops->set_rank)
>> + return a->mode;
>> + else
>> + return 0444;
>> + }
>> + break;
>> + case MEM_REPAIR_MIN_RANK:
>> + if (ops->get_min_rank)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_MAX_RANK:
>> + if (ops->get_max_rank)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_ROW:
>> + if (ops->get_row) {
>> + if (ops->set_row)
>> + return a->mode;
>> + else
>> + return 0444;
>> + }
>> + break;
>> + case MEM_REPAIR_MIN_ROW:
>> + if (ops->get_min_row)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_MAX_ROW:
>> + if (ops->get_max_row)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_COLUMN:
>> + if (ops->get_column) {
>> + if (ops->set_column)
>> + return a->mode;
>> + else
>> + return 0444;
>> + }
>> + break;
>> + case MEM_REPAIR_MIN_COLUMN:
>> + if (ops->get_min_column)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_MAX_COLUMN:
>> + if (ops->get_max_column)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_CHANNEL:
>> + if (ops->get_channel) {
>> + if (ops->set_channel)
>> + return a->mode;
>> + else
>> + return 0444;
>> + }
>> + break;
>> + case MEM_REPAIR_MIN_CHANNEL:
>> + if (ops->get_min_channel)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_MAX_CHANNEL:
>> + if (ops->get_max_channel)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_SUB_CHANNEL:
>> + if (ops->get_sub_channel) {
>> + if (ops->set_sub_channel)
>> + return a->mode;
>> + else
>> + return 0444;
>> + }
>> + break;
>> + case MEM_REPAIR_MIN_SUB_CHANNEL:
>> + if (ops->get_min_sub_channel)
>> + return a->mode;
>> + break;
>> + case MEM_REPAIR_MAX_SUB_CHANNEL:
>> + if (ops->get_max_sub_channel)
>> + return a->mode;
>> + break;
>> + case MEM_DO_REPAIR:
>> + if (ops->do_repair)
>> + return a->mode;
>> + break;
>> + default:
>> + break;
>> + }
>> +
>> + return 0;
>> +}
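
The read-write cases in mem_repair_attr_visible() above all apply the same rule: without a getter the attribute is hidden, with only a getter it is exposed read-only (0444), and with both a getter and a setter it keeps its declared mode. A small user-space sketch of that rule is below. `struct attr_ops`, `rw_attr_mode()` and the dummy callbacks are hypothetical names for illustration only.

```c
#include <stddef.h>

/* Hypothetical stand-in for one get/set pair of struct edac_mem_repair_ops. */
struct attr_ops {
	int (*get)(unsigned long *val);
	int (*set)(unsigned long val);
};

/*
 * Visibility rule for a read-write attribute:
 *   no getter         -> 0 (attribute hidden)
 *   getter only       -> 0444 (read-only)
 *   getter and setter -> the declared mode (for example 0644)
 */
static unsigned int rw_attr_mode(const struct attr_ops *ops,
				 unsigned int declared_mode)
{
	if (!ops->get)
		return 0;

	return ops->set ? declared_mode : 0444;
}

/* Hypothetical callbacks used only to populate the ops in an example. */
static int dummy_get(unsigned long *val)
{
	*val = 0;
	return 0;
}

static int dummy_set(unsigned long val)
{
	(void)val;
	return 0;
}
```

Note that an ops table with a setter but no getter still yields 0, which matches the quoted switch: every read-write case tests the getter first.
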
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_RO(_name, _instance) \
>> + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RO(_name), \
>> + .instance = _instance })
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_WO(_name, _instance) \
>> + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_WO(_name), \
>> + .instance = _instance })
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_RW(_name, _instance) \
>> + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RW(_name), \
>> + .instance = _instance })
>> +
>> +static int mem_repair_create_desc(struct device *dev,
>> + const struct attribute_group **attr_groups,
>> + u8 instance)
>> +{
>> + struct edac_mem_repair_context *ctx;
>> + struct attribute_group *group;
>> + int i;
>> + struct edac_mem_repair_dev_attr dev_attr[] = {
>> + [MEM_REPAIR_FUNCTION] = EDAC_MEM_REPAIR_ATTR_RO(repair_function,
>> + instance),
>> + [MEM_REPAIR_PERSIST_MODE] =
>> + EDAC_MEM_REPAIR_ATTR_RW(persist_mode, instance),
>> + [MEM_REPAIR_DPA_SUPPORT] =
>> + EDAC_MEM_REPAIR_ATTR_RO(dpa_support, instance),
>> + [MEM_REPAIR_SAFE_IN_USE] =
>> + EDAC_MEM_REPAIR_ATTR_RO(repair_safe_when_in_use,
>> + instance),
>> + [MEM_REPAIR_HPA] = EDAC_MEM_REPAIR_ATTR_RW(hpa, instance),
>> + [MEM_REPAIR_MIN_HPA] = EDAC_MEM_REPAIR_ATTR_RO(min_hpa, instance),
>> + [MEM_REPAIR_MAX_HPA] = EDAC_MEM_REPAIR_ATTR_RO(max_hpa, instance),
>> + [MEM_REPAIR_DPA] = EDAC_MEM_REPAIR_ATTR_RW(dpa, instance),
>> + [MEM_REPAIR_MIN_DPA] = EDAC_MEM_REPAIR_ATTR_RO(min_dpa, instance),
>> + [MEM_REPAIR_MAX_DPA] = EDAC_MEM_REPAIR_ATTR_RO(max_dpa, instance),
>> + [MEM_REPAIR_NIBBLE_MASK] =
>> + EDAC_MEM_REPAIR_ATTR_RW(nibble_mask, instance),
>> + [MEM_REPAIR_MIN_NIBBLE_MASK] =
>> + EDAC_MEM_REPAIR_ATTR_RO(min_nibble_mask, instance),
>> + [MEM_REPAIR_MAX_NIBBLE_MASK] =
>> + EDAC_MEM_REPAIR_ATTR_RO(max_nibble_mask, instance),
>> + [MEM_REPAIR_BANK_GROUP] =
>> + EDAC_MEM_REPAIR_ATTR_RW(bank_group, instance),
>> + [MEM_REPAIR_MIN_BANK_GROUP] =
>> + EDAC_MEM_REPAIR_ATTR_RO(min_bank_group, instance),
>> + [MEM_REPAIR_MAX_BANK_GROUP] =
>> + EDAC_MEM_REPAIR_ATTR_RO(max_bank_group, instance),
>> + [MEM_REPAIR_BANK] = EDAC_MEM_REPAIR_ATTR_RW(bank, instance),
>> + [MEM_REPAIR_MIN_BANK] = EDAC_MEM_REPAIR_ATTR_RO(min_bank, instance),
>> + [MEM_REPAIR_MAX_BANK] = EDAC_MEM_REPAIR_ATTR_RO(max_bank, instance),
>> + [MEM_REPAIR_RANK] = EDAC_MEM_REPAIR_ATTR_RW(rank, instance),
>> + [MEM_REPAIR_MIN_RANK] = EDAC_MEM_REPAIR_ATTR_RO(min_rank, instance),
>> + [MEM_REPAIR_MAX_RANK] = EDAC_MEM_REPAIR_ATTR_RO(max_rank, instance),
>> + [MEM_REPAIR_ROW] = EDAC_MEM_REPAIR_ATTR_RW(row, instance),
>> + [MEM_REPAIR_MIN_ROW] = EDAC_MEM_REPAIR_ATTR_RO(min_row, instance),
>> + [MEM_REPAIR_MAX_ROW] = EDAC_MEM_REPAIR_ATTR_RO(max_row, instance),
>> + [MEM_REPAIR_COLUMN] = EDAC_MEM_REPAIR_ATTR_RW(column, instance),
>> + [MEM_REPAIR_MIN_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(min_column, instance),
>> + [MEM_REPAIR_MAX_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(max_column, instance),
>> + [MEM_REPAIR_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RW(channel, instance),
>> + [MEM_REPAIR_MIN_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(min_channel, instance),
>> + [MEM_REPAIR_MAX_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(max_channel, instance),
>> + [MEM_REPAIR_SUB_CHANNEL] =
>> + EDAC_MEM_REPAIR_ATTR_RW(sub_channel, instance),
>> + [MEM_REPAIR_MIN_SUB_CHANNEL] =
>> + EDAC_MEM_REPAIR_ATTR_RO(min_sub_channel, instance),
>> + [MEM_REPAIR_MAX_SUB_CHANNEL] =
>> + EDAC_MEM_REPAIR_ATTR_RO(max_sub_channel, instance),
>> + [MEM_DO_REPAIR] = EDAC_MEM_REPAIR_ATTR_WO(repair, instance)
>> + };
>> +
>> + ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL);
>> + if (!ctx)
>> + return -ENOMEM;
>> +
>> + for (i = 0; i < MEM_REPAIR_MAX_ATTRS; i++) {
>> + memcpy(&ctx->mem_repair_dev_attr[i].dev_attr,
>> + &dev_attr[i], sizeof(dev_attr[i]));
>> + ctx->mem_repair_attrs[i] =
>> + &ctx->mem_repair_dev_attr[i].dev_attr.attr;
>> + }
>> +
>> + sprintf(ctx->name, "%s%d", "mem_repair", instance);
>> + group = &ctx->group;
>> + group->name = ctx->name;
>> + group->attrs = ctx->mem_repair_attrs;
>> + group->is_visible = mem_repair_attr_visible;
>> + attr_groups[0] = group;
>> +
>> + return 0;
>> +}
>> +
>> +/**
>> + * edac_mem_repair_get_desc - get EDAC memory repair descriptors
>> + * @dev: client device with memory repair feature
>> + * @attr_groups: pointer to attribute group container
>> + * @instance: device's memory repair instance number.
>> + *
>> + * Return:
>> + * * %0 - Success.
>> + * * %-EINVAL - Invalid parameters passed.
>> + * * %-ENOMEM - Dynamic memory allocation failed.
>> + */
>> +int edac_mem_repair_get_desc(struct device *dev,
>> + const struct attribute_group **attr_groups, u8 instance)
>> +{
>> + if (!dev || !attr_groups)
>> + return -EINVAL;
>> +
>> + return mem_repair_create_desc(dev, attr_groups, instance);
>> +}
>> diff --git a/include/linux/edac.h b/include/linux/edac.h
>> index 979e91426701..5d07192bf1a7 100644
>> --- a/include/linux/edac.h
>> +++ b/include/linux/edac.h
>> @@ -668,6 +668,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
>> enum edac_dev_feat {
>> RAS_FEAT_SCRUB,
>> RAS_FEAT_ECS,
>> + RAS_FEAT_MEM_REPAIR,
>> RAS_FEAT_MAX
>> };
>>
>> @@ -729,11 +730,147 @@ int edac_ecs_get_desc(struct device *ecs_dev,
>> const struct attribute_group **attr_groups,
>> u16 num_media_frus);
>>
>> +enum edac_mem_repair_function {
>> + EDAC_SOFT_PPR,
>> + EDAC_HARD_PPR,
>> + EDAC_CACHELINE_MEM_SPARING,
>> + EDAC_ROW_MEM_SPARING,
>> + EDAC_BANK_MEM_SPARING,
>> + EDAC_RANK_MEM_SPARING,
>> +};
>> +
>> +enum edac_mem_repair_persist_mode {
>> + EDAC_MEM_REPAIR_SOFT, /* soft memory repair */
>> + EDAC_MEM_REPAIR_HARD, /* hard memory repair */
>> +};
>> +
>> +enum edac_mem_repair_cmd {
>> + EDAC_DO_MEM_REPAIR = 1,
>> +};
>> +
>> +/**
>> + * struct edac_mem_repair_ops - memory repair operations
>> + * (all elements are optional except do_repair, set_hpa/set_dpa)
>> + * @get_repair_function: get the memory repair function, listed in
>> + * enum edac_mem_repair_function.
>> + * @get_persist_mode: get the current persist mode. Persist repair modes supported
>> + * in the device is based on the memory repair function which is
>> + * temporary or permanent and is lost with a power cycle.
>> + * EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
>> + * EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
>> + * All other values are reserved.
>> + * @set_persist_mode: set the persist mode of the memory repair instance.
>> + * @get_dpa_support: get dpa support flag. In some states of system configuration
>> + * (e.g. before address decoders have been configured), memory devices
>> + * (e.g. CXL) may not have an active mapping in the main host address
>> + * physical address map. As such, the memory to repair must be identified
>> + * by a device specific physical addressing scheme using a device physical
>> + * address(DPA). The DPA and other control attributes to use for the
>> + * dry_run and repair operations will be presented in related error records.
>> + * @get_repair_safe_when_in_use: get whether memory media is accessible and
>> + * data is retained during repair operation.
>> + * @get_hpa: get current host physical address (HPA).
>> + * @set_hpa: set host physical address (HPA) of memory to repair.
>> + * @get_min_hpa: get the minimum supported host physical address (HPA).
>> + * @get_max_hpa: get the maximum supported host physical address (HPA).
>> + * @get_dpa: get current device physical address (DPA).
>> + * @set_dpa: set device physical address (DPA) of memory to repair.
>> + * @get_min_dpa: get the minimum supported device physical address (DPA).
>> + * @get_max_dpa: get the maximum supported device physical address (DPA).
>> + * @get_nibble_mask: get current nibble mask.
>> + * @set_nibble_mask: set nibble mask of memory to repair.
>> + * @get_min_nibble_mask: get the minimum supported nibble mask.
>> + * @get_max_nibble_mask: get the maximum supported nibble mask.
>> + * @get_bank_group: get current bank group.
>> + * @set_bank_group: set bank group of memory to repair.
>> + * @get_min_bank_group: get the minimum supported bank group.
>> + * @get_max_bank_group: get the maximum supported bank group.
>> + * @get_bank: get current bank.
>> + * @set_bank: set bank of memory to repair.
>> + * @get_min_bank: get the minimum supported bank.
>> + * @get_max_bank: get the maximum supported bank.
>> + * @get_rank: get current rank.
>> + * @set_rank: set rank of memory to repair.
>> + * @get_min_rank: get the minimum supported rank.
>> + * @get_max_rank: get the maximum supported rank.
>> + * @get_row: get current row.
>> + * @set_row: set row of memory to repair.
>> + * @get_min_row: get the minimum supported row.
>> + * @get_max_row: get the maximum supported row.
>> + * @get_column: get current column.
>> + * @set_column: set column of memory to repair.
>> + * @get_min_column: get the minimum supported column.
>> + * @get_max_column: get the maximum supported column.
>> + * @get_channel: get current channel.
>> + * @set_channel: set channel of memory to repair.
>> + * @get_min_channel: get the minimum supported channel.
>> + * @get_max_channel: get the maximum supported channel.
>> + * @get_sub_channel: get current sub channel.
>> + * @set_sub_channel: set sub channel of memory to repair.
>> + * @get_min_sub_channel: get the minimum supported sub channel.
>> + * @get_max_sub_channel: get the maximum supported sub channel.
>> + * @do_repair: Issue memory repair operation for the HPA/DPA and
>> + * other control attributes set for the memory to repair.
>> + */
>> +struct edac_mem_repair_ops {
>> + int (*get_repair_function)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_persist_mode)(struct device *dev, void *drv_data, u32 *mode);
>> + int (*set_persist_mode)(struct device *dev, void *drv_data, u32 mode);
>> + int (*get_dpa_support)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_repair_safe_when_in_use)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_hpa)(struct device *dev, void *drv_data, u64 *hpa);
>> + int (*set_hpa)(struct device *dev, void *drv_data, u64 hpa);
>> + int (*get_min_hpa)(struct device *dev, void *drv_data, u64 *hpa);
>> + int (*get_max_hpa)(struct device *dev, void *drv_data, u64 *hpa);
>> + int (*get_dpa)(struct device *dev, void *drv_data, u64 *dpa);
>> + int (*set_dpa)(struct device *dev, void *drv_data, u64 dpa);
>> + int (*get_min_dpa)(struct device *dev, void *drv_data, u64 *dpa);
>> + int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa);
>> + int (*get_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
>> + int (*set_nibble_mask)(struct device *dev, void *drv_data, u64 val);
>> + int (*get_min_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
>> + int (*get_max_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
>> + int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val);
>> + int (*set_bank_group)(struct device *dev, void *drv_data, u32 val);
>> + int (*get_min_bank_group)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_max_bank_group)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_bank)(struct device *dev, void *drv_data, u32 *val);
>> + int (*set_bank)(struct device *dev, void *drv_data, u32 val);
>> + int (*get_min_bank)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_max_bank)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_rank)(struct device *dev, void *drv_data, u32 *val);
>> + int (*set_rank)(struct device *dev, void *drv_data, u32 val);
>> + int (*get_min_rank)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_max_rank)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_row)(struct device *dev, void *drv_data, u64 *val);
>> + int (*set_row)(struct device *dev, void *drv_data, u64 val);
>> + int (*get_min_row)(struct device *dev, void *drv_data, u64 *val);
>> + int (*get_max_row)(struct device *dev, void *drv_data, u64 *val);
>> + int (*get_column)(struct device *dev, void *drv_data, u32 *val);
>> + int (*set_column)(struct device *dev, void *drv_data, u32 val);
>> + int (*get_min_column)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_max_column)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_channel)(struct device *dev, void *drv_data, u32 *val);
>> + int (*set_channel)(struct device *dev, void *drv_data, u32 val);
>> + int (*get_min_channel)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_max_channel)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val);
>> + int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val);
>> + int (*get_min_sub_channel)(struct device *dev, void *drv_data, u32 *val);
>> + int (*get_max_sub_channel)(struct device *dev, void *drv_data, u32 *val);
>> + int (*do_repair)(struct device *dev, void *drv_data, u32 val);
>> +};
>> +
>> +int edac_mem_repair_get_desc(struct device *dev,
>> + const struct attribute_group **attr_groups,
>> + u8 instance);
>> +
>> /* EDAC device feature information structure */
>> struct edac_dev_data {
>> union {
>> const struct edac_scrub_ops *scrub_ops;
>> const struct edac_ecs_ops *ecs_ops;
>> + const struct edac_mem_repair_ops *mem_repair_ops;
>> };
>> u8 instance;
>> void *private;
>> @@ -744,6 +881,7 @@ struct edac_dev_feat_ctx {
>> void *private;
>> struct edac_dev_data *scrub;
>> struct edac_dev_data ecs;
>> + struct edac_dev_data *mem_repair;
>> };
>>
>> struct edac_dev_feature {
>> @@ -752,6 +890,7 @@ struct edac_dev_feature {
>> union {
>> const struct edac_scrub_ops *scrub_ops;
>> const struct edac_ecs_ops *ecs_ops;
>> + const struct edac_mem_repair_ops *mem_repair_ops;
>> };
>> void *ctx;
>> struct edac_ecs_ex_info ecs_info;
>
>Thanks,
>Mauro
Thanks,
Shiju