Message-ID: <2a4f1e0c8c9543caa71a0cb50e65e256@huawei.com>
Date: Wed, 31 Jul 2024 11:28:35 +0000
From: Shiju Jose <shiju.jose@...wei.com>
To: Fan Ni <nifan.cxl@...il.com>
CC: "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-cxl@...r.kernel.org" <linux-cxl@...r.kernel.org>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "bp@...en8.de" <bp@...en8.de>,
"tony.luck@...el.com" <tony.luck@...el.com>, "rafael@...nel.org"
<rafael@...nel.org>, "lenb@...nel.org" <lenb@...nel.org>,
"mchehab@...nel.org" <mchehab@...nel.org>, "dan.j.williams@...el.com"
<dan.j.williams@...el.com>, "dave@...olabs.net" <dave@...olabs.net>,
"Jonathan Cameron" <jonathan.cameron@...wei.com>, "dave.jiang@...el.com"
<dave.jiang@...el.com>, "alison.schofield@...el.com"
<alison.schofield@...el.com>, "vishal.l.verma@...el.com"
<vishal.l.verma@...el.com>, "ira.weiny@...el.com" <ira.weiny@...el.com>,
"david@...hat.com" <david@...hat.com>, "Vilas.Sridharan@....com"
<Vilas.Sridharan@....com>, "leo.duran@....com" <leo.duran@....com>,
"Yazen.Ghannam@....com" <Yazen.Ghannam@....com>, "rientjes@...gle.com"
<rientjes@...gle.com>, "jiaqiyan@...gle.com" <jiaqiyan@...gle.com>,
"Jon.Grimm@....com" <Jon.Grimm@....com>, "dave.hansen@...ux.intel.com"
<dave.hansen@...ux.intel.com>, "naoya.horiguchi@....com"
<naoya.horiguchi@....com>, "james.morse@....com" <james.morse@....com>,
"jthoughton@...gle.com" <jthoughton@...gle.com>, "somasundaram.a@....com"
<somasundaram.a@....com>, "erdemaktas@...gle.com" <erdemaktas@...gle.com>,
"pgonda@...gle.com" <pgonda@...gle.com>, "duenwen@...gle.com"
<duenwen@...gle.com>, "mike.malvestuto@...el.com"
<mike.malvestuto@...el.com>, "gthelen@...gle.com" <gthelen@...gle.com>,
"wschwartz@...erecomputing.com" <wschwartz@...erecomputing.com>,
"dferguson@...erecomputing.com" <dferguson@...erecomputing.com>,
"wbs@...amperecomputing.com" <wbs@...amperecomputing.com>, tanxiaofei
<tanxiaofei@...wei.com>, "Zengtao (B)" <prime.zeng@...ilicon.com>, "Roberto
Sassu" <roberto.sassu@...wei.com>, "kangkang.shen@...urewei.com"
<kangkang.shen@...urewei.com>, wanghuiqiang <wanghuiqiang@...wei.com>,
Linuxarm <linuxarm@...wei.com>
Subject: RE: [PATCH v10 00/11] EDAC: Scrub: introduce generic EDAC RAS control
feature driver + CXL/ACPI-RAS2 drivers
>-----Original Message-----
>From: Fan Ni <nifan.cxl@...il.com>
>Sent: 30 July 2024 22:01
>To: Shiju Jose <shiju.jose@...wei.com>
>Subject: Re: [PATCH v10 00/11] EDAC: Scrub: introduce generic EDAC RAS
>control feature driver + CXL/ACPI-RAS2 drivers
>
>On Fri, Jul 26, 2024 at 05:05:44PM +0100, shiju.jose@...wei.com wrote:
>> From: Shiju Jose <shiju.jose@...wei.com>
>>
>Hi Shiju,
>FYI. Not sure if I missed something, but I hit some issue when compiling the
>patch set.
>
>I tried to apply this patch set on top of the fixes branch (tag:
>cxl-fixes-6.10-rc7, origin/fixes) and compile it, but the build fails with
>the following errors.
>
>fan@DT ~/c/linux-fixes (fixes)> make -j 16
>mkdir -p /home/fan/cxl/linux-fixes/tools/objtool && make O=/home/fan/cxl/linux-fixes subdir=tools/objtool --no-print-directory -C objtool
> INSTALL libsubcmd_headers
> CALL scripts/checksyscalls.sh
> CC drivers/ras/acpi_ras2.o
>drivers/ras/acpi_ras2.c: In function ‘ras2_update_patrol_scrub_params_cache’:
>drivers/ras/acpi_ras2.c:64:37: error: implicit declaration of function ‘FIELD_GET’ [-Werror=implicit-function-declaration]
> 64 | ras2_ctx->min_scrub_cycle = FIELD_GET(RAS2_PATROL_SCRUB_MIN_SCHRS_OUT_MASK,
> | ^~~~~~~~~
>drivers/ras/acpi_ras2.c: In function ‘ras2_hw_scrub_set_enabled_bg’:
>drivers/ras/acpi_ras2.c:216:50: error: implicit declaration of function ‘FIELD_PREP’ [-Werror=implicit-function-declaration]
> 216 | ps_sm->params.scrub_params_in |= FIELD_PREP(RAS2_PATROL_SCRUB_SCHRS_IN_MASK,
> | ^~~~~~~~~~
>cc1: all warnings being treated as errors
>make[4]: *** [scripts/Makefile.build:244: drivers/ras/acpi_ras2.o] Error 1
>make[3]: *** [scripts/Makefile.build:485: drivers/ras] Error 2
>make[3]: *** Waiting for unfinished jobs....
>make[2]: *** [scripts/Makefile.build:485: drivers] Error 2
>make[1]: *** [/home/fan/cxl/linux-fixes/Makefile:1934: .] Error 2
>make: *** [Makefile:240: __sub-make] Error 2
>
>
>It seems we need to include <linux/bitfield.h>. I fixed the build with the
>following change:
>
>fan@DT ~/c/linux-fixes (fixes)> git diff
>diff --git a/drivers/ras/acpi_ras2.c b/drivers/ras/acpi_ras2.c
>index b2381d5b37d0..b4d229e6451f 100644
>--- a/drivers/ras/acpi_ras2.c
>+++ b/drivers/ras/acpi_ras2.c
>@@ -11,6 +11,7 @@
> #include <linux/edac_ras_feature.h>
> #include <linux/platform_device.h>
> #include <acpi/ras2_acpi.h>
>+#include <linux/bitfield.h>
>
> #define RAS2_DEV_NUM_RAS_FEATURES 1
>
>
>
>Fan
Hi Fan,
Thanks for looking into this issue and providing the fix.
The series was developed and tested on mainline v6.10-rc3 with an arm64 build.
I checked cxl-fixes-6.10-rc7 and mainline v6.10-rc7 for this issue.
The arm64 build succeeds; however, the x86 build reproduces the errors
you reported.
I will incorporate the fix in the next version.
Thanks,
Shiju
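
For context on the fix above: FIELD_GET() and FIELD_PREP() are bitfield helper
macros declared in <linux/bitfield.h>, so a file that uses them without
including that header builds only if some other header happens to pull it in.
That is probably what differs between the arm64 and x86 builds here, though
this was not verified. Below is a minimal userspace sketch of what the two
helpers compute. The demo_* names and the mask value are made up for
illustration; they are not the kernel macros and not the real definition of
RAS2_PATROL_SCRUB_MIN_SCHRS_OUT_MASK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mask: bits 15:8 of a 32-bit register. The real value of
 * RAS2_PATROL_SCRUB_MIN_SCHRS_OUT_MASK comes from the series, not from here. */
#define DEMO_MIN_SCHRS_OUT_MASK 0x0000ff00u

/* Lowest set bit of a contiguous mask, i.e. 1 << (field shift). */
static inline uint32_t demo_mask_lsb(uint32_t mask)
{
	assert(mask != 0);
	return mask & (~mask + 1u);
}

/* Userspace stand-in for FIELD_GET(): extract the field selected by mask. */
static inline uint32_t demo_field_get(uint32_t mask, uint32_t reg)
{
	return (reg & mask) / demo_mask_lsb(mask);
}

/* Userspace stand-in for FIELD_PREP(): move val into the field position,
 * dropping any bits that do not fit in the field. */
static inline uint32_t demo_field_prep(uint32_t mask, uint32_t val)
{
	return (val * demo_mask_lsb(mask)) & mask;
}
```

With the assumed mask, demo_field_get(DEMO_MIN_SCHRS_OUT_MASK, 0x1234) gives
0x12 and demo_field_prep(DEMO_MIN_SCHRS_OUT_MASK, 0x12) gives 0x1200. The
kernel macros additionally validate the mask at compile time, which this
runtime sketch does not attempt.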
>
>> Previously known as "ras: scrub: introduce subsystem + CXL/ACPI-RAS2 drivers".
>>
>> EDAC based Subsystem for controlling RAS Features
>> =================================================
>> The proposed EDAC based subsystem controls RAS features and
>> exposes each feature's control attributes to userspace in sysfs.
>> Some Examples:
>> - Scrub control
>> - Error Check Scrub (ECS) control
>> - ACPI RAS2 features
>> - ACPI Address Range Scrubbing (ARS)
>> - Post Package Repair (PPR) etc.
>>
>> High level design is illustrated in the following diagram.
>>
>> _______________________________________________
>> | Userspace - Rasdaemon |
>> | ____________ |
>> | | RAS CXL | _____________ |
>> | | Err Handler|----->| | |
>> | |____________| | RAS Dynamic | |
>> | ____________ | Scrub | |
>> | | RAS Memory |----->| Controller | |
>> | | Err Handler| |_____________| |
>> | |____________| | |
>> |__________________________|____________________|
>> |
>> |
>> _______________________________|______________________________
>> | Kernel EDAC based SubSystem | for RAS Features Control |
>> | ______________________________|____________________________ |
>> || EDAC Core Sysfs EDAC| Bus | |
>> || __________________________|_______ _____________ | |
>> || |/sys/bus/edac/devices/<dev>/scrub/| | EDAC Device | | |
>> || |/sys/bus/edac/devices/<dev>/ecs*/ |<->| EDAC MC | | |
>> || |/sys/bus/edac/devices/<dev>/ars/ | | EDAC Sysfs | | |
>> || |/sys/bus/edac/devices/<dev>/ppr/ | | EDAC Module | | |
>> || |__________________________________| |_____________| | |
>> || | EDAC Bus | |
>> || Get | | |
>> || __________ Feature's | __________ | |
>> || | |Descs _________|______ | | | |
>> || |EDAC Scrub|<-----| EDAC RAS |---->| EDAC ARS | | |
>> || |__________| |Control Feature | |__________| | |
>> || __________ | Driver | __________ | |
>> || | |<-----|________________|---->| | | |
>> || |EDAC ECS | Register RAS | Features | EDAC PPR | | |
>> || |__________| | |__________| | |
>> || ______________________|___________________ | |
>> ||_________|_____________|_____________|____________|________| |
>> | _______|____ _____|______ ____|______ ___|_____ |
>> | | | | CXL Mem | | | | | |
>> | | ACPI RAS2 | | Driver | | ACPI ARS | | PPR | |
>> | | Driver | | Scrub,ECS | | Driver | | Driver | |
>> | |____________| |___________| |___________| |_________| |
>> | | | | | |
>> |________|______________|______________|___________|___________|
>> | | | |
>> _______|______________|______________|___________|___________
>> | __|______________|_ ____________|___________|_____ |
>> | | | |
>> | | Platform HW and Firmware | |
>> | |__________________________________________________| |
>> |_____________________________________________________________|
>>
>> 1. EDAC Features components - Create feature specific descriptors.
>> 2. EDAC RAS Feature driver - Gets the feature's attr descriptors from the
>> EDAC RAS feature component, registers the device's RAS features with
>> the EDAC bus and exposes the feature's sysfs attributes under the sysfs
>> EDAC bus.
>> 3. RAS dynamic scrub controller - Userspace sample module added in the
>> rasdaemon to start scrubbing when an excessive number of related errors
>> is reported in a short span of time.
>>
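The trigger condition in item 3 above (the RAS dynamic scrub controller) can
be sketched as a sliding-window check. This is only an illustration under
assumed policy values, not the rasdaemon code: the demo_* names, the 60 second
window and the threshold of 4 errors are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_WINDOW_SECS   60u /* hypothetical "short span of time" */
#define DEMO_ERR_THRESHOLD 4u  /* hypothetical "excess number" of errors */

struct demo_err_window {
	uint64_t stamps[DEMO_ERR_THRESHOLD]; /* ring buffer of recent error times */
	size_t next;                         /* slot to overwrite next */
	size_t count;                        /* valid entries, at most the threshold */
};

/*
 * Record one error seen at time 'now' (seconds, assumed non-decreasing).
 * Returns true when the most recent DEMO_ERR_THRESHOLD errors all fall
 * within DEMO_WINDOW_SECS, i.e. when a scrub should be requested.
 */
static bool demo_err_window_record(struct demo_err_window *w, uint64_t now)
{
	assert(w != NULL);

	w->stamps[w->next] = now;
	w->next = (w->next + 1) % DEMO_ERR_THRESHOLD;
	if (w->count < DEMO_ERR_THRESHOLD)
		w->count++;
	if (w->count < DEMO_ERR_THRESHOLD)
		return false;

	/* Once the buffer is full, w->next indexes the oldest stored stamp. */
	return now - w->stamps[w->next] <= DEMO_WINDOW_SECS;
}
```

A caller would zero-initialise one struct demo_err_window per scrub-capable
device and, when the function returns true, start scrubbing through the sysfs
controls described in this cover letter.
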
>> The added EDAC feature specific components (e.g. EDAC scrub, EDAC ECS,
>> EDAC PPR etc) do callbacks to the parent driver (e.g. CXL driver,
>> ACPI RAS driver etc) for the controls rather than just letting the
>> caller deal with it because of the following reasons.
>> 1. Enforces a common API across multiple implementations. We could do that
>> via review, but that has not generally gone well in the long run for
>> subsystems that have done it (several have later moved to callback
>> and feature list based approaches).
>> 2. Gives a path for 'intercepting' in the EDAC feature driver.
>> An example for this is that we could intercept PPR repair calls
>> and sanity check that the memory in question is offline before
>> passing back to the underlying code. Sure we could rely on doing
>> that via some additional calls from the parent driver, but the
>> ABI will get messier.
>> 3. (Speculative) we may get in kernel users of some features in the
>> long run.
>>
>> More details of the common RAS features are described in the following
>> sections.
>>
>> Memory Scrubbing
>> ================
>> Increasing DRAM size and cost has made memory subsystem reliability an
>> important concern. These modules are used where potentially corrupted
>> data could cause expensive or fatal issues. Memory errors are one of
>> the top hardware failures that cause server and workload crashes.
>>
>> Memory scrub is a feature where an ECC engine reads data from each
>> memory media location, corrects with an ECC if necessary and writes
>> the corrected data back to the same memory media location.
>>
>> The memory DIMMs could be scrubbed at a configurable rate to detect
>> uncorrected memory errors and attempt to recover from the detected memory
>> errors, providing the following benefits:
>> - Proactively scrubbing memory DIMMs reduces the chance of a correctable
>> error becoming uncorrectable.
>> - Once detected, uncorrected errors caught in unallocated memory pages are
>> isolated and prevented from being allocated to an application or the OS.
>> - The probability of software/hardware products encountering memory
>> errors is reduced.
>> Some details of background can be found in Reference [5].
>>
>> There are 2 types of memory scrubbing:
>> 1. Background (patrol) scrubbing of the RAM whilst the RAM is otherwise
>> idle.
>> 2. On-demand scrubbing for a specific address range/region of memory.
>>
>> There are several types of interfaces to HW memory scrubbers
>> identified such as ACPI NVDIMM ARS(Address Range Scrub), CXL memory
>> device patrol scrub, CXL DDR5 ECS, ACPI RAS2 memory scrubbing.
>>
>> The scrub control varies between different memory scrubbers. To allow
>> for standard userspace tooling there is a need to present these
>> controls with a standard ABI.
>>
>> Introduce generic memory EDAC scrub control which allows user to
>> control underlying scrubbers in the system via generic sysfs scrub
>> control interface.
>>
>> Use case of common scrub control feature
>> ========================================
>> 1. There are several types of interfaces to HW memory scrubbers identified
>> such as ACPI NVDIMM ARS(Address Range Scrub), CXL memory device patrol
>> scrub, CXL DDR5 ECS, ACPI RAS2 memory scrubbing features and software
>> based memory scrubber(discussed in the community Reference [5]).
>> Also some scrubbers support controlling (background) patrol scrubbing
>> (ACPI RAS2, CXL) and/or on-demand scrubbing(ACPI RAS2, ACPI ARS).
>> However the scrub controls vary between memory scrubbers. Thus there
>> is a requirement for standard generic sysfs scrub controls exposed
>> to the userspace for the seamless control of the HW/SW scrubbers in
>> the system by admin/scripts/tools etc.
>> 2. Scrub controls in user space allow the user to disable the scrubbing
>> in case disabling of the background patrol scrubbing or changing the
>> scrub rate are needed for other purposes such as performance-aware
>> operations which requires the background operations to be turned off
>> or reduced.
>> 3. Allows performing on-demand scrubbing for a specific address range if
>> supported by the scrubber.
>> 4. User space tools scrub the memory DIMMs regularly at a configurable
>> scrub rate using the sysfs scrub controls discussed, which helps
>> - to detect uncorrectable memory errors early, before the user accesses
>> the memory, which helps to recover from the detected memory errors.
>> - to reduce the chance of a correctable error becoming uncorrectable.
>> 5. Policy control for hotplugged memory. There is not necessarily a system
>> wide BIOS or similar in the loop to control the scrub settings on a CXL
>> device that wasn't there at boot. What that setting should be is a policy
>> decision as we are trading off reliability vs performance - hence it should
>> be in control of userspace. As such, 'an' interface is needed. Seems more
>> sensible to try and unify it with other similar interfaces than spin
>> yet another one.
>>
>> The draft version of userspace code for dynamic scrub control, based
>> on frequency of memory errors reported to the userspace, is added in
>> rasdaemon and enabled, tested for CXL device based patrol scrubbing
>> feature and ACPI RAS2 based scrubbing feature.
>>
>> https://github.com/shijujose4/rasdaemon/tree/scrub_control_6_june_2024
>>
>> Comparison of scrubbing features
>> ================================
>> ................................................................
>> . . ACPI . CXL patrol. CXL ECS . ARS .
>> . Name . RAS2 . scrub . . .
>> ................................................................
>> . . . . . .
>> . On-demand . Supported . No . No . Supported .
>> . Scrubbing . . . . .
>> . . . . . .
>> ................................................................
>> . . . . . .
>> . Background . Supported . Supported . Supported . No .
>> . scrubbing . . . . .
>> . . . . . .
>> ................................................................
>> . . . . . .
>> . Mode of . Scrub ctrl. per device. per memory. Unknown .
>> . scrubbing . per NUMA . . media . .
>> . . domain. . . . .
>> ................................................................
>> . . . . . .
>> . Query scrub . Supported . Supported . Supported . Supported .
>> . capabilities . . . . .
>> . . . . . .
>> ................................................................
>> . . . . . .
>> . Setting . Supported . No . No . Supported .
>> . address range. . . . .
>> . . . . . .
>> ................................................................
>> . . . . . .
>> . Setting . Supported . Supported . No . No .
>> . scrub rate . . . . .
>> . . . . . .
>> ................................................................
>> . . . . . .
>> . Unit for . Not . in hours . No . No .
>> . scrub rate . Defined . . . .
>> . . . . . .
>> ................................................................
>> . . Supported . . . .
>> . Scrub . on-demand . No . No . Supported .
>> . status/ . scrubbing . . . .
>> . Completion . only . . . .
>> ................................................................
>> . UC error . .CXL general.CXL general. ACPI UCE .
>> . reporting . Exception .media/DRAM .media/DRAM . notify and.
>> . . .event/media.event/media. query .
>> . . .scan? .scan? . ARS status.
>> ................................................................
>> . . . . . .
>> . Clear UC . No . No . No . Supported .
>> . error . . . . .
>> . . . . . .
>> ................................................................
>> . . . . . .
>> . Translate . No . No . No . Supported .
>> . *(1)SPA to . . . . .
>> . *(2)DPA . . . . .
>> ................................................................
>> . . . . . .
>> . Error inject . No . Can inject. No . Supported .
>> . . . poison for. . .
>> . . . CXL . . .
>> ................................................................
>> *(1) - SPA - System Physical Address. See section 9.19.7.8
>> Function Index 5 - Translate SPA of ACPI spec r6.5.
>> *(2) - DPA - Device Physical Address. See section 9.19.7.8
>> Function Index 5 - Translate SPA of ACPI spec r6.5.
>>
>> CXL Scrubbing features
>> ======================
>> Add support for controlling the CXL patrol scrubber and the ACPI RAS2 HW
>> based memory patrol scrubber, and register them with the EDAC scrub to
>> expose the scrub controls to the userspace tool.
>>
>> CXL spec r3.1 section 8.2.9.9.11.1 describes the memory device patrol
>> scrub control feature. The device patrol scrub proactively locates and
>> corrects errors in a regular cycle. The patrol scrub control
>> allows the requester to configure the patrol scrubber's input configurations.
>>
>> The patrol scrub control allows the requester to specify the number of
>> hours in which the patrol scrub cycles must be completed, provided
>> that the requested number is not less than the minimum number of hours
>> for the patrol scrub cycle that the device is capable of. In addition,
>> the patrol scrub controls allow the host to disable and enable the
>> feature in case disabling of the feature is needed for other purposes
>> such as performance-aware operations which require the background
>> operations to be turned off.
>>
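The cycle check described above can be sketched as follows. This is an
illustration only, not the driver code: the helper name is hypothetical, the
request is assumed to arrive in seconds (the unit the v10 sysfs interface
moved to) and the device is assumed to be programmed in whole hours, rounding
up so the programmed cycle is never shorter than requested.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SECS_PER_HOUR 3600u

/*
 * Convert a requested scrub cycle in seconds to whole hours.
 * Returns 0 and stores the result in *hours on success, or -EINVAL when
 * the request is below the device's minimum cycle or does not fit.
 */
static int demo_scrub_cycle_secs_to_hours(uint64_t req_secs, uint32_t min_hours,
					  uint32_t *hours)
{
	uint64_t h;

	assert(hours != NULL);

	/* Round up: a partial hour still needs a full hour on the device. */
	h = req_secs / DEMO_SECS_PER_HOUR + (req_secs % DEMO_SECS_PER_HOUR != 0);

	if (h < min_hours || h > UINT32_MAX)
		return -EINVAL;

	*hours = (uint32_t)h;
	return 0;
}
```

For a device whose minimum cycle is 2 hours, a request of 7200 seconds maps to
2 hours, 7201 seconds maps to 3 hours, and 1800 seconds is rejected with
-EINVAL.
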
>> The Error Check Scrub (ECS) is a feature defined in JEDEC DDR5 SDRAM
>> Specification (JESD79-5) and allows the DRAM to internally read,
>> correct single-bit errors, and write back corrected data bits to the
>> DRAM array while providing transparency to error counts.
>>
>> The DDR5 device contains a number of memory media FRUs per device. The
>> DDR5 ECS feature and thus the ECS control driver supports configuring
>> the ECS parameters per FRU.
>>
>> ACPI RAS2 Hardware-based Memory Scrubbing
>> =========================================
>> ACPI spec r6.5 section 5.2.21 ACPI RAS2 describes the ACPI RAS2 table, which
>> provides interfaces for platform RAS features and supports independent
>> RAS controls and capabilities for a given RAS feature for multiple
>> instances of the same component in a given system.
>> Memory RAS features apply to RAS capabilities, controls and operations
>> that are specific to memory. RAS2 PCC sub-spaces for memory-specific
>> RAS features have a Feature Type of 0x00 (Memory).
>>
>> The platform can use the hardware-based memory scrubbing feature to
>> expose controls and capabilities associated with hardware-based memory
>> scrub engines. The RAS2 memory scrubbing feature supports the following as
>> per the spec:
>> - Independent memory scrubbing controls for each NUMA domain, identified
>> using its proximity domain.
>> Note: However AmpereComputing has single entry repeated as they have
>> centralized controls.
>> - Provision for background (patrol) scrubbing of the entire memory system,
>> as well as on-demand scrubbing for a specific region of memory.
>>
>> ACPI Address Range Scrubbing (ARS)
>> ==================================
>> ARS allows the platform to communicate memory errors to system software.
>> This capability allows system software to prevent accesses to
>> addresses with uncorrectable errors in memory. ARS functions manage
>> all NVDIMMs present in the system. Only one scrub can be in progress
>> system wide at any given time.
>> Following functions are supported as per the specification.
>> 1. Query ARS Capabilities for a given address range, indicates platform
>> supports the ACPI NVDIMM Root Device Unconsumed Error Notification.
>> 2. Start ARS triggers an Address Range Scrub for the given memory range.
>> Address scrubbing can be done for volatile memory, persistent memory,
>> or both.
>> 3. Query ARS Status command allows software to get the status of ARS,
>> including the progress of ARS and ARS error record.
>> 4. Clear Uncorrectable Error.
>> 5. Translate SPA
>> 6. ARS Error Inject etc.
>> Note: Support for ARS is not added in this series, to reduce the lines of
>> code for review; it could be added after the initial code is merged.
>> We'd like feedback on whether this is of interest to the ARS community.
>>
>> Series adds,
>> 1. Generic EDAC RAS feature driver, EDAC scrub driver, EDAC ECS driver
>> supports memory scrub control, ECS control and other RAS features
>> in the system.
>> 2. Support for CXL feature mailbox commands, which is used by
>> CXL device scrubbing features.
>> 3. CXL scrub driver supporting patrol scrub control (device and
>> region based).
>> 4. CXL ECS driver supporting ECS control feature.
>> 5. ACPI RAS2 driver adds an OS interface for RAS2 communication through
>> the PCC mailbox, extracts the ACPI RAS2 feature table (RAS2) and
>> creates a platform device for the RAS memory features, which binds
>> to the memory ACPI RAS2 driver.
>> 6. Memory ACPI RAS2 driver gets the PCC subspace for communicating
>> with an ACPI compliant platform that supports ACPI RAS2. It adds callback
>> functions and registers with EDAC scrub to let the user
>> control the HW patrol scrubbers exposed to the kernel via the
>> ACPI RAS2 table.
>>
>> The CXL specific scrub and ECS features are supported in
>> https://gitlab.com/qemu-project/qemu.git
>>
>> Open Questions based on feedback from the community:
>> 1. Leo: Standardize unit for scrub rate, for example ACPI RAS2 does not define
>> unit for the scrub rate. RAS2 clarification needed.
>> 2. Jonathan: Any need for discoverability of capability to scan different regions,
>> such as global PA space to the userspace. Left as future extension.
>> 3. Jiaqi:
>> - STOP_PATROL_SCRUBBER from RAS2 must be blocked and must not be exposed to
>> OS/userspace. Stopping the patrol scrubber is unacceptable for platforms where
>> the OEM has enabled the patrol scrubber, because the patrol scrubber is a key part
>> of logging and is repurposed for other RAS actions.
>> If the OEM does not want to expose this control, they should lock it down so the
>> interface is not exposed to the OS. These features are optional after all.
>> - "Requested Address Range"/"Actual Address Range" (region to scrub) is a
>> similarly bad thing to expose in RAS2.
>> If the OEM does not want to expose this, they should lock it down so the
>> interface is not exposed to the OS. These features are optional after all.
>> 4. Borislav:
>> - How will the scrub control exposed to the userspace be used?
>> POC added in rasdaemon with dynamic scrub control for CXL memory media
>> errors and memory errors reported to the userspace.
>> https://github.com/shijujose4/rasdaemon/tree/scrub_control_6_june_2024
>> - Is the scrub interface sufficient for the use cases?
>> - Who is going to use the scrub controls: tools/admin/scripts?
>> 1) Rasdaemon for dynamic control
>> 2) Udev script for more static 'defaults' on hotplug etc.
>>
>> References:
>> 1. ACPI spec r6.5 section 5.2.21 ACPI RAS2.
>> 2. ACPI spec r6.5 section 9.19.7.2 ARS.
>> 3. CXL spec r3.1 8.2.9.9.11.1 Device patrol scrub control feature
>> 4. CXL spec r3.1 8.2.9.9.11.2 DDR5 ECS feature
>> 5. Background information about kernel support for memory scan, memory
>> error detection and ACPI RASF.
>> https://lore.kernel.org/all/20221103155029.2451105-1-jiaqiyan@google.com/
>> 6. Discussions on RASF:
>> https://lore.kernel.org/lkml/20230915172818.761-1-shiju.jose@huawei.com/#r
>>
>> Changes
>> =======
>> v9 -> v10:
>> 1. Feedback from Mauro Carvalho Chehab:
>> - Changes suggested in EDAC RAS feature driver.
>> use uppercase for enums, if else to switch-case, documentation for
>> static scrub and ecs init functions etc.
>> - Changes suggested in EDAC scrub.
>> unit of scrub cycle hour to seconds.
>> attribute node cycle_in_hours_avaiable to min_cycle_duration and
>> max_cycle_duration.
>> attribute node cycle_in_hours to current_cycle_duration.
>> Use base 0 for kstrtou64() and kstrtol() functions.
>> etc.
>> - Changes suggested in EDAC ECS.
>> uppercase for enums
>> add ABI documentation. etc
>>
>> 2. Feedback from Fan:
>> - Changes suggested in EDAC RAS feature driver.
>> use uppercase for enums, change if...else to switch-case.
>> some optimization in edac_ras_dev_register() function
>> add missing goto free_ctx
>> - Changes suggested in the code for feature commands.
>> - CXL driver scrub and ECS code
>> use uppercase for enums, fix typo, use enum type for mode
>> fix long lines etc.
>>
>> v8 -> v9:
>> 1. Feedback from Borislav:
>> - Add scrub control driver to the EDAC on feedback from Borislav.
>> - Changed DEVICE_ATTR_..() static.
>> - Changed the write permissions for scrub control sysfs files as
>> root-only.
>> 2. Feedback from Fan:
>> - Optimized cxl_get_feature() function by using min() and removed
>> feat_out_min_size.
>> - Removed unreached return from cxl_set_feature() function.
>> - Changed the term "rate" to "cycle_in_hours" in all the
>> scrub control code.
>> - Allow cxl_mem_probe() continue if cxl_mem_patrol_scrub_init() fail,
>> with just a debug warning.
>>
>> 3. Feedback from Jonathan:
>> - Removed patch __free() based cleanup function for acpi_put_table.
>> and added fix in the acpi ras2 driver.
>>
>> 4. Feedback from Dan Williams:
>> - Allow cxl_mem_probe() continue if cxl_mem_patrol_scrub_init() fail,
>> with just a debug warning.
>> - Add support for CXL region based scrub control.
>>
>> 5. Feedback from Daniel Ferguson on RAS2 drivers:
>> In the ACPI RAS2 driver,
>> - Incorporated the changes given for clearing error reported.
>> - Incorporated the changes given for check the Set RAS Capability
>> status and return an appropriate error.
>> In the RAS2 memory driver,
>> - Added more checks for start/stop bg and on-demand scrubbing
>> so that addr range in cache do not get cleared and restrict
>> permitted operations during scrubbing.
>>
>> v7 -> v8:
>> 1. Add more detailed cover letter and add info for basic analysis
>> of ACPI ARS for comment from Dan Williams.
>> 2. Changed file name etc from ras2 to acpi_ras2 in memory ACPI RAS2
>> driver for comment from Boris.
>> 3. Add documents for usage for comment from Jonathan.
>> 4. Changed logic in memory/acpi_ras2.c for enable background
>> scrubbing to allow setting the scrub rate.
>> 5. Merged memory/acpi_ras2_common.c with memory/acpi_ras2.c and
>> obsolete code, suggested by Jonathan.
>> 6. Initial optimizations and cleanup especially in the memory/acpi_ras2.
>> 7. Removed CXL ECS support for time being.
>> 8. Removed support for region based scrub control from the scrub
>> subsystem, which was needed for the CXL ECS, can be added later
>> if required.
>> 9. Fixed the format of few comments and a definition in CXL feature
>> code for the feedback from Fan.
>> 11. Jonathan did several optimizations, interface changes and
>> cleanups all over the code.
>> 12. Fixes for feedback from Daniel Ferguson(Amperecomputing)
>> for RAS2.
>> 13. Workaround for a RAS2 case of only one actual controller as
>> reported by Daniel Ferguson(AmpereComputing) in their hardware.
>> 14. Feedback from Yazen, move the common scrub and ras2 changes
>> under /drivers/ras/.
>> 15. Drop patch ACPICA: ACPI 6.5: Add support for RAS2 table because
>> Rafael queued the patch.
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=bleeding-edge&id=9726d821f88e284ecd998b76ae5f2174721cd9dc
>>
>> v6 -> v7:
>> 1. Main changes for comments from Jonathan, Thanks.
>> 1.1. CXL
>> - Changes to deal with small mailbox and support multipart
>> feature data transfers.
>> - Provide more specific parameters to mbox supported/get/set features
>> interface functions.
>> - kvmalloc -> kmalloc in CXL scrub mem allocation for feature commands.
>> - Changed the way using __free(kfree)
>> - Removed readback and verify for setting CXL scrub patrol and ECS
>> parameters. Could be added later if needed.
>> - In is_visible() callback functions for scrub control sysfs attrs
>> changed to writeback the default attribute mode value instead of
>> setting per attrs.
>> - Add documentation for sysfs interfaces for CXL ECS scrub control.
>> 1.2. RAS2
>> - In rasf common code, rename rasf to ras2 because RASF seems obsolete.
>> - Replace pr_* with dev_* log function calls from ACPI RAS2 and
>> memory RAS2 drivers.
>> - In rasf common code, rename rasf to ras2.
>> - Removed including unnecessary .h file from memory RAS2 driver.
>> - In is_visible() callback functions for scrub control sysfs attrs
>> changed to writeback the default attribute mode value instead of
>> setting per attribute.
>>
>> 2. Changes for comments from Fan, Thanks.
>> - Add debug message if cxl patrol scrub and ecs init function
>> calls fail.
>> 3. Updated cover letter for feedback from Dan Williams.
>>
>> v5 -> v6:
>> 1. Changes for comments from Davidlohr, Thanks.
>> - Update CXL feature code based on spec 3.1.
>> - attrb -> attr
>> - Use enums with default counting.
>> 2. Rebased to the latest kernel.
>>
>> v4 -> v5:
>> 1. Following are the main changes made based on the feedback from Dan Williams on v4.
>> 1.1. In the scrub subsystem the common scrub control attributes are statically defined
>> instead of dynamically created.
>> 1.2. Add scrub subsystem support externally defined attribute group.
>> Add CXL ECS driver define ECS specific attribute group and pass to
>> the scrub subsystem.
>> 1.3. Move cxl_mem_ecs_init() to cxl/core/region.c so that the CXL region_id
>> is used in the registration with the scrub subsystem.
>> 1.4. Add previously posted RASF common and RAS2 patches to this scrub series.
>>
>> 2. Add support for the 'enable_background_scrub' attribute
>> for RAS2, on request from Bill Schwartz(wschwartz@...erecomputing.com).
>>
>> v3 -> v4:
>> 1. Fixes for the warnings/errors reported by kernel test robot.
>> 2. Add support for reading the 'enable' attribute of CXL patrol scrub.
>>
>> Changes
>> v2 -> v3:
>> 1. Changes for comments from Davidlohr, Thanks.
>> - Updated cxl scrub kconfig
>> - removed usage of the flag is_support_feature from
>> the function cxl_mem_get_supported_feature_entry().
>> - corrected spelling error.
>> - removed unnecessary debug message.
>> - removed export feature commands to the userspace.
>> 2. Possible fix for the warnings/errors reported by kernel
>> test robot.
>> 3. Add documentation for the common scrub configure attributes.
>>
>> v1 -> v2:
>> 1. Changes for comments from Dave Jiang, Thanks.
>> - Split patches.
>> - reversed xmas tree declarations.
>> - declared flags as enums.
>> - removed few unnecessary variable initializations.
>> - replaced PTR_ERR_OR_ZERO() with IS_ERR() and PTR_ERR().
>> - add auto clean declarations.
>> - replaced while loop with for loop.
>> - Removed allocation from cxl_get_supported_features() and
>> cxl_get_feature() and make change to take allocated memory
>> pointer from the caller.
>> - replaced if/else with switch case.
>> - replaced sprintf() with sysfs_emit() in 2 places.
>> - replaced goto label with return in few functions.
>> 2. removed unused code for supported attributes from ecs.
>> 3. Included following common patch for scrub configure driver
>> to this series.
>> "memory: scrub: Add scrub driver supports configuring memory scrubbers
>> in the system"
>>
>>
>> Jonathan Cameron (1):
>> platform: Add __free() based cleanup function for
>> platform_device_put
>>
>> Shiju Jose (10):
>> EDAC: Add generic EDAC RAS control feature driver
>> EDAC: Add EDAC scrub control driver
>> EDAC: Add EDAC ECS control driver
>> cxl/mbox: Add GET_SUPPORTED_FEATURES mailbox command
>> cxl/mbox: Add GET_FEATURE mailbox command
>> cxl/mbox: Add SET_FEATURE mailbox command
>> cxl/memscrub: Add CXL memory device patrol scrub control feature
>> cxl/memscrub: Add CXL memory device ECS control feature
>> ACPI:RAS2: Add ACPI RAS2 driver
>> ras: scrub: ACPI RAS2: Add memory ACPI RAS2 driver
>>
>> Documentation/ABI/testing/sysfs-edac-ecs | 85 ++
>> Documentation/ABI/testing/sysfs-edac-scrub | 76 ++
>> Documentation/scrub/edac-scrub.rst | 115 +++
>> drivers/acpi/Kconfig | 10 +
>> drivers/acpi/Makefile | 1 +
>> drivers/acpi/ras2.c | 391 +++++++++
>> drivers/cxl/Kconfig | 19 +
>> drivers/cxl/core/Makefile | 1 +
>> drivers/cxl/core/mbox.c | 139 ++++
>> drivers/cxl/core/memscrub.c | 873 +++++++++++++++++++++
>> drivers/cxl/core/region.c | 6 +
>> drivers/cxl/cxlmem.h | 129 +++
>> drivers/cxl/mem.c | 4 +
>> drivers/edac/Makefile | 1 +
>> drivers/edac/edac_ecs.c | 396 ++++++++++
>> drivers/edac/edac_ras_feature.c | 187 +++++
>> drivers/edac/edac_scrub.c | 333 ++++++++
>> drivers/ras/Kconfig | 10 +
>> drivers/ras/Makefile | 1 +
>> drivers/ras/acpi_ras2.c | 415 ++++++++++
>> include/acpi/ras2_acpi.h | 59 ++
>> include/linux/edac_ras_feature.h | 132 ++++
>> include/linux/platform_device.h | 1 +
>> 23 files changed, 3384 insertions(+)
>> create mode 100644 Documentation/ABI/testing/sysfs-edac-ecs
>> create mode 100644 Documentation/ABI/testing/sysfs-edac-scrub
>> create mode 100644 Documentation/scrub/edac-scrub.rst
>> create mode 100755 drivers/acpi/ras2.c
>> create mode 100644 drivers/cxl/core/memscrub.c
>> create mode 100755 drivers/edac/edac_ecs.c
>> create mode 100755 drivers/edac/edac_ras_feature.c
>> create mode 100755 drivers/edac/edac_scrub.c
>> create mode 100644 drivers/ras/acpi_ras2.c
>> create mode 100644 include/acpi/ras2_acpi.h
>> create mode 100755 include/linux/edac_ras_feature.h
>>
>> --
>> 2.34.1
>>