Message-Id: <20250910093347.75822-1-tianruidong@linux.alibaba.com>
Date: Wed, 10 Sep 2025 17:33:42 +0800
From: Ruidong Tian <tianruidong@...ux.alibaba.com>
To: xueshuai@...ux.alibaba.com,
palmer@...belt.com,
paul.walmsley@...ive.com,
linux-riscv@...ts.infradead.org,
linux-kernel@...r.kernel.org,
linux-acpi@...r.kernel.org
Cc: james.morse@....com,
tony.luck@...el.com,
cleger@...osinc.com,
hchauhan@...tanamicro.com,
tianruidong@...ux.alibaba.com
Subject: [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception
Hi all,
This patch series introduces support for handling synchronous hardware errors
on RISC-V, laying the groundwork for more robust kernel-mode error recovery.
1. Background
Hardware error reporting mechanisms typically fall into two categories:
asynchronous and synchronous.
- Asynchronous errors (e.g., memory scrubbing errors), reported by an asynchronous
exception or an interrupt, are usually handled by the GHES subsystem. For instance,
ARM uses SDEI, and a similar SSE specification is being proposed for RISC-V.
- Synchronous errors (e.g., reading poisoned data) cause the processor core to
take a precise exception. This is known as a Synchronous External Abort (SEA)
on ARM, a Machine Check Exception (MCE) on x86, and is designated as a trap with
mcause 19 on RISC-V.
Discussions within the RVI PRS TG have already led to proposals[0] to UEFI for
standardizing two notification methods, SSE and Hardware Error Exception,
on RISC-V.
This series focuses on implementing Hardware Error Exception notification to
handle synchronous errors. Himanshu Chauhan has already started working on SSE[1].
2. Motivation
When a synchronous hardware error occurs in kernel context (e.g., during
get_user, put_user, CoW, etc.), the kernel requires a fixup mechanism (via
extable) to recover from the error and prevent a system panic. However, the
APEI/GHES subsystem, being asynchronous, cannot directly leverage the synchronous
extable fixup path.
By handling the synchronous exception directly, we enable the use of this fixup
mechanism, allowing the kernel to gracefully recover from hardware errors
encountered during kernel execution. This brings RISC-V's error handling
capabilities closer to the robustness found on ARM[2] and x86[3].
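To illustrate the extable idea, here is a minimal user-space sketch in C of the
lookup a synchronous trap handler can perform: search a sorted table for the
faulting pc and, if an entry exists, resume at its fixup address. All names
(toy_extable_entry, toy_search_extable, toy_fixup_exception) and all addresses
are hypothetical. This is a toy model, not the kernel's extable code and not
code from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout; the kernel's real entries differ. */
struct toy_extable_entry {
    uintptr_t insn;   /* address of an instruction that may fault */
    uintptr_t fixup;  /* address to resume at if it does fault */
};

/* Made-up addresses, kept sorted by insn so binary search works. */
static const struct toy_extable_entry toy_extable[] = {
    { 0x1000, 0x9000 },
    { 0x1040, 0x9010 },
    { 0x2000, 0x9020 },
};

/* Binary search for an entry whose insn equals the faulting pc. */
static const struct toy_extable_entry *toy_search_extable(uintptr_t pc)
{
    size_t lo = 0;
    size_t hi = sizeof(toy_extable) / sizeof(toy_extable[0]);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (toy_extable[mid].insn == pc)
            return &toy_extable[mid];
        if (toy_extable[mid].insn < pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/*
 * Returns 1 and redirects *pc to the fixup address when the faulting
 * instruction is covered by the table; returns 0 and leaves *pc alone
 * otherwise (the case in which a real kernel could not recover).
 */
static int toy_fixup_exception(uintptr_t *pc)
{
    const struct toy_extable_entry *e;

    assert(pc != NULL);
    e = toy_search_extable(*pc);
    if (!e)
        return 0;
    *pc = e->fixup;
    return 1;
}
```

With this table, a fault at pc 0x1040 is redirected to 0x9010, while a fault at
an uncovered pc such as 0x1234 is reported as unrecoverable. The lookup needs
the faulting pc, which is why it fits a synchronous trap path.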
3. What This Patch Series Does
This initial series lays the foundational infrastructure. It primarily:
- Introduces a new exception handler for synchronous hardware errors (mcause=19).
- Establishes the core exception path, which is a prerequisite for kernel
context error recovery.
Please note that this version does not yet implement the full kernel fixup logic
for recovery. That functionality is planned for the next formal version.
Some adaptations for GHES are included, based on the work from Himanshu Chauhan[1].
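As a rough illustration of the dispatch described in this section, the
following C sketch classifies a trap cause value and picks an action for
cause 19. It is a sketch under assumptions, not the handler added by this
series: the toy_* names are hypothetical, the RV64 layout (interrupt flag in
bit 63) is assumed, and the user/kernel split only mirrors the behaviour
described in this letter (a signal for user context, fixup planned for kernel
context).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: RV64, where the top bit of the cause register flags interrupts. */
#define TOY_CAUSE_IRQ_FLAG      (UINT64_C(1) << 63)
/* Exception code for hardware errors, as named in this cover letter. */
#define TOY_EXC_HARDWARE_ERROR  19

enum toy_trap_kind {
    TOY_TRAP_INTERRUPT,
    TOY_TRAP_HW_ERROR,
    TOY_TRAP_OTHER_EXCEPTION,
};

enum toy_action {
    TOY_ACT_NOT_OURS,   /* some other trap; a different handler applies */
    TOY_ACT_SIGBUS,     /* user context: report the error to the task */
    TOY_ACT_TRY_FIXUP,  /* kernel context: look for an extable fixup */
};

/* Split the cause value into interrupt, hardware error, or other exception. */
static enum toy_trap_kind toy_classify_trap(uint64_t cause)
{
    if (cause & TOY_CAUSE_IRQ_FLAG)
        return TOY_TRAP_INTERRUPT;
    if ((cause & ~TOY_CAUSE_IRQ_FLAG) == TOY_EXC_HARDWARE_ERROR)
        return TOY_TRAP_HW_ERROR;
    return TOY_TRAP_OTHER_EXCEPTION;
}

/* Choose what to do for a hardware error, based on where the trap came from. */
static enum toy_action toy_hw_error_action(uint64_t cause, int from_user)
{
    assert(from_user == 0 || from_user == 1);
    if (toy_classify_trap(cause) != TOY_TRAP_HW_ERROR)
        return TOY_ACT_NOT_OURS;
    return from_user ? TOY_ACT_SIGBUS : TOY_ACT_TRY_FIXUP;
}
```

For example, toy_hw_error_action(19, 1) selects the signal path, and
toy_hw_error_action(19, 0) selects the fixup path that this version does not
yet implement.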
4. Future Plans
- Implement full kernel fixup support to handle and recover from errors in
some kernel contexts[2].
- Add support for handling "double trap" scenarios.
5. Testing Methodology
test program: ras-tools: https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools/
qemu: https://github.com/winterddd/qemu
firmware: official OpenSBI and edk2
- Run qemu:
qemu-system-riscv64 -M virt,pflash0=pflash0,pflash1=pflash1,acpi=on,aia=aplic-imsic \
  -cpu max -m 64G -smp 64 -device virtio-gpu-pci -full-screen -device qemu-xhci \
  -device usb-kbd -device virtio-rng-pci \
  -blockdev node-name=pflash0,driver=file,read-only=on,filename=RISCV_VIRT_CODE.fd \
  -blockdev node-name=pflash1,driver=file,filename=RISCV_VIRT_VARS.fd \
  -bios fw_dynamic.bin -device virtio-net-device,netdev=net0 \
  -netdev user,id=net0,hostfwd=tcp::2223-:22 \
  -kernel Image -initrd rootfs \
  -append "rdinit=/sbin/init earlycon verbose debug strict_devmem=0 nokaslr" \
  -monitor telnet:127.0.0.1:5557,server,nowait -nographic
- Run ras-tools:
./einj_mem_uc -j -k single &
$ 0: single vaddr = 0x7fff86ff4400 paddr = 107d11b400
- Inject poison
telnet localhost 5557
poison_enable on
poison_add 0x107d11b400
- Read poison
echo trigger > ./trigger_start
$ triggering ...
$ signal 7 code 3 addr 0x7fff86ff4400
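For readers decoding the last output line: under Linux's generic signal
numbering (which RISC-V uses), signal 7 is SIGBUS and si_code 3 is BUS_OBJERR,
and the reported address matches the vaddr printed by einj_mem_uc above. The
small C sketch below parses a line of that shape and names the code. The toy_*
names are hypothetical helpers, and the numeric constants are copied in by hand
rather than taken from <signal.h>, so this is an illustration and not part of
ras-tools.

```c
#include <assert.h>
#include <stdio.h>

/* Values from Linux's generic signal ABI, restated here by hand. */
#define TOY_SIGBUS         7
#define TOY_BUS_ADRALN     1
#define TOY_BUS_ADRERR     2
#define TOY_BUS_OBJERR     3
#define TOY_BUS_MCEERR_AR  4
#define TOY_BUS_MCEERR_AO  5

struct toy_report {
    int signo;
    int code;
    unsigned long long addr;
};

/* Parse "signal <n> code <n> addr <hex>"; returns 1 on success, 0 otherwise. */
static int toy_parse_report(const char *line, struct toy_report *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "signal %d code %d addr %llx",
                  &out->signo, &out->code, &out->addr) == 3;
}

/* Map a SIGBUS si_code to its symbolic name. */
static const char *toy_sigbus_code_name(int code)
{
    switch (code) {
    case TOY_BUS_ADRALN:    return "BUS_ADRALN";
    case TOY_BUS_ADRERR:    return "BUS_ADRERR";
    case TOY_BUS_OBJERR:    return "BUS_OBJERR";
    case TOY_BUS_MCEERR_AR: return "BUS_MCEERR_AR";
    case TOY_BUS_MCEERR_AO: return "BUS_MCEERR_AO";
    default:                return "unknown";
    }
}
```

Feeding it "signal 7 code 3 addr 0x7fff86ff4400" yields signo 7, code 3, and
the address 0x7fff86ff4400, which toy_sigbus_code_name() reports as BUS_OBJERR.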
[0]: https://lists.riscv.org/g/tech-prs/topic/risc_v_ras_related_ecrs/113685653
[1]: https://patchew.org/linux/20250227123628.2931490-1-hchauhan@ventanamicro.com/
[2]: https://lore.kernel.org/lkml/20241209024257.3618492-1-tongtiangen@huawei.com/
[3]: https://github.com/torvalds/linux/blob/9dd1835ecda5b96ac88c166f4a87386f3e727bd9/arch/x86/kernel/cpu/mce/core.c#L1514
Himanshu Chauhan (2):
riscv: Define ioremap_cache for RISC-V
riscv: Define arch_apei_get_mem_attribute for RISC-V
Ruidong Tian (3):
acpi: Introduce SSE and HEE in HEST notification types
riscv: Introduce HEST HEE notification handlers for APEI
riscv: Add Hardware Error Exception trap handler
arch/riscv/Kconfig | 1 +
arch/riscv/include/asm/acpi.h | 22 +++++++++++++
arch/riscv/include/asm/fixmap.h | 6 ++++
arch/riscv/include/asm/io.h | 3 ++
arch/riscv/kernel/acpi.c | 55 +++++++++++++++++++++++++++++++
arch/riscv/kernel/entry.S | 4 +++
arch/riscv/kernel/traps.c | 19 +++++++++++
drivers/acpi/apei/Kconfig | 12 +++++++
drivers/acpi/apei/ghes.c | 58 +++++++++++++++++++++++++++++++++
include/acpi/actbl1.h | 4 ++-
include/acpi/ghes.h | 6 ++++
11 files changed, 189 insertions(+), 1 deletion(-)
--
2.43.7