Message-Id: <20250910093347.75822-1-tianruidong@linux.alibaba.com>
Date: Wed, 10 Sep 2025 17:33:42 +0800
From: Ruidong Tian <tianruidong@...ux.alibaba.com>
To: xueshuai@...ux.alibaba.com,
	palmer@...belt.com,
	paul.walmsley@...ive.com,
	linux-riscv@...ts.infradead.org,
	linux-kernel@...r.kernel.org,
	linux-acpi@...r.kernel.org
Cc: james.morse@....com,
	tony.luck@...el.com,
	cleger@...osinc.com,
	hchauhan@...tanamicro.com,
	tianruidong@...ux.alibaba.com
Subject: [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception

Hi all,
This patch series introduces support for handling synchronous hardware errors 
on RISC-V, laying the groundwork for more robust kernel-mode error recovery.

1. Background
Hardware error reporting mechanisms typically fall into two categories: 
asynchronous and synchronous.

- Asynchronous errors (e.g., memory scrubbing errors), reported by an
asynchronous exception or an interrupt, are usually handled by the GHES
subsystem. For instance, ARM uses SDEI, and a similar SSE specification is
being proposed for RISC-V.
- Synchronous errors (e.g., reading poisoned data) cause the processor core to 
take a precise exception. This is known as a Synchronous External Abort (SEA)
on ARM, a Machine Check Exception (MCE) on x86, and is designated as a trap
with mcause 19 on RISC-V.
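
As a rough illustration of the distinction above, the sketch below classifies
a trap cause value. It assumes an RV64 cause register whose most significant
bit marks interrupts; the macro and function names are made up for this
example and are not taken from the patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names only, not identifiers from this series. */
#define CAUSE_IRQ_FLAG      (UINT64_C(1) << 63) /* RV64: MSB set means interrupt */
#define EXC_HARDWARE_ERROR  UINT64_C(19)        /* exception code from the text */

/*
 * Return true when an RV64 cause value describes the synchronous
 * hardware error exception rather than an interrupt or another trap.
 */
static bool is_hw_error_exception(uint64_t cause)
{
	if (cause & CAUSE_IRQ_FLAG)
		return false;	/* asynchronous: delivered as an interrupt */

	return cause == EXC_HARDWARE_ERROR;
}
```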

Discussions within the RVI PRS TG have already led to proposals[0] to UEFI for 
standardizing two notification methods, SSE and Hardware Error Exception, 
on RISC-V. 
This series focuses on implementing Hardware Error Exception notification to
handle synchronous errors. Himanshu Chauhan has already started working on SSE[1].

2. Motivation
When a synchronous hardware error occurs in kernel context (e.g., during
get_user, put_user, CoW, etc.), the kernel requires a fixup mechanism (via
extable) to recover from the error and prevent a system panic. However, the
APEI/GHES subsystem, being asynchronous, cannot directly leverage the
synchronous extable fixup path.

By handling the synchronous exception directly, we enable the use of this fixup
mechanism, allowing the kernel to gracefully recover from hardware errors
encountered during kernel execution. This brings RISC-V's error handling
capabilities closer to the robustness found on ARM[2] and x86[3].
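
To make the extable idea concrete, here is a simplified user-space model, not
the kernel's implementation. It assumes a table sorted by faulting instruction
address; struct ex_entry, ex_search() and ex_fixup() are hypothetical names
(real kernel tables store relative offsets and extra type data). The point is
that a precise, synchronous exception supplies the faulting pc that the lookup
needs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified exception table entry. */
struct ex_entry {
	uintptr_t insn;		/* address of an instruction that may fault */
	uintptr_t fixup;	/* address to resume at if it does */
};

/* Binary search in a table sorted by insn; NULL if pc is not covered. */
static const struct ex_entry *ex_search(const struct ex_entry *tbl,
					size_t n, uintptr_t pc)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].insn == pc)
			return &tbl[mid];
		if (tbl[mid].insn < pc)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}

/*
 * Model of the synchronous path: return 1 and redirect *pc to the fixup
 * when the faulting instruction is covered, 0 otherwise (the case where
 * a kernel would have to panic).
 */
static int ex_fixup(const struct ex_entry *tbl, size_t n, uintptr_t *pc)
{
	const struct ex_entry *e = ex_search(tbl, n, *pc);

	if (!e)
		return 0;
	*pc = e->fixup;
	return 1;
}
```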

3. What This Patch Series Does
This initial series lays the foundational infrastructure. It primarily:
- Introduces a new exception handler for synchronous hardware errors (mcause=19).
- Establishes the core exception path, which is a prerequisite for kernel
  context error recovery.

Please note that this version does not yet implement the full kernel fixup logic
for recovery. That functionality is planned for the next formal version.

Some adaptations for GHES are included, based on the work from Himanshu
Chauhan[1].

4. Future Plans
- Implement full kernel fixup support to handle and recover from errors in
  some kernel contexts[2].
- Add support for handling "double trap" scenarios.

5. Testing Methodology

test program: ras-tools: https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools/
qemu: https://github.com/winterddd/qemu
firmware: official opensbi and edk2

- Run qemu:
qemu-system-riscv64 -M virt,pflash0=pflash0,pflash1=pflash1,acpi=on,aia=aplic-imsic \
 -cpu max -m 64G -smp 64 -device virtio-gpu-pci -full-screen -device qemu-xhci \
 -device usb-kbd -device virtio-rng-pci \
 -blockdev node-name=pflash0,driver=file,read-only=on,filename=RISCV_VIRT_CODE.fd \
 -blockdev node-name=pflash1,driver=file,filename=RISCV_VIRT_VARS.fd \
 -bios fw_dynamic.bin -device virtio-net-device,netdev=net0 \
 -netdev user,id=net0,hostfwd=tcp::2223-:22 \
 -kernel Image -initrd rootfs \
 -append "rdinit=/sbin/init earlycon verbose debug strict_devmem=0 nokaslr" \
 -monitor telnet:127.0.0.1:5557,server,nowait -nographic

- Run ras-tools:
./einj_mem_uc -j -k single &
$ 0: single   vaddr = 0x7fff86ff4400 paddr = 107d11b400

- Inject poison
telnet localhost 5557
poison_enable on
poison_add 0x107d11b400

- Read poison
echo trigger > ./trigger_start
$ triggering ...
$ signal 7 code 3 addr 0x7fff86ff4400
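
For readers decoding the last line of that output: assuming the two numbers
are the raw signal number and si_code, signal 7 is SIGBUS on Linux and the
code can be mapped to a name as sketched below. The numeric values are copied
from Linux's generic UAPI definitions and are Linux-specific; the helper name
is made up for this example.

```c
/* Linux generic signal number for SIGBUS (other systems may differ). */
enum { LINUX_SIGBUS = 7 };

/*
 * Map a SIGBUS si_code to its Linux name. Values follow the generic
 * Linux siginfo definitions; anything else is reported as "unknown".
 */
static const char *sigbus_code_name(int si_code)
{
	switch (si_code) {
	case 1: return "BUS_ADRALN";	/* invalid address alignment */
	case 2: return "BUS_ADRERR";	/* non-existent physical address */
	case 3: return "BUS_OBJERR";	/* object-specific hardware error */
	case 4: return "BUS_MCEERR_AR";	/* hardware memory error, action required */
	case 5: return "BUS_MCEERR_AO";	/* hardware memory error, action optional */
	default: return "unknown";
	}
}
```

With these values, "signal 7 code 3" reads as SIGBUS with si_code BUS_OBJERR,
delivered for the poisoned user address 0x7fff86ff4400.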

[0]: https://lists.riscv.org/g/tech-prs/topic/risc_v_ras_related_ecrs/113685653 
[1]: https://patchew.org/linux/20250227123628.2931490-1-hchauhan@ventanamicro.com/
[2]: https://lore.kernel.org/lkml/20241209024257.3618492-1-tongtiangen@huawei.com/
[3]: https://github.com/torvalds/linux/blob/9dd1835ecda5b96ac88c166f4a87386f3e727bd9/arch/x86/kernel/cpu/mce/core.c#L1514

Himanshu Chauhan (2):
  riscv: Define ioremap_cache for RISC-V
  riscv: Define arch_apei_get_mem_attribute for RISC-V

Ruidong Tian (3):
  acpi: Introduce SSE and HEE in HEST notification types
  riscv: Introduce HEST HEE notification handlers for APEI
  riscv: Add Hardware Error Exception trap handler

 arch/riscv/Kconfig              |  1 +
 arch/riscv/include/asm/acpi.h   | 22 +++++++++++++
 arch/riscv/include/asm/fixmap.h |  6 ++++
 arch/riscv/include/asm/io.h     |  3 ++
 arch/riscv/kernel/acpi.c        | 55 +++++++++++++++++++++++++++++++
 arch/riscv/kernel/entry.S       |  4 +++
 arch/riscv/kernel/traps.c       | 19 +++++++++++
 drivers/acpi/apei/Kconfig       | 12 +++++++
 drivers/acpi/apei/ghes.c        | 58 +++++++++++++++++++++++++++++++++
 include/acpi/actbl1.h           |  4 ++-
 include/acpi/ghes.h             |  6 ++++
 11 files changed, 189 insertions(+), 1 deletion(-)

-- 
2.43.7

