linux-kernel - Re: [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception

Message-ID: <CAK9=C2UnN0VFnGCRSHaYrwzjHFu-PfWpWkBzNfJNW9wwM8UOvw@mail.gmail.com>
Date: Wed, 10 Sep 2025 22:50:47 +0530
From: Anup Patel <apatel@...tanamicro.com>
To: Ruidong Tian <tianruidong@...ux.alibaba.com>
Cc: xueshuai@...ux.alibaba.com, palmer@...belt.com, paul.walmsley@...ive.com, 
	linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org, 
	linux-acpi@...r.kernel.org, james.morse@....com, tony.luck@...el.com, 
	cleger@...osinc.com, hchauhan@...tanamicro.com
Subject: Re: [RFC PATCH 0/5] riscv: Handle synchronous hardware error exception

On Wed, Sep 10, 2025 at 3:04 PM Ruidong Tian
<tianruidong@...ux.alibaba.com> wrote:
>
> Hi all,
> This patch series introduces support for handling synchronous hardware errors
> on RISC-V, laying the groundwork for more robust kernel-mode error recovery.
>
> 1. Background
> Hardware error reporting mechanisms typically fall into two categories:
> asynchronous and synchronous.
>
> - Asynchronous errors (e.g., memory scrubbing errors), reported by an asynchronous
> exception or an interrupt, are usually handled by the GHES subsystem. For instance,
> ARM uses SDEI, and a similar SSE specification is being proposed for RISC-V.
> - Synchronous errors (e.g., reading poisoned data) cause the processor core to
> take a precise exception. This is known as a Synchronous External Abort (SEA)
> on ARM, a Machine Check Exception (MCE) on x86, and is designated as a trap with
> mcause 19 on RISC-V.
>
> Discussions within the RVI PRS TG have already led to proposals[0] to UEFI for
> standardizing two notification methods, SSE and Hardware Error Exception,
> on RISC-V.
> This series focuses on implementing Hardware Error Exception notification to
> handle synchronous errors. Himanshu Chauhan has already started working on SSE[1].
>
> 2. Motivation
> When a synchronous hardware error occurs in kernel context (e.g., during
> get_user, put_user, CoW, etc.), the kernel requires a fixup mechanism (via
> extable) to recover from such errors and prevent a system panic. However, the
> APEI/GHES subsystem, being asynchronous, cannot directly leverage the synchronous
> extable fixup path.
>
> By handling the synchronous exception directly, we enable the use of this fixup
> mechanism, allowing the kernel to gracefully recover from hardware errors
> encountered during kernel execution. This brings RISC-V's error handling
> capabilities closer to the robustness found on ARM[2] and x86[3].
>
> 3. What This Patch Series Does
> This initial series lays the foundational infrastructure. It primarily:
> - Introduces a new exception handler for synchronous hardware errors (mcause=19).
> - Establishes the core exception path, which is a prerequisite for kernel
>   context error recovery.
>
> Please note that this version does not yet implement the full kernel fixup logic
> for recovery. That functionality is planned for the next formal version.
>
> Some adaptations for GHES are included, based on the work from Himanshu Chauhan[1].
>
> 4. Future Plans
> - Implement full kernel fixup support to handle and recover from errors in
>   some kernel contexts[2].
> - Add support for handling "double trap" scenarios.
>
> 5. Testing Methodology
>
> test program: ras-tools: https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools/
> qemu: https://github.com/winterddd/qemu
> official opensbi and edk2:
>
> - Run qemu:
> qemu-system-riscv64 -M virt,pflash0=pflash0,pflash1=pflash1,acpi=on,aia=aplic-imsic
>  -cpu max -m 64G -smp 64 -device virtio-gpu-pci -full-screen -device qemu-xhci
>  -device usb-kbd -device virtio-rng-pci
>  -blockdev node-name=pflash0,driver=file,read-only=on,filename=RISCV_VIRT_CODE.fd
>  -blockdev node-name=pflash1,driver=file,filename=RISCV_VIRT_VARS.fd
>  -bios fw_dynamic.bin -device virtio-net-device,netdev=net0
>  -netdev user,id=net0,hostfwd=tcp::2223-:22
>  -kernel Image -initrd rootfs
>  -append "rdinit=/sbin/init earlycon verbose debug strict_devmem=0 nokaslr"
>  -monitor telnet:127.0.0.1:5557,server,nowait -nographic
>
> - Run ras-tools:
> ./einj_mem_uc -j -k single &
> $ 0: single   vaddr = 0x7fff86ff4400 paddr = 107d11b400
>
> - Inject poison
> telnet localhost 5557
> poison_enable on
> poison_add 0x107d11b400
>
> - Read poison
> echo trigger > ./trigger_start
> $ triggering ...
> $ signal 7 code 3 addr 0x7fff86ff4400
>
> [0]: https://lists.riscv.org/g/tech-prs/topic/risc_v_ras_related_ecrs/113685653
> [1]: https://patchew.org/linux/20250227123628.2931490-1-hchauhan@ventanamicro.com/
> [2]: https://lore.kernel.org/lkml/20241209024257.3618492-1-tongtiangen@huawei.com/
> [3]: https://github.com/torvalds/linux/blob/9dd1835ecda5b96ac88c166f4a87386f3e727bd9/arch/x86/kernel/cpu/mce/core.c#L1514
>
> Himanshu Chauhan (2):
>   riscv: Define ioremap_cache for RISC-V
>   riscv: Define arch_apei_get_mem_attribute for RISC-V
>
> Ruidong Tian (3):
>   acpi: Introduce SSE and HEE in HEST notification types
>   riscv: Introduce HEST HEE notification handlers for APEI
>   riscv: Add Hardware Error Exception trap handler
>

Himanshu already sent out RFC v1 back in Feb 2025 [1], which
did not receive any comments or feedback.

Instead of sending out a half-baked series, it will be helpful if you
can review Himanshu's series.

Regards,
Anup

[1] https://patchew.org/linux/20250227123628.2931490-1-hchauhan@ventanamicro.com/