Message-ID: <CAJZ5v0h=QtcT7zhZEgrTjUk7EAk2OfbGG6BoEEv-3toKODMXQA@mail.gmail.com>
Date: Mon, 3 Nov 2025 17:19:43 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Junhao He <hejunhao3@...artners.com>
Cc: rafael@...nel.org, tony.luck@...el.com, bp@...en8.de, guohanjun@...wei.com,
mchehab@...nel.org, xueshuai@...ux.alibaba.com, jarkko@...nel.org,
yazen.ghannam@....com, jane.chu@...cle.com, lenb@...nel.org,
Jonathan.Cameron@...wei.com, linux-acpi@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-edac@...r.kernel.org, shiju.jose@...wei.com, tanxiaofei@...wei.com,
linuxarm@...wei.com
Subject: Re: [PATCH] ACPI: APEI: Handle repeated SEA error interrupts storm scenarios
On Thu, Oct 30, 2025 at 8:13 AM Junhao He <hejunhao3@...artners.com> wrote:
>
> The do_sea() function defaults to using firmware-first mode, if supported.
> It invokes the ACPI/APEI/GHES function ghes_notify_sea() to report and
> handle the SEA error. GHES uses a buffer to cache the 4 most recent kinds
> of SEA errors. If the same kind of SEA error keeps occurring, GHES skips
> reporting it and does not add it to the "ghes_estatus_llist" list until
> the cached entry times out after 10 seconds, at which point the SEA error
> is processed again.
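
For reference, the cache-and-skip behavior described above can be sketched as
a standalone model. This is illustrative only, not the kernel code: the names
(cache_filter, struct estatus_cache) and the use of plain integer keys and
timestamps are assumptions made for the sketch.

```c
#include <stdbool.h>

#define CACHE_SLOTS   4     /* the 4 most recent kinds of errors are cached */
#define CACHE_TIMEOUT 10    /* seconds before a cached error is re-reported */

struct estatus_cache {
	unsigned int key;   /* stands in for the hashed error record */
	long expires;       /* timestamp after which the entry is stale */
	bool used;
};

static struct estatus_cache slots[CACHE_SLOTS];

/* Returns true if this error was seen recently and should be skipped. */
static bool cache_filter(unsigned int key, long now)
{
	int i;

	for (i = 0; i < CACHE_SLOTS; i++) {
		if (slots[i].used && slots[i].key == key &&
		    now < slots[i].expires)
			return true;    /* duplicate within the timeout window */
	}
	/* Not cached (or stale): record it and let it be reported. */
	for (i = 0; i < CACHE_SLOTS; i++) {
		if (!slots[i].used || slots[i].expires <= now) {
			slots[i].key = key;
			slots[i].expires = now + CACHE_TIMEOUT;
			slots[i].used = true;
			break;
		}
	}
	return false;
}
```

In this model, a repeated error is silently dropped until its slot expires,
which is the window during which the kernel keeps re-entering the SEA handler
without making progress.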
>
> GHES invokes ghes_proc_in_irq() to handle the SEA error, which ultimately
> executes memory_failure() to process the page with hardware memory
> corruption. If the same SEA error occurs multiple times in a row, the
> previous handling was incomplete or unable to resolve the fault. In such
> cases, it is more appropriate to return a failure when the same error is
> encountered again, and then proceed to arm64_do_kernel_sea() for further
> processing.
>
> When hardware memory corruption occurs, a memory error interrupt is
> triggered. If the kernel then accesses the corrupted data, the SEA error
> exception handler is triggered as well. Both paths call memory_failure()
> to handle the faulty page.
>
> If a memory error interrupt occurs first, followed by an SEA error
> interrupt, the faulty page is first marked as poisoned by the memory
> error interrupt handler, and the SEA error interrupt handler then sends
> a SIGBUS signal to the process accessing the poisoned page.
>
> However, if the SEA interrupt is reported first, the following exceptional
> scenario occurs:
>
> When a user process directly requests and accesses a page with hardware
> memory corruption via mmap (such as with devmem), the page containing
> this address may still be in a free buddy state in the kernel. At this
> point, the page is marked as "poisoned" when the SEA handler claims it
> via memory_failure(). However, since the process did not obtain the page
> through the kernel's MMU, the kernel cannot send a SIGBUS signal to the
> process, and the memory error interrupt handler does not support sending
> SIGBUS either. As a result, the process keeps accessing the faulty page,
> causing repeated entries into the SEA exception handler and leading to
> an SEA error interrupt storm.
>
> Fix this by returning a failure when the same error is encountered again.
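
The intended effect of the change can be modeled as follows. The function
bodies here are simplified stand-ins for ghes_in_nmi_queue_one_entry() and
do_sea(), written only to show the control flow the patch aims for; they are
not the real kernel implementations.

```c
#include <errno.h>
#include <stdbool.h>

static bool cached;          /* stand-in for ghes_estatus_cached() */
static int fallback_calls;   /* counts fallback (kernel-level) handling */

/* Stand-in for ghes_in_nmi_queue_one_entry() after the patch. */
static int queue_one_entry(void)
{
	if (cached)
		return -ECANCELED;  /* duplicate: report failure to the caller */
	return 0;                   /* first occurrence: queued for processing */
}

/* Stand-in for the SEA path: fall back when firmware-first handling fails. */
static int do_sea(void)
{
	if (queue_one_entry() == 0)
		return 0;           /* handled via GHES */
	fallback_calls++;           /* arm64_do_kernel_sea() would run here */
	return -1;
}
```

With the error return in place, a repeated (cached) error is no longer
silently treated as claimed, so the arch-level fallback gets a chance to
terminate the offending access instead of looping in the SEA handler.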
>
> The following error log shows the problem with the devmem process:
> NOTICE: SEA Handle
> NOTICE: SpsrEl3 = 0x60001000, ELR_EL3 = 0xffffc6ab42671400
> NOTICE: skt[0x0]die[0x0]cluster[0x0]core[0x1]
> NOTICE: EsrEl3 = 0x92000410
> NOTICE: PA is valid: 0x1000093c00
> NOTICE: Hest Set GenericError Data
> [ 1419.542401][ C1] {57}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
> [ 1419.551435][ C1] {57}[Hardware Error]: event severity: recoverable
> [ 1419.557865][ C1] {57}[Hardware Error]: Error 0, type: recoverable
> [ 1419.564295][ C1] {57}[Hardware Error]: section_type: ARM processor error
> [ 1419.571421][ C1] {57}[Hardware Error]: MIDR: 0x0000000000000000
> [ 1419.571434][ C1] {57}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081000100
> [ 1419.586813][ C1] {57}[Hardware Error]: error affinity level: 0
> [ 1419.586821][ C1] {57}[Hardware Error]: running state: 0x1
> [ 1419.602714][ C1] {57}[Hardware Error]: Power State Coordination Interface state: 0
> [ 1419.602724][ C1] {57}[Hardware Error]: Error info structure 0:
> [ 1419.614797][ C1] {57}[Hardware Error]: num errors: 1
> [ 1419.614804][ C1] {57}[Hardware Error]: error_type: 0, cache error
> [ 1419.629226][ C1] {57}[Hardware Error]: error_info: 0x0000000020400014
> [ 1419.629234][ C1] {57}[Hardware Error]: cache level: 1
> [ 1419.642006][ C1] {57}[Hardware Error]: the error has not been corrected
> [ 1419.642013][ C1] {57}[Hardware Error]: physical fault address: 0x0000001000093c00
> [ 1419.654001][ C1] {57}[Hardware Error]: Vendor specific error info has 48 bytes:
> [ 1419.654014][ C1] {57}[Hardware Error]: 00000000: 00000000 00000000 00000000 00000000 ................
> [ 1419.670685][ C1] {57}[Hardware Error]: 00000010: 00000000 00000000 00000000 00000000 ................
> [ 1419.670692][ C1] {57}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
> [ 1419.783606][T54990] Memory failure: 0x1000093: recovery action for free buddy page: Recovered
> [ 1419.919580][ T9955] EDAC MC0: 1 UE Multi-bit ECC on unknown memory (node:0 card:1 module:71 bank:7 row:0 col:0 page:0x1000093 offset:0xc00 grain:1 - APEI location: node:0 card:257 module:71 bank:7 row:0 col:0)
> NOTICE: SEA Handle
> NOTICE: SpsrEl3 = 0x60001000, ELR_EL3 = 0xffffc6ab42671400
> NOTICE: skt[0x0]die[0x0]cluster[0x0]core[0x1]
> NOTICE: EsrEl3 = 0x92000410
> NOTICE: PA is valid: 0x1000093c00
> NOTICE: Hest Set GenericError Data
> NOTICE: SEA Handle
> NOTICE: SpsrEl3 = 0x60001000, ELR_EL3 = 0xffffc6ab42671400
> NOTICE: skt[0x0]die[0x0]cluster[0x0]core[0x1]
> NOTICE: EsrEl3 = 0x92000410
> NOTICE: PA is valid: 0x1000093c00
> NOTICE: Hest Set GenericError Data
> ...
> ... ---> SEA error interrupt storm happens
> ...
> NOTICE: SEA Handle
> NOTICE: SpsrEl3 = 0x60001000, ELR_EL3 = 0xffffc6ab42671400
> NOTICE: skt[0x0]die[0x0]cluster[0x0]core[0x1]
> NOTICE: EsrEl3 = 0x92000410
> NOTICE: PA is valid: 0x1000093c00
> NOTICE: Hest Set GenericError Data
> [ 1429.818080][ T9955] Memory failure: 0x1000093: already hardware poisoned
> [ 1429.825760][ C1] ghes_print_estatus: 1 callbacks suppressed
> [ 1429.825763][ C1] {59}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
> [ 1429.843731][ C1] {59}[Hardware Error]: event severity: recoverable
> [ 1429.861800][ C1] {59}[Hardware Error]: Error 0, type: recoverable
> [ 1429.874658][ C1] {59}[Hardware Error]: section_type: ARM processor error
> [ 1429.887516][ C1] {59}[Hardware Error]: MIDR: 0x0000000000000000
> [ 1429.901159][ C1] {59}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081000100
> [ 1429.901166][ C1] {59}[Hardware Error]: error affinity level: 0
> [ 1429.914896][ C1] {59}[Hardware Error]: running state: 0x1
> [ 1429.914903][ C1] {59}[Hardware Error]: Power State Coordination Interface state: 0
> [ 1429.933319][ C1] {59}[Hardware Error]: Error info structure 0:
> [ 1429.946261][ C1] {59}[Hardware Error]: num errors: 1
> [ 1429.946269][ C1] {59}[Hardware Error]: error_type: 0, cache error
> [ 1429.970847][ C1] {59}[Hardware Error]: error_info: 0x0000000020400014
> [ 1429.970854][ C1] {59}[Hardware Error]: cache level: 1
> [ 1429.988406][ C1] {59}[Hardware Error]: the error has not been corrected
> [ 1430.013419][ C1] {59}[Hardware Error]: physical fault address: 0x0000001000093c00
> [ 1430.013425][ C1] {59}[Hardware Error]: Vendor specific error info has 48 bytes:
> [ 1430.025424][ C1] {59}[Hardware Error]: 00000000: 00000000 00000000 00000000 00000000 ................
> [ 1430.053736][ C1] {59}[Hardware Error]: 00000010: 00000000 00000000 00000000 00000000 ................
> [ 1430.066341][ C1] {59}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
> [ 1430.294255][T54990] Memory failure: 0x1000093: already hardware poisoned
> [ 1430.305518][T54990] 0x1000093: Sending SIGBUS to devmem:54990 due to hardware memory corruption
>
> Signed-off-by: Junhao He <hejunhao3@...artners.com>
> ---
> drivers/acpi/apei/ghes.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 005de10d80c3..eebda39bfc30 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -1343,8 +1343,10 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
> ghes_clear_estatus(ghes, &tmp_header, buf_paddr, fixmap_idx);
>
> /* This error has been reported before, don't process it again. */
> - if (ghes_estatus_cached(estatus))
> + if (ghes_estatus_cached(estatus)) {
> + rc = -ECANCELED;
> goto no_work;
> + }
>
> llist_add(&estatus_node->llnode, &ghes_estatus_llist);
>
> --
This needs a response from the APEI reviewers as per MAINTAINERS, thanks!