Message-ID: <57bd6874-35df-48b0-90d8-45077396b44f@linux.alibaba.com>
Date: Tue, 21 Nov 2023 09:48:28 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: rafael@...nel.org, wangkefeng.wang@...wei.com,
tanxiaofei@...wei.com, mawupeng1@...wei.com, tony.luck@...el.com,
linmiaohe@...wei.com, naoya.horiguchi@....com, james.morse@....com,
gregkh@...uxfoundation.org, will@...nel.org, jarkko@...nel.org
Cc: linux-acpi@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
linux-edac@...r.kernel.org, acpica-devel@...ts.linuxfoundation.org,
stable@...r.kernel.org, x86@...nel.org, justin.he@....com,
ardb@...nel.org, ying.huang@...el.com, ashish.kalra@....com,
baolin.wang@...ux.alibaba.com, bp@...en8.de, tglx@...utronix.de,
mingo@...hat.com, dave.hansen@...ux.intel.com, lenb@...nel.org,
hpa@...or.com, robert.moore@...el.com, lvying6@...wei.com,
xiexiuqi@...wei.com, zhuo.song@...ux.alibaba.com
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work
with proper si_code
Hi, ALL,
Gentle ping.
Best Regards,
Shuai
On 2023/10/7 15:28, Shuai Xue wrote:
> Hi, ALL,
>
> I have rewritten the cover letter with the hope that the maintainer will truly
> understand the necessity of this patch. Both Alibaba and Huawei met the same
> issue in products, and we hope it could be fixed ASAP.
>
> ## Changes Log
>
> changes since v8:
> - remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
> - remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
> - rewrite the return value comments of memory_failure (per Naoya Horiguchi)
>
> changes since v7:
> - rebase to Linux v6.6-rc2 (no code changed)
> - rewrite the cover letter to explain the motivation of this patchset
>
> changes since v6:
> - add a more explicit error message, as suggested by Xiaofei
> - pick up reviewed-by tag from Xiaofei
> - pick up internal reviewed-by tag from Baolin
>
> changes since v5 by addressing comments from Kefeng:
> - document return value of memory_failure()
> - drop redundant comments in call site of memory_failure()
> - make ghes_do_proc void and handle abnormal case within it
> - pick up reviewed-by tag from Kefeng Wang
>
> changes since v4 by addressing comments from Xiaofei:
> - do a force kill only for abnormal sync errors
>
> changes since v3 by addressing comments from Xiaofei:
> - do a force kill for abnormal memory failure error such as invalid PA,
> unexpected severity, OOM, etc
> - pick up tested-by tag from Ma Wupeng
>
> changes since v2 by addressing comments from Naoya:
> - rename mce_task_work to sync_task_work
> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
> - add steps to reproduce this problem in cover letter
>
> changes since v1:
> - identify synchronous events by notify type
> - Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/
>
>
> ## Cover Letter
>
> There are two major types of uncorrected recoverable (UCR) errors:
>
> - Action Required (AR): The error is detected after the processor has
>   already consumed the corrupted memory. The OS is required to take action
>   (for example, offline the failing page or kill the failing thread) to
>   recover from this error.
>
> - Action Optional (AO): The error is detected outside of the processor
>   execution context. Some data in memory is corrupted, but it has not been
>   consumed. The OS may optionally take action to recover from this error.
>
> The main difference between AR and AO errors is that AR errors are synchronous
> events, while AO errors are asynchronous events. Synchronous exceptions, such as
> Machine Check Exception (MCE) on X86 and Synchronous External Abort (SEA) on
> Arm64, are signaled by the hardware when an error is detected and the memory
> access has architecturally been executed.
>
> Currently, both synchronous and asynchronous errors are queued as AO errors and
> handled by a dedicated kernel thread in a work queue on the ARM64 platform. For
> synchronous errors, memory_failure() is synced using a cancel_work_sync trick to
> ensure that the corrupted page is unmapped and poisoned. Upon returning to
> user-space, the process resumes at the current instruction, triggering a page
> fault. As a result, the kernel sends a SIGBUS signal to the current process due
> to VM_FAULT_HWPOISON.
>
> However, this trick is not always effective. This patch set improves the
> recovery process in three specific aspects:
>
> 1. Handle synchronous exceptions with proper si_code
>
> ghes_handle_memory_failure() queues both synchronous and asynchronous errors
> with flag=0. The kernel then notifies the actual user-space process by
> sending a SIGBUS signal from memory_failure() with the wrong si_code:
> BUS_MCEERR_AO instead of BUS_MCEERR_AR. User-space processes rely on the
> si_code to decide how to handle the memory failure.
>
> For example, hwpoison-aware user-space processes use the si_code:
> BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
> for 'action required' synchronous/late notifications. Specifically, when a
> SIGBUS signal with BUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA
> into the guest kernel. In contrast, a SIGBUS signal with BUS_MCEERR_AO will
> be ignored by QEMU.[1]
>
> Fix it by setting memory failure flags as MF_ACTION_REQUIRED on synchronous
> events. (PATCH 1)
>
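The flag selection described in item 1 can be sketched in plain C. This is a minimal user-space model, not the patch itself: the constants are local stand-ins assumed to mirror the kernel's ACPI_HEST_NOTIFY_* and MF_ACTION_REQUIRED definitions, is_hest_sync_notify() takes a bare notify type here rather than the kernel's own argument, memory_failure_flags() is a hypothetical helper, and the model assumes SEA is the only notify type treated as synchronous.

```c
#include <stdbool.h>

/*
 * Local stand-ins for kernel definitions. The values are assumed to
 * mirror the ACPI HEST notification types and the kernel's mf_flags.
 */
enum {
    HEST_NOTIFY_SCI = 3,   /* interrupt-driven report */
    HEST_NOTIFY_NMI = 4,
    HEST_NOTIFY_SEA = 8,   /* Arm64 Synchronous External Abort */
};
#define MF_ACTION_REQUIRED (1 << 1)

/* Assumption: only SEA is classified as a synchronous notification. */
bool is_hest_sync_notify(unsigned int notify_type)
{
    return notify_type == HEST_NOTIFY_SEA;
}

/*
 * Hypothetical helper: choose the flags passed on to memory_failure().
 * Synchronous events get MF_ACTION_REQUIRED, everything else keeps 0.
 */
int memory_failure_flags(unsigned int notify_type)
{
    return is_hest_sync_notify(notify_type) ? MF_ACTION_REQUIRED : 0;
}
```

With MF_ACTION_REQUIRED set, the signal delivered to the consuming process carries BUS_MCEERR_AR rather than BUS_MCEERR_AO, which is the behaviour change item 1 asks for.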
> 2. Handle abnormal memory_failure() failures to avoid an unnecessary reboot
>
> If a process maps the faulty page but memory_failure() returns abnormally
> before try_to_unmap(), for example because the faulty page mapped by the
> process is a KSM page, arm64 cannot rely on the page fault path to terminate
> the synchronous exception loop.[4]
>
> This loop can potentially exceed the platform firmware threshold or even trigger
> a kernel hard lockup, leading to a system reboot. However, the kernel is
> capable of recovering from this error.
>
> Fix it by performing a force kill when memory_failure() fails abnormally or
> when other abnormal synchronous errors occur. These errors can include
> situations such as invalid PA, unexpected severity, no memory failure config
> support, invalid GUID section, OOM, etc. (PATCH 2)
>
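The force-kill decision in item 2 can be sketched as a small predicate. This is an illustrative user-space model under stated assumptions: needs_force_kill() is a hypothetical name, and the treatment of -EHWPOISON (a SIGBUS was already sent) and -EOPNOTSUPP (the event was filtered) as "no further action" is an assumption about the return-value convention of memory_failure(), not something spelled out in this cover letter.

```c
#include <errno.h>
#include <stdbool.h>

/* EHWPOISON is Linux-specific; fall back to the generic value if absent. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/*
 * Hypothetical helper: given the return value of memory_failure() for a
 * synchronous error, decide whether the consuming task must be killed
 * with SIGBUS to break the exception loop.
 */
bool needs_force_kill(int ret)
{
    if (ret == 0)
        return false;   /* page was handled normally */

    /* Assumed: the task was already signalled, or the event was filtered. */
    if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
        return false;

    return true;        /* any other failure: recovery did not happen */
}
```

In the kernel, a true result would correspond to logging the failure and sending SIGBUS to the current task, so that it does not re-execute the faulting access indefinitely.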
> 3. Handle memory_failure() in the current process context that consumed the poison
>
> When synchronous errors occur, memory_failure() assumes that the current
> process context is exactly the one that consumed the poison.
>
> For example, kill_accessing_process() takes the mmap lock of current->mm,
> walks the page tables to find the faulting virtual address, and sends SIGBUS
> to the current process with error info. However, a kworker has no valid mm,
> resulting in a NULL pointer dereference. I have fixed this in [3].
>
> commit 77677cdbc2aa mm,hwpoison: check mm when killing accessing process
>
> Another example is that collect_procs()/kill_procs() walk the task list and
> collect and send SIGBUS only to the task that consumed the poison. But on the
> arm64 platform memory_failure() is queued and handled by a dedicated kernel
> thread, so current is not that task.
>
> Fix it by queuing memory_failure() as a task work which runs in current
> execution context to synchronously send SIGBUS before ret_to_user. (PATCH 2)
>
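The task-work approach in item 3 can be illustrated with a user-space model. This is a sketch under assumptions, not the kernel implementation: struct task, queue_sync_work() and run_task_work() are hypothetical stand-ins for the kernel's task and task_work machinery, the fields of struct sync_task_work are assumed, and the callback only records its arguments where the real one would call memory_failure().

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal stand-in for the kernel's struct callback_head. */
struct callback_head {
    struct callback_head *next;
    void (*func)(struct callback_head *head);
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Assumed layout: the work item carries what memory_failure() needs. */
struct sync_task_work {
    struct callback_head twork;
    unsigned long pfn;
    int flags;
};

/* Hypothetical model of a task with pending work. */
struct task {
    struct callback_head *pending;
};

/* Records the last call so the effect is observable in this model. */
unsigned long handled_pfn;
int handled_flags;

void memory_failure_cb(struct callback_head *twork)
{
    struct sync_task_work *twcb =
        container_of(twork, struct sync_task_work, twork);

    /* The real callback would call memory_failure(pfn, flags) here. */
    handled_pfn = twcb->pfn;
    handled_flags = twcb->flags;
    free(twcb);
}

/* Queue the work on the task that took the exception. */
int queue_sync_work(struct task *task, unsigned long pfn, int flags)
{
    struct sync_task_work *twcb = malloc(sizeof(*twcb));

    if (!twcb)
        return -1;
    twcb->pfn = pfn;
    twcb->flags = flags;
    twcb->twork.func = memory_failure_cb;
    twcb->twork.next = task->pending;
    task->pending = &twcb->twork;
    return 0;
}

/* Run pending work in the task's own context, before "returning to user". */
void run_task_work(struct task *task)
{
    struct callback_head *work = task->pending;

    task->pending = NULL;
    while (work) {
        struct callback_head *next = work->next;

        work->func(work);   /* may free the item, so save next first */
        work = next;
    }
}
```

The point of the design is that the callback runs in the context of the task that consumed the poison, so anything that inspects the current task sees the right process rather than a kworker.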
> ** In summary, this patch set handles synchronous errors in task work with
> proper si_code so that hwpoison-aware processes can recover from errors, and
> fixes (potential) abnormal cases. **
>
> Lv Ying and XiuQi from Huawei also proposed fixes for a similar problem
> [2][4]. Thanks to them for the discussion.
>
> ## Steps to Reproduce This Problem
>
> To reproduce this problem:
>
> # STEP1: enable early kill mode
> #sysctl -w vm.memory_failure_early_kill=1
> vm.memory_failure_early_kill = 1
>
> # STEP2: inject a UCE error and consume it to trigger a synchronous error
> #einj_mem_uc single
> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
> injecting ...
> triggering ...
> signal 7 code 5 addr 0xffffb0d75000
> page not present
> Test passed
>
> The si_code (code 5) from einj_mem_uc indicates a BUS_MCEERR_AO error, which
> is not correct for an error that was consumed synchronously.
>
> After this patch set:
>
> # STEP1: enable early kill mode
> #sysctl -w vm.memory_failure_early_kill=1
> vm.memory_failure_early_kill = 1
>
> # STEP2: inject a UCE error and consume it to trigger a synchronous error
> #einj_mem_uc single
> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
> injecting ...
> triggering ...
> signal 7 code 4 addr 0xffffb0d75000
> page not present
> Test passed
>
> The si_code (code 4) from einj_mem_uc indicates a BUS_MCEERR_AR error, as
> expected.
>
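The "code 4" and "code 5" values in the output above are what a hwpoison-aware process sees in si_code. The following is a minimal sketch of such a process, assuming a Linux libc that exposes BUS_MCEERR_AR and BUS_MCEERR_AO; describe_mce_code() and install_sigbus_handler() are hypothetical helper names, and the handler's policy (exit on anything but AO) is only one possible choice.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Fallbacks for libc headers that lack these; 4 and 5 match the output above. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical helper: map a SIGBUS si_code to a short description. */
const char *describe_mce_code(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AR:
        return "action required";
    case BUS_MCEERR_AO:
        return "action optional";
    default:
        return "not a memory error";
    }
}

/* Uses only async-signal-safe calls. */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
    const char *msg = describe_mce_code(info->si_code);
    ssize_t n;

    (void)sig;
    (void)ucontext;
    n = write(STDERR_FILENO, msg, strlen(msg));
    n = write(STDERR_FILENO, "\n", 1);
    (void)n;

    /*
     * AO is an early notification, so returning is safe. For AR the
     * faulting access would be retried on return, so do not return.
     */
    if (info->si_code != BUS_MCEERR_AO)
        _exit(1);
}

/* Install the handler with SA_SIGINFO so si_code is available. */
int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

A real consumer such as a VMM would replace the exit with its own recovery, for example discarding the poisoned page named by si_addr.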
> [1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/20200512030609.19593-1-gengdongjiu@huawei.com/
> [2] https://lore.kernel.org/lkml/20221205115111.131568-3-lvying6@huawei.com/
> [3] https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
> [4] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/
>
> Shuai Xue (2):
> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
> synchronous events
> ACPI: APEI: handle synchronous exceptions in task work
>
> arch/x86/kernel/cpu/mce/core.c | 9 +--
> drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++-----------
> include/acpi/ghes.h | 3 -
> include/linux/mm.h | 1 -
> mm/memory-failure.c | 22 ++-----
> 5 files changed, 82 insertions(+), 66 deletions(-)
>