[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9e92e600-86a4-4456-9de4-b597854b107c@linux.alibaba.com>
Date: Sat, 25 Nov 2023 14:44:52 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: Borislav Petkov <bp@...en8.de>
Cc: rafael@...nel.org, wangkefeng.wang@...wei.com,
tanxiaofei@...wei.com, mawupeng1@...wei.com, tony.luck@...el.com,
linmiaohe@...wei.com, naoya.horiguchi@....com, james.morse@....com,
gregkh@...uxfoundation.org, will@...nel.org, jarkko@...nel.org,
linux-acpi@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
linux-edac@...r.kernel.org, acpica-devel@...ts.linuxfoundation.org,
stable@...r.kernel.org, x86@...nel.org, justin.he@....com,
ardb@...nel.org, ying.huang@...el.com, ashish.kalra@....com,
baolin.wang@...ux.alibaba.com, tglx@...utronix.de,
mingo@...hat.com, dave.hansen@...ux.intel.com, lenb@...nel.org,
hpa@...or.com, robert.moore@...el.com, lvying6@...wei.com,
xiexiuqi@...wei.com, zhuo.song@...ux.alibaba.com
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work
with proper si_code
On 2023/11/23 23:07, Borislav Petkov wrote:
Hi, Borislav,
Thank you for your reply and advice.
> On Sat, Oct 07, 2023 at 03:28:16PM +0800, Shuai Xue wrote:
>> However, this trick is not always be effective
>
> So far so good.
>
> What's missing here is why "this trick" is not always effective.
>
> Basically to explain what exactly the problem is.
I think the main point is that this trick for AR error is not effective,
because:
- an AR error consumed by current process is deferred to handle in a
dedicated kernel thread, but memory_failure() assumes that it runs in the
current context
- another page fault is not unnecessary, we can send sigbus to current
process in the first Synchronous External Abort SEA on arm64 (analogy
Machine Check Exception on x86)
>
>> For example, hwpoison-aware user-space processes use the si_code:
>> BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
>> for 'action required' synchronous/late notifications. Specifically, when a
>> signal with SIGBUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA to
>> Guest kernel. In contrast, a signal with SIGBUS_MCEERR_AO will be ignored
>> by QEMU.[1]
>>
>> Fix it by seting memory failure flags as MF_ACTION_REQUIRED on synchronous events. (PATCH 1)
>
> So you're fixing qemu by "fixing" the kernel?
>
> This doesn't make any sense.
I just give an example that the user space process *really* relys on the
si_code of signal to handle hardware errors
>
> Make errors which are ACPI_HEST_NOTIFY_SEA type return
> MF_ACTION_REQUIRED so that it *happens* to fix your use case.
>
> Sounds like a lot of nonsense to me.
>
> What is the issue here you're trying to solve?
The SIGBUS si_codes defined in include/uapi/asm-generic/siginfo.h says:
/* hardware memory error consumed on a machine check: action required */
#define BUS_MCEERR_AR 4
/* hardware memory error detected in process but not consumed: action optional*/
#define BUS_MCEERR_AO 5
When a synchronous error is consumed by Guest, the kernel should send a
signal with BUS_MCEERR_AR instead of BUS_MCEERR_AO.
>
>> 2. Handle memory_failure() abnormal fails to avoid a unnecessary reboot
>>
>> If process mapping fault page, but memory_failure() abnormal return before
>> try_to_unmap(), for example, the fault page process mapping is KSM page.
>> In this case, arm64 cannot use the page fault process to terminate the
>> synchronous exception loop.[4]
>>
>> This loop can potentially exceed the platform firmware threshold or even trigger
>> a kernel hard lockup, leading to a system reboot. However, kernel has the
>> capability to recover from this error.
>>
>> Fix it by performing a force kill when memory_failure() abnormal fails or when
>> other abnormal synchronous errors occur.
>
> Just like that?
>
> Without giving the process the opportunity to even save its other data?
Exactly.
>
> So this all is still very confusing, patches definitely need splitting
> and this whole thing needs restraint.
>
> You go and do this: you split *each* issue you're addressing into
> a separate patch and explain it like this:
>
> ---
> 1. Prepare the context for the explanation briefly.
>
> 2. Explain the problem at hand.
>
> 3. "It happens because of <...>"
>
> 4. "Fix it by doing X"
>
> 5. "(Potentially do Y)."
> ---
>
> and each patch explains *exactly* *one* issue, what happens, why it
> happens and just the fix for it and *why* it is needed.
>
> Otherwise, this is unreviewable.
Thank you for your valuable suggestion, I will split the patches and
resubmit a new patch set.
>
> Thx.
>
Best Regards,
Shuai
Powered by blists - more mailing lists