Message-ID: <20231125121059.GAZWHkU27odMLns7TZ@fat_crate.local>
Date: Sat, 25 Nov 2023 13:10:59 +0100
From: Borislav Petkov <bp@...en8.de>
To: Shuai Xue <xueshuai@...ux.alibaba.com>
Cc: rafael@...nel.org, wangkefeng.wang@...wei.com,
tanxiaofei@...wei.com, mawupeng1@...wei.com, tony.luck@...el.com,
linmiaohe@...wei.com, naoya.horiguchi@....com, james.morse@....com,
gregkh@...uxfoundation.org, will@...nel.org, jarkko@...nel.org,
linux-acpi@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
linux-edac@...r.kernel.org, acpica-devel@...ts.linuxfoundation.org,
stable@...r.kernel.org, x86@...nel.org, justin.he@....com,
ardb@...nel.org, ying.huang@...el.com, ashish.kalra@....com,
baolin.wang@...ux.alibaba.com, tglx@...utronix.de,
mingo@...hat.com, dave.hansen@...ux.intel.com, lenb@...nel.org,
hpa@...or.com, robert.moore@...el.com, lvying6@...wei.com,
xiexiuqi@...wei.com, zhuo.song@...ux.alibaba.com
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
> - an AR error consumed by current process is deferred to handle in a
> dedicated kernel thread, but memory_failure() assumes that it runs in the
> current context
On x86? ARM?
Please point to the exact code flow.
> - another page fault is not unnecessary, we can send sigbus to current
> process in the first Synchronous External Abort SEA on arm64 (analogy
> Machine Check Exception on x86)
I have no clue what that means. What page fault?
> I just give an example that the user space process *really* relies on the
> si_code of signal to handle hardware errors
No, don't give examples.
Explain what the exact problem is you're seeing, in your use case, point
to the code and then state how you think it should be fixed and why.
Right now your text is "all over the place" and I have no clue what you
even want.
> The SIGBUS si_codes defined in include/uapi/asm-generic/siginfo.h says:
>
> /* hardware memory error consumed on a machine check: action required */
> #define BUS_MCEERR_AR 4
> /* hardware memory error detected in process but not consumed: action optional*/
> #define BUS_MCEERR_AO 5
>
> When a synchronous error is consumed by Guest, the kernel should send a
> signal with BUS_MCEERR_AR instead of BUS_MCEERR_AO.
Can you drop this "synchronous" bla and concentrate on the error
*severity*?
I think you want to say that there are some types of errors for which
error handling needs to happen immediately and for some reason that
doesn't happen.
Which errors are those? Types?
Why do you need them to be handled immediately?
> Exactly.
No, not exactly. Why is it ok to do that? What are the implications of
this?
Is immediate killing the right decision?
Is this ok for *every* possible kernel running out there - not only for
your use case?
And so on and so on...
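
For readers following the si_code part of this discussion, here is a minimal
sketch of how a user space process might tell the two SIGBUS codes quoted
above apart. It assumes a Linux process that installs its handler with
SA_SIGINFO. The names memerr_kind, classify_sigbus(), sigbus_handler() and
install_sigbus_handler() are hypothetical and exist only for illustration,
and the reactions chosen (exit on AR, log and continue on AO) are one
possible policy, not something this thread prescribes.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Fallbacks for older libcs; values as quoted from
 * include/uapi/asm-generic/siginfo.h in this thread. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical classification, for illustration only. */
enum memerr_kind {
	MEMERR_NONE,            /* SIGBUS not caused by a memory error */
	MEMERR_ACTION_REQUIRED, /* error was consumed by this process */
	MEMERR_ACTION_OPTIONAL  /* error was detected but not consumed */
};

static enum memerr_kind classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return MEMERR_ACTION_REQUIRED;
	case BUS_MCEERR_AO:
		return MEMERR_ACTION_OPTIONAL;
	default:
		return MEMERR_NONE;
	}
}

/* Only async-signal-safe calls (write, _exit) are used in the handler. */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	static const char ar[] = "SIGBUS: action required, giving up\n";
	static const char ao[] = "SIGBUS: action optional, continuing\n";
	ssize_t n;

	(void)sig;
	(void)ucontext;

	switch (classify_sigbus(info->si_code)) {
	case MEMERR_ACTION_REQUIRED:
		/* Assumed policy: the poisoned data was consumed, so stop. */
		n = write(STDERR_FILENO, ar, sizeof(ar) - 1);
		(void)n;
		_exit(1);
	case MEMERR_ACTION_OPTIONAL:
		/* Assumed policy: note it and keep running; info->si_addr
		 * holds the affected address. */
		n = write(STDERR_FILENO, ao, sizeof(ao) - 1);
		(void)n;
		return;
	default:
		_exit(1);
	}
}

static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A program would call install_sigbus_handler() once, early in main(). The
point of the sketch is only that the receiving process cannot make this
distinction unless the kernel delivers the si_code that matches what
actually happened.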
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette