Message-ID: <dee8d758-dd65-4438-8e42-251fb1a305a7@linux.alibaba.com>
Date: Sat, 1 Mar 2025 22:03:13 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: Borislav Petkov <bp@...en8.de>
Cc: tony.luck@...el.com, nao.horiguchi@...il.com, tglx@...utronix.de,
mingo@...hat.com, dave.hansen@...ux.intel.com, x86@...nel.org,
hpa@...or.com, linmiaohe@...wei.com, akpm@...ux-foundation.org,
peterz@...radead.org, jpoimboe@...nel.org, linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
baolin.wang@...ux.alibaba.com, tianruidong@...ux.alibaba.com
Subject: Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
On 2025/3/1 19:10, Borislav Petkov wrote:
> On Sat, Mar 01, 2025 at 02:16:12PM +0800, Shuai Xue wrote:
>> For instance, it does not specify whether the error occurred in the
>> context of IN_KERNEL or IN_KERNEL_RECOV, which are crucial for
>> understanding the error's circumstances.
>
> 1. Crucial for whom? For you? Or for users?
>
> You need to explain how this error message is going to be used. Because simply
> issuing such a message causes a lot of panicked people calling a lot of admins
> to figure out why their machine is broken. Because they see "mce" and think
> "hw broken, need to replace it immediately."
>
> This is one of the reasons we did the cec.c thing - just to save people from
> panicking unnecessarily and causing expensive and useless maintenance calls.
For me, and for cloud providers that maintain millions of servers.
(By the way, CentOS/Red Hat build their kernels without CONFIG_RAS_CEC set, because
it breaks EDAC decoding. We do not use CEC in production at all, for the same
reason.)
>
> 2. This message goes to dmesg which means something needs to parse it, beside
> a human. An AI?
Yes, we collect all kernel messages from the host, parse the logs, and predict
panics with AI tools. The more details we collect, the better the AI model
performs.
>
> 3. Dmesg is a ring buffer which gets overwritten and this message is
> eventually lost
>
> There's a reason why MCEs get logged with the notifiers and through
> a tracepoint - so that agents can act upon them properly.
>
> And we have had this discussion for years now - I'm sorry that you're late to
> the party.
Agreed, a tracepoint is the more elegant way. However, it does not include the
error context, only some hardware registers.
>
>> For the regression cases (copy from user) in Patch 3, an error message
>>
>> "mce: Action required: data load in error recoverable area of kernel"
>
> See above.
>
> Besides, this message is completely useless as it has no concrete info about
> the error and what is being done about it.
I don't think so:
"Action required" means MCI_UC_AR
"data load" means MCACOD_DATA
"recoverable area of kernel" means KERNEL_RECOV
It is more readable and concrete than "Uncorrected hardware memory error", e.g.
the message in kill_me_maybe():
"mce: Uncorrected hardware memory error in user-access at 3b116c400"
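As a rough illustration of how a log parser could use the phrase mapping above, here is a minimal Python sketch. The lookup table layout and the function name decode_mce_line are hypothetical, invented for this example; it is not kernel code, and it only covers the three phrases listed in this message.

```python
import re

# Phrase -> identifier pairs as listed in the message above.
# The table itself and the function name are hypothetical.
PHRASES = {
    "Action required": "MCI_UC_AR",
    "data load": "MCACOD_DATA",
    "recoverable area of kernel": "KERNEL_RECOV",
}

# Match the "mce: " prefix anywhere in a dmesg line and capture the rest.
MCE_LINE = re.compile(r"mce: (?P<body>.+)$")


def decode_mce_line(line):
    """Return the identifiers whose phrases appear in an mce log line.

    Returns None when the line carries no "mce: " prefix, and an empty
    list when the prefix is present but no known phrase matches.
    """
    match = MCE_LINE.search(line)
    if match is None:
        return None
    body = match.group("body")
    return [ident for phrase, ident in PHRASES.items() if phrase in body]


if __name__ == "__main__":
    print(decode_mce_line(
        "mce: Action required: data load in error recoverable area of kernel"
    ))
    print(decode_mce_line(
        "mce: Uncorrected hardware memory error in user-access at 3b116c400"
    ))
```

With this table, the first message yields all three identifiers, while the kill_me_maybe() message matches none of them, which is the difference in detail being argued here.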
>
>> I could add more explanations in next version if you have no objection.
>
> All of the above are objections.
>
> Please go into git history and read why we're avoiding dumping useless
> messages instead of proposing silly patches.
>
Anyway, I respect the maintainer's opinion.
Thanks
Shuai