linux-kernel - Re: [PATCH v2 2/5] x86/mce: dump error msg from severities


Message-ID: <dee8d758-dd65-4438-8e42-251fb1a305a7@linux.alibaba.com>
Date: Sat, 1 Mar 2025 22:03:13 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: Borislav Petkov <bp@...en8.de>
Cc: tony.luck@...el.com, nao.horiguchi@...il.com, tglx@...utronix.de,
 mingo@...hat.com, dave.hansen@...ux.intel.com, x86@...nel.org,
 hpa@...or.com, linmiaohe@...wei.com, akpm@...ux-foundation.org,
 peterz@...radead.org, jpoimboe@...nel.org, linux-edac@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 baolin.wang@...ux.alibaba.com, tianruidong@...ux.alibaba.com
Subject: Re: [PATCH v2 2/5] x86/mce: dump error msg from severities



On 2025/3/1 19:10, Borislav Petkov wrote:
> On Sat, Mar 01, 2025 at 02:16:12PM +0800, Shuai Xue wrote:
>> For instance, it does not specify whether the error occurred in the
>> context of IN_KERNEL or IN_KERNEL_RECOV, which are crucial for
>> understanding the error's circumstances.
> 
> 1. Crucial for whom? For you? Or for users?
> 
> You need to explain how this error message is going to be used. Because simply
> issuing such a message causes a lot of panicked people calling a lot of admins
> to figure out why their machine is broken. Because they see "mce" and think
> "hw broken, need to replace it immediately."
> 
> This is one of the reasons we did the cec.c thing - just to save people from
> panicking unnecessarily and causing expensive and useless maintenance calls.


For me, and for cloud providers that maintain millions of servers.

(By the way, CentOS/Red Hat build their kernels without CONFIG_RAS_CEC set,
because it breaks EDAC decoding. We do not use CEC in production at all, for
the same reason.)

> 
> 2. This message goes to dmesg which means something needs to parse it, beside
>     a human. An AI?

Yes, we collect all kernel messages from the hosts, parse the logs, and predict
panics with AI tools. The more details we collect, the better the AI model
performs.
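
As a rough illustration of the log-parsing step, a minimal Python sketch could
look like the following. The function name, the regular expression, and the
sample log text are hypothetical and are not part of any existing tool; only
the two "mce:" messages are taken from this thread.

```python
import re

# Optional "[  123.456789] " timestamp prefix, then the "mce:" tag that
# both messages quoted in this thread start with.
MCE_LINE = re.compile(r"^(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?mce:\s*(?P<msg>.+)$")


def extract_mce_messages(log_text):
    """Return a list of (timestamp or None, message) for every 'mce:' line."""
    found = []
    for line in log_text.splitlines():
        m = MCE_LINE.match(line.strip())
        if m:
            ts = float(m.group("ts")) if m.group("ts") else None
            found.append((ts, m.group("msg")))
    return found


# Hypothetical sample: the timestamps and the USB line are made up.
sample = """\
[  100.000001] usb 1-1: new high-speed USB device
[  101.250000] mce: Action required: data load in error recoverable area of kernel
mce: Uncorrected hardware memory error in user-access at 3b116c400
"""

for ts, msg in extract_mce_messages(sample):
    print(ts, msg)
```

A collector along these lines would still be subject to the ring-buffer
limitation raised in point 3 below.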

> 
> 3. Dmesg is a ring buffer which gets overwritten and this message is
>     eventually lost
> 
> There's a reason why MCEs get logged with the notifiers and through
> a tracepoint - so that agents can act upon them properly.
> 
> And we have had this discussion for years now - I'm sorry that you're late to
> the party.

Agreed, a tracepoint is the more elegant way. However, it does not include the
error context, only some hardware registers.

> 
>> For the regression cases (copy from user) in Patch 3, an error message
>>
>>      "mce: Action required: data load in error recoverable area of kernel"
> 
> See above.
> 
> Besides, this message is completely useless as it has no concrete info about
> the error and what is being done about it.

I don't think so,

"Action required" means MCI_UC_AR
"data load" means MCACOD_DATA
"recoverable area of kernel" means KERNEL_RECOV

It is more readable and concrete than "Uncorrected hardware memory error", e.g.
the message in kill_me_maybe():

     "mce: Uncorrected hardware memory error in user-access at 3b116c400"

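The phrase-to-name mapping above can be sketched as a small lookup. This is
only an illustration in Python: the dictionary and the function name are
hypothetical, and the three pairs are exactly the ones listed above, not a
complete list of severities.

```python
# Phrase -> name, copied from the three pairs listed in this reply.
PHRASE_TO_NAME = {
    "Action required": "MCI_UC_AR",
    "data load": "MCACOD_DATA",
    "recoverable area of kernel": "KERNEL_RECOV",
}


def decode_severity_message(msg):
    """Return the names whose phrase appears in msg, in table order."""
    return [name for phrase, name in PHRASE_TO_NAME.items() if phrase in msg]


severity_msg = "mce: Action required: data load in error recoverable area of kernel"
generic_msg = "mce: Uncorrected hardware memory error in user-access at 3b116c400"

print(decode_severity_message(severity_msg))
print(decode_severity_message(generic_msg))
```

With this table, the severity message yields all three names, while the
kill_me_maybe() message matches none of the listed phrases.
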
> 
>> I could add more explanations in next version if you have no objection.
> 
> All of the above are objections.
> 
> Please go into git history and read why we're avoiding dumping useless
> messages instead of proposing silly patches.
> 

Anyway, I respect the maintainer's opinion.

Thanks
Shuai