linux-kernel - Re: [PATCH v2 2/5] x86/mce: dump error msg from severities


Message-ID: <7eddced6-bf45-44c8-abbf-7d0d541511ab@linux.alibaba.com>
Date: Sun, 2 Mar 2025 15:14:52 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: Borislav Petkov <bp@...en8.de>, "Luck, Tony" <tony.luck@...el.com>
Cc: nao.horiguchi@...il.com, tglx@...utronix.de, mingo@...hat.com,
 dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
 linmiaohe@...wei.com, akpm@...ux-foundation.org, peterz@...radead.org,
 jpoimboe@...nel.org, linux-edac@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 baolin.wang@...ux.alibaba.com, tianruidong@...ux.alibaba.com
Subject: Re: [PATCH v2 2/5] x86/mce: dump error msg from severities



On 2025/3/2 02:47, Borislav Petkov wrote:
> On Sat, Mar 01, 2025 at 10:03:13PM +0800, Shuai Xue wrote:
>> (By the way, CentOS/Red Hat build kernels without CONFIG_RAS_CEC set, because
>> it breaks EDAC decoding. We do not use CEC in production at all for the same
>> reason.)
> 
> It doesn't "break" error decoding - it collects every correctable DRAM error
> and puts it in a "leaky" bucket of sorts. And when a certain error address
> generates too many errors, it memory_failure()s the page and poisons it.
> 
> You do not use it in production because you want to see every error, collect
> it, massage it and perhaps decide when DIMMs go bad and you can replace
> them... or whatever you do.
> 
> All the others who enable it, and we, can sleep properly, without getting
> unnecessarily upset about a correctable error.
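
For illustration only, the leaky-bucket behavior described above can be
sketched in user-space C. This is not the kernel's CEC code: the table size,
the threshold, and the names ce_record() and ce_decay() are all hypothetical,
and a real implementation would call memory_failure() where this sketch only
returns true.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BUCKET_SLOTS  8  /* hypothetical table size */
#define ACTION_THRESH 3  /* hypothetical: errors per page before offlining */

struct ce_slot {
    uint64_t pfn;        /* page frame that reported correctable errors */
    unsigned int count;  /* errors seen, minus what has leaked away */
    bool used;
};

static struct ce_slot table[BUCKET_SLOTS];

/*
 * Record one correctable error for a page.
 * Returns true when the page crossed the threshold and should be offlined.
 * If the table is full, the error is silently dropped in this sketch.
 */
static bool ce_record(uint64_t pfn)
{
    struct ce_slot *free_slot = NULL;

    for (size_t i = 0; i < BUCKET_SLOTS; i++) {
        if (table[i].used && table[i].pfn == pfn) {
            if (++table[i].count >= ACTION_THRESH) {
                table[i].used = false;  /* caller offlines the page */
                return true;
            }
            return false;
        }
        if (!table[i].used && !free_slot)
            free_slot = &table[i];
    }

    if (free_slot) {
        free_slot->pfn = pfn;
        free_slot->count = 1;
        free_slot->used = true;
    }
    return false;
}

/* The periodic "leak": decay every counter so sparse errors are forgotten. */
static void ce_decay(void)
{
    for (size_t i = 0; i < BUCKET_SLOTS; i++) {
        if (!table[i].used)
            continue;
        if (--table[i].count == 0)
            table[i].used = false;
    }
}
```

Under these assumptions, a page that reports errors faster than ce_decay()
drains them reaches the threshold, while an occasional error on a healthy
page leaks away without any action.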

Yes, we want to see every CE and use the CE pattern (e.g. correctable
error bits) [1][2] to predict whether a row fault is prone to UEs or not.
And we are not upset by CEs, because they have already been corrected by
hardware :)

[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fault-aware-prediction-guide.pdf
[2] https://arxiv.org/html/2312.02855v2

> 
>> Yes, we collect all kernel messages from hosts, parse the logs and predict
>> panics with AI tools. The more details we collect, the better the
>> performance of the AI model.
> 
> LOL.
> 
> We go to the great effort of doing an MCE tracepoint which gives a
> *structured* error record, show an example of how to use it in rasdaemon,
> and you go and do the crazy hard and, at the same time, silly thing and
> parse dmesg?!??!
> 
> This is priceless. Oh boy.
> 
>> Agreed, tracepoint is a more elegant way. However, it does not include error
>> context, just some hardware registers.
> 
> The error context is in the behavior of the hw. If the error is fatal, you
> won't see it - the machine will panic or do something else to prevent error
> propagation. It definitely won't run any software anymore.
> 
> If you see the error getting logged, it means it is not fatal enough to kill
> the machine.

Agreed.

> 
>>> Besides, this message is completely useless as it has no concrete info about
>>> the error and what is being done about it.
>>
>> I don't think so,
> 
> I think so and you're not reading my mail.
> 
>>      "mce: Uncorrected hardware memory error in user-access at 3b116c400"

That is the current message in kill_me_maybe(); it was not added by me.

> 
> Ask yourself: what can you do when you see a message like that?
> 
> Exactly *nothing* because there's not nearly enough information to recover
> from it or log it or whatever. That error message is *totally useless* and
> you're upsetting your users unnecessarily and even if they report it to you,
> you can't help them.
> 

I believe we are approaching this issue from different perspectives.
As a cloud service provider, I need to address the following points:

1. I must be able to explain to end users why the MCE has occurred.
2. It is important to determine whether there are any kernel bugs that could
    compromise the overall stability of the cloud platform.
3. We need to identify and implement potential improvements.

"mce: Uncorrected hardware memory error in user-access at 3b116c400"

tells us *nothing*, but

"mce: Action required: data load in error recoverable area of kernel"

helps.


Thanks for your time.
Shuai