linux-kernel - Re: [PATCH v2 2/5] x86/mce: dump error msg from severities


Message-ID: <89027155-8ca3-46a5-8c3a-e24b903cb3eb@linux.alibaba.com>
Date: Wed, 5 Mar 2025 09:50:13 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: "Luck, Tony" <tony.luck@...el.com>, Borislav Petkov <bp@...en8.de>,
 "Yazen.Ghannam@....com" <yazen.ghannam@....com>
Cc: "nao.horiguchi@...il.com" <nao.horiguchi@...il.com>,
 "tglx@...utronix.de" <tglx@...utronix.de>,
 "mingo@...hat.com" <mingo@...hat.com>,
 "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
 "x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
 "linmiaohe@...wei.com" <linmiaohe@...wei.com>,
 "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
 "peterz@...radead.org" <peterz@...radead.org>,
 "jpoimboe@...nel.org" <jpoimboe@...nel.org>,
 "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "linux-mm@...ck.org" <linux-mm@...ck.org>,
 "baolin.wang@...ux.alibaba.com" <baolin.wang@...ux.alibaba.com>,
 "tianruidong@...ux.alibaba.com" <tianruidong@...ux.alibaba.com>
Subject: Re: [PATCH v2 2/5] x86/mce: dump error msg from severities



On 2025/3/4 00:49, Luck, Tony wrote:
>> The error context is in the behavior of the hw. If the error is fatal, you
>> won't see it - the machine will panic or do something else to prevent error
>> propagation. It definitely won't run any software anymore.
>>
>> If you see the error getting logged, it means it is not fatal enough to kill
>> the machine.
> 
> One place in the fatal case where I would like to see more information is the
> 
>    "Action required: data load in error *UN*recoverable area of kernel"
> 
> [emphasis on the "UN" added].

Do you mean this one?

     MCESEV(
         PANIC, "Data load in unrecoverable area of kernel",
         SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
         KERNEL
        ),
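
For reference, here is a minimal sketch of how the mask/result pair in that rule selects a status value. The bit positions and the MCI_UC_SAR / MCI_ADDR composites follow my reading of arch/x86/include/asm/mce.h and arch/x86/kernel/cpu/mce/severity.c. The helper name status_matches_data_load_rule() is hypothetical, and it deliberately ignores the SER and KERNEL context conditions that the real rule also requires.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCi_STATUS bits, as I understand them from asm/mce.h */
#define MCI_STATUS_OVER   (1ULL << 62)  /* previous error overwritten */
#define MCI_STATUS_UC     (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_MISCV  (1ULL << 59)  /* MISC register valid */
#define MCI_STATUS_ADDRV  (1ULL << 58)  /* ADDR register valid */
#define MCI_STATUS_S      (1ULL << 56)  /* signaled via machine check */
#define MCI_STATUS_AR     (1ULL << 55)  /* action required */
#define MCACOD            0xefffULL     /* MCA error code field */
#define MCACOD_DATA       0x0134ULL     /* data load */

/* Composites used by the severity table */
#define MCI_UC_SAR (MCI_STATUS_UC | MCI_STATUS_S | MCI_STATUS_AR)
#define MCI_ADDR   (MCI_STATUS_ADDRV | MCI_STATUS_MISCV)

/*
 * Hypothetical helper: checks only the MASK(x, y) part of the rule,
 * i.e. (status & x) == y. OVER is in the mask but not in the result,
 * so a status with the overflow bit set does not match.
 */
bool status_matches_data_load_rule(uint64_t status)
{
    const uint64_t mask = MCI_STATUS_OVER | MCI_UC_SAR | MCI_ADDR | MCACOD;
    const uint64_t result = MCI_UC_SAR | MCI_ADDR | MCACOD_DATA;

    return (status & mask) == result;
}
```

With the Bank 1 status from the panic log further down (0xbd80000000100134), this check passes, which is consistent with the "Data load in unrecoverable area of kernel" message.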


> 
> case.  We have a few places where the kernel does recover. And most places
> we crash. Our code for the recoverable cases is fragile. Most of this series is
> about repairing regressions where we used to recover from places where kernel
> is doing get_user() or copy_from_user() which can be recovered if those places
> get an error return and the kernel kills the process instead of crashing.

I couldn't agree more.


> A long time ago I posted some patches to include a stack trace for this type
> of crash. It didn't make it into the kernel, and I got distracted by other things.
> 
> If we had that, it would have been easier to diagnose this regression (Shuai
> Xue would have seen crashes with a stack trace pointing to code that used
> to recover in older kernels). Folks with big clusters would also be able to
> point out other places where the kernel crashes often enough that additional
> EXTABLE recovery paths would be worth investigating.

Agreed, a stack trace would be helpful for debugging unrecoverable cases.
The current panic message is below:

[ 1879.726794] mce: [Hardware Error]: CPU 178: Machine Check Exception: f Bank 1: bd80000000100134
[ 1879.726798] mce: [Hardware Error]: RIP 10:<ffffffff981d7af3> {futex_wait_setup+0x83/0xf0}
[ 1879.726807] mce: [Hardware Error]: TSC 49a1e6001c1 ADDR 80f7ada400 MISC 86 PPIN fc6b80e0ba9d616
[ 1879.726809] mce: [Hardware Error]: PROCESSOR 0:806f4 TIME 1741091252 SOCKET 1 APIC c5 microcode 2b000571
[ 1879.726811] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1879.726813] mce: [Hardware Error]: Machine check events logged
[ 1879.727166] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[ 1879.727168] Kernel panic - not syncing: Fatal local machine check
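
To make the Bank 1 status in that log easier to read, below is a rough sketch that names the architectural flag bits and extracts the low 16-bit MCA error code. The bit layout is the Intel MCi_STATUS layout as I understand it (S and AR are only meaningful when software error recovery is supported). The function mci_status_describe() and the flag table are hypothetical and not part of the kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct mci_flag {
    uint64_t bit;
    const char *name;
};

/* Flag bits 63..55 of MCi_STATUS (assumed Intel layout with SER support) */
static const struct mci_flag mci_flags[] = {
    { 1ULL << 63, "VAL" },   { 1ULL << 62, "OVER" },  { 1ULL << 61, "UC" },
    { 1ULL << 60, "EN" },    { 1ULL << 59, "MISCV" }, { 1ULL << 58, "ADDRV" },
    { 1ULL << 57, "PCC" },   { 1ULL << 56, "S" },     { 1ULL << 55, "AR" },
};

/*
 * Hypothetical helper: writes the names of the set flags into buf,
 * separated by spaces, and returns the MCA error code (bits 15:0).
 * Output is truncated if buf is too small.
 */
uint16_t mci_status_describe(uint64_t status, char *buf, size_t len)
{
    size_t used = 0;

    if (len)
        buf[0] = '\0';

    for (size_t i = 0; i < sizeof(mci_flags) / sizeof(mci_flags[0]); i++) {
        if (!(status & mci_flags[i].bit))
            continue;

        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", mci_flags[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;  /* error or out of space */
        used += (size_t)n;
    }

    return (uint16_t)(status & 0xffff);
}
```

For 0xbd80000000100134 this yields "VAL UC EN MISCV ADDRV S AR" with error code 0x0134: an uncorrected, action-required data load with valid ADDR and MISC, and with neither OVER nor PCC set.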


It only provides a RIP, and I spent a lot of time figuring out the root cause of
why get_user() and copy_from_user() fail in the upstream kernel.

> 
> So:
> 
> 1) We need to fix the regressions. That just needs new commit messages
> for these patches that explain the issue better.

I will polish the commit messages.

> 
> 2) I'd like to see a patch for a stack trace for the unrecoverable case.

Could you provide a link to your previous patch?

> 
> 3) I don't see much value in a message that reports the recoverable case.
> 

Got it.

Thanks
Shuai