Message-ID: <89027155-8ca3-46a5-8c3a-e24b903cb3eb@linux.alibaba.com>
Date: Wed, 5 Mar 2025 09:50:13 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: "Luck, Tony" <tony.luck@...el.com>, Borislav Petkov <bp@...en8.de>,
"Yazen.Ghannam@....com" <yazen.ghannam@....com>
Cc: "nao.horiguchi@...il.com" <nao.horiguchi@...il.com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
"linmiaohe@...wei.com" <linmiaohe@...wei.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"peterz@...radead.org" <peterz@...radead.org>,
"jpoimboe@...nel.org" <jpoimboe@...nel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"baolin.wang@...ux.alibaba.com" <baolin.wang@...ux.alibaba.com>,
"tianruidong@...ux.alibaba.com" <tianruidong@...ux.alibaba.com>
Subject: Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
On 2025/3/4 00:49, Luck, Tony wrote:
>> The error context is in the behavior of the hw. If the error is fatal, you
>> won't see it - the machine will panic or do something else to prevent error
>> propagation. It definitely won't run any software anymore.
>>
>> If you see the error getting logged, it means it is not fatal enough to kill
>> the machine.
>
> One place in the fatal case where I would like to see more information is the
>
> "Action required: data load in error *UN*recoverable area of kernel"
>
> [emphasis on the "UN" added].
Do you mean this one?
	MCESEV(
		PANIC, "Data load in unrecoverable area of kernel",
		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
		KERNEL
		),
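
To make the MASK() pair in that entry concrete, here is a minimal userspace sketch of how I read the match: a status word hits the rule when (status & mask) == result. The bit definitions are restated from memory of the Intel SDM and arch/x86/include/asm/mce.h and should be checked against the tree. struct sev_rule, data_load_rule and sev_rule_matches() are hypothetical names for illustration, and the real severity table also checks the context (KERNEL), SER support and other fields that are left out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCi_STATUS bits (assumed to match the kernel's definitions). */
#define MCI_STATUS_OVER   (1ULL << 62)  /* previous error was overwritten */
#define MCI_STATUS_UC     (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_MISCV  (1ULL << 59)  /* MISC register is valid */
#define MCI_STATUS_ADDRV  (1ULL << 58)  /* ADDR register is valid */
#define MCI_STATUS_S      (1ULL << 56)  /* signaled via machine check */
#define MCI_STATUS_AR     (1ULL << 55)  /* action required */

/* Helper combinations as used by the quoted MCESEV entry. */
#define MCI_UC_SAR  (MCI_STATUS_UC | MCI_STATUS_S | MCI_STATUS_AR)
#define MCI_ADDR    (MCI_STATUS_ADDRV | MCI_STATUS_MISCV)
#define MCACOD      0xefffULL           /* MCA error code field */
#define MCACOD_DATA 0x0134ULL           /* data load */

/* Hypothetical, stripped-down stand-in for one severity table entry. */
struct sev_rule {
	uint64_t mask;    /* which status bits the rule looks at */
	uint64_t result;  /* the values those bits must have */
	const char *msg;
};

static const struct sev_rule data_load_rule = {
	.mask   = MCI_STATUS_OVER | MCI_UC_SAR | MCI_ADDR | MCACOD,
	.result = MCI_UC_SAR | MCI_ADDR | MCACOD_DATA,
	.msg    = "Data load in unrecoverable area of kernel",
};

/* A rule matches when every bit selected by mask has the required value. */
static bool sev_rule_matches(const struct sev_rule *r, uint64_t status)
{
	return (status & r->mask) == r->result;
}
```

With the status word from the panic log further down, sev_rule_matches(&data_load_rule, 0xbd80000000100134ULL) is true: OVER is clear, UC/S/AR and ADDRV/MISCV are set, and the error code field is 0x0134. Setting OVER, or changing the error code to anything other than a data load, makes the rule miss.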
>
> case. We have a few places where the kernel does recover. And most places
> we crash. Our code for the recoverable cases is fragile. Most of this series is
> about repairing regressions where we used to recover from places where kernel
> is doing get_user() or copy_from_user() which can be recovered if those places
> get an error return and the kernel kills the process instead of crashing.
I couldn't agree more.
> A long time ago I posted some patches to include a stack trace for this type
> of crash. It didn't make it into the kernel, and I got distracted by other things.
>
> If we had that, it would have been easier to diagnose this regression (Shuai
> Xue would have seen crashes with a stack trace pointing to code that used
> to recover in older kernels). Folks with big clusters would also be able to
> point out other places where the kernel crashes often enough that additional
> EXTABLE recovery paths would be worth investigating.
Agreed, a stack trace will be helpful for debugging unrecoverable cases.
The current panic message is below:
[ 1879.726794] mce: [Hardware Error]: CPU 178: Machine Check Exception: f Bank 1: bd80000000100134
[ 1879.726798] mce: [Hardware Error]: RIP 10:<ffffffff981d7af3> {futex_wait_setup+0x83/0xf0}
[ 1879.726807] mce: [Hardware Error]: TSC 49a1e6001c1 ADDR 80f7ada400 MISC 86 PPIN fc6b80e0ba9d616
[ 1879.726809] mce: [Hardware Error]: PROCESSOR 0:806f4 TIME 1741091252 SOCKET 1 APIC c5 microcode 2b000571
[ 1879.726811] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1879.726813] mce: [Hardware Error]: Machine check events logged
[ 1879.727166] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[ 1879.727168] Kernel panic - not syncing: Fatal local machine check
It only provides a RIP, and I spent a lot of time figuring out the root cause of
why get_user() and copy_from_user() fail in the upstream kernel.
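
For reference, the Bank 1 status word in the log above (bd80000000100134) can be decoded by hand. The following is a small userspace sketch, not kernel code: the bit positions are taken from the Intel SDM description of IA32_MCi_STATUS as I remember it, and status_flags, format_status_flags(), mca_error_code() and model_error_code() are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Named flag bits of IA32_MCi_STATUS (assumed bit positions, highest first).
 * S and AR are only architectural when software error recovery is supported. */
struct status_flag {
	int bit;
	const char *name;
};

static const struct status_flag status_flags[] = {
	{ 63, "VAL" }, { 62, "OVER" }, { 61, "UC" },  { 60, "EN" },
	{ 59, "MISCV" }, { 58, "ADDRV" }, { 57, "PCC" },
	{ 56, "S" }, { 55, "AR" },
};

/* Write the names of all set flags into buf, separated by spaces.
 * Returns the number of characters written; stops early if buf is too small. */
static size_t format_status_flags(uint64_t status, char *buf, size_t len)
{
	size_t used = 0;

	if (len)
		buf[0] = '\0';
	for (size_t i = 0; i < sizeof(status_flags) / sizeof(status_flags[0]); i++) {
		if (!(status & (1ULL << status_flags[i].bit)))
			continue;
		int n = snprintf(buf + used, len - used, "%s%s",
				 used ? " " : "", status_flags[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;
		used += (size_t)n;
	}
	return used;
}

/* Bits 15:0: architectural MCA error code. */
static unsigned int mca_error_code(uint64_t status)
{
	return (unsigned int)(status & 0xffff);
}

/* Bits 31:16: model-specific error code. */
static unsigned int model_error_code(uint64_t status)
{
	return (unsigned int)((status >> 16) & 0xffff);
}
```

For 0xbd80000000100134 this gives the flags "VAL UC EN MISCV ADDRV S AR", an MCA error code of 0x0134 (a data load) and a model-specific code of 0x0010. OVER and PCC are clear, which is why this status lands on the "Data load in unrecoverable area of kernel" entry rather than on one of the overflow or context-corrupt entries.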
>
> So:
>
> 1) We need to fix the regressions. That just needs new commit messages
> for these patches that explain the issue better.
I will polish the commit messages.
>
> 2) I'd like to see a patch for a stack trace for the unrecoverable case.
Could you share a link to your previous patch?
>
> 3) I don't see much value in a message that reports the recoverable case.
>
Got it.
Thanks
Shuai