Message-ID: <Y2AVmOdEtTl5e68l@zn.tnic>
Date: Mon, 31 Oct 2022 19:36:08 +0100
From: Borislav Petkov <bp@...en8.de>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: Yazen Ghannam <yazen.ghannam@....com>,
Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>,
Carlos Bilbao <carlos.bilbao@....com>,
"x86@...nel.org" <x86@...nel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] x86/mce: Dump the stack for recoverable machine
checks in kernel context
On Mon, Oct 31, 2022 at 05:13:10PM +0000, Luck, Tony wrote:
> > 1. If the error has raised a MCE, then we will dump stack anyway.
>
> I don't see stack dumps for machine check panics. I don't have any non-standard
> settings (I think). Nor do I see them in the panic messages that other folks send
> to me.
>
> Are you setting some CONFIG or command line option to get a stack dump?
Well, if one were sane, one would assume that one would expect to see a
stack dump when the machine panics, right? I mean, it is only fair...
And there's an attempt:
#ifdef CONFIG_DEBUG_BUGVERBOSE
	/*
	 * Avoid nested stack-dumping if a panic occurs during oops processing
	 */
	if (!test_taint(TAINT_DIE) && oops_in_progress <= 1)
		dump_stack();
#endif
but that oops_in_progress thing is stopping us:
[ 13.706764] mce: [Hardware Error]: CPU 2: Machine Check Exception: 6 Bank 4: fe000010000b0c0f
[ 13.706781] mce: [Hardware Error]: RIP 10:<ffffffff8103bbcb> {trigger_mce+0xb/0x10}
[ 13.706791] mce: [Hardware Error]: TSC c83826d14 ADDR e1101add1e550012 MISC cafebeef
[ 13.706795] mce: [Hardware Error]: PROCESSOR 2:a00f11 TIME 1667244167 SOCKET 0 APIC 2 microcode 1000065
[ 13.706809] mce: [Hardware Error]: Machine check: Processor Context Corrupt
[ 13.706810] panic: on entry: oops_in_progress: 1
[ 13.706812] panic: before bust_spinlocks oops_in_progress: 1
[ 13.706813] Kernel panic - not syncing: Fatal local machine check
[ 13.706814] panic: taint: 0, oops_in_progress: 2
[ 13.707133] Kernel Offset: disabled
as panic() is being entered with oops_in_progress already set to 1. That
oops_in_progress thing looks like it is being used for console unblanking.
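To make the counting concrete, here is a rough userspace sketch of what the
debug output above suggests. It assumes that bust_spinlocks(1) bumps
oops_in_progress by one before the check is reached, which is what the
1 -> 2 transition in the log looks like. The model_* names and the
taint_die flag are made up for illustration; this is not the kernel code.

```c
#include <stdbool.h>

/* Toy stand-ins for kernel state; hypothetical, not the kernel's definitions. */
static int oops_in_progress;
static bool taint_die;

/* Assumed behaviour: count up when called with 1, down when called with 0. */
static void model_bust_spinlocks(int yes)
{
	if (yes)
		++oops_in_progress;
	else
		--oops_in_progress;
}

/*
 * Returns true if the modelled panic() path would reach dump_stack(),
 * given the value oops_in_progress had when panic() was entered.
 */
static bool model_panic_dumps_stack(int oops_on_entry)
{
	oops_in_progress = oops_on_entry;
	taint_die = false;		/* matches "taint: 0" in the log */

	model_bust_spinlocks(1);	/* 0 -> 1, or 1 -> 2 as in the log */

	/* Same condition as the CONFIG_DEBUG_BUGVERBOSE snippet above. */
	return !taint_die && oops_in_progress <= 1;
}
```

With an entry value of 0 the condition holds and the stack would be dumped;
with an entry value of 1, as in the MCE case here, the counter is already 2
by the time of the check and the dump is skipped.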
Looking at
026ee1f66aaa ("panic: fix stack dump print on direct call to panic()")
it hints that panic() might've been called twice for oops_in_progress to
be already 1 on entry.
I guess we need to figure out why that is...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette