Message-ID: <SJ1PR11MB6083564E9626FFB4681CA3B7FC379@SJ1PR11MB6083.namprd11.prod.outlook.com>
Date: Mon, 31 Oct 2022 19:20:38 +0000
From: "Luck, Tony" <tony.luck@...el.com>
To: Borislav Petkov <bp@...en8.de>
CC: Yazen Ghannam <yazen.ghannam@....com>,
Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>,
Carlos Bilbao <carlos.bilbao@....com>,
"x86@...nel.org" <x86@...nel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH 2/2] x86/mce: Dump the stack for recoverable machine
checks in kernel context
> Well, if one were sane, one would assume that one would expect to see a
> stack dump when the machine panics, right? I mean, it is only fair...
Stack dump from a machine check wasn't at all useful until h/w and Linux started
supporting recoverable machine checks. The stack dump is there to help diagnose
and fix s/w problems. But a machine check isn't a software problem.
So I was pretty happy with the status quo of not getting a stack dump from
a machine check panic.
With recoverable machine checks there are some cases where there might
be an opportunity to change the kernel to avoid a crash. See my patches that
akpm just took into the "mm" tree to recover when the kernel hits poison during
a copy-on-write:
https://lore.kernel.org/all/20221021200120.175753-1-tony.luck@intel.com/
or the patches from Google to recover when khugepaged hits poison:
https://lore.kernel.org/linux-mm/20221010160142.1087120-1-jiaqiyan@google.com/
To identify additional opportunities to make the kernel more resilient, it would be useful
to get a kernel stack trace in the specific case of a recoverable data consumption
machine check while executing in the kernel.
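As a rough illustration of the narrow condition described above, here is a user-space sketch. It is not code from the patch: the enum, struct, and function names are all hypothetical and do not correspond to actual kernel identifiers. It only models the decision "dump the stack when the machine check is recoverable, is a data consumption error, and was taken in kernel context".

```c
#include <stdbool.h>

/* Hypothetical classification of a machine check; not the kernel's own types. */
enum mce_outcome { MCE_FATAL, MCE_RECOVERABLE };
enum mce_context { CTX_USER, CTX_KERNEL };

/* Hypothetical summary of one machine check event. */
struct mce_event {
	enum mce_outcome outcome;	/* could execution in principle continue? */
	enum mce_context context;	/* where the CPU was executing when it hit */
	bool data_consumption;		/* poison was consumed by a load, not just found */
};

/*
 * Return true only for the one case where a kernel stack trace points at
 * code that could be changed to recover: a recoverable data consumption
 * machine check taken while executing in the kernel.
 */
static bool should_dump_stack(const struct mce_event *e)
{
	return e->outcome == MCE_RECOVERABLE &&
	       e->context == CTX_KERNEL &&
	       e->data_consumption;
}
```

Under this model a fatal machine check, or one taken in user context, produces no stack dump, which keeps the status quo for every case where the trace would not point at fixable software.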
> And there's an attempt:
>
> #ifdef CONFIG_DEBUG_BUGVERBOSE
> 	/*
> 	 * Avoid nested stack-dumping if a panic occurs during oops processing
> 	 */
> 	if (!test_taint(TAINT_DIE) && oops_in_progress <= 1)
> 		dump_stack();
> #endif
>
> but that oops_in_progress thing is stopping us:
...
> it hints that panic() might've been called twice for oops_in_progress to
> be already 1 on entry.
>
> I guess we need to figure out why that is...
It might be interesting, but a distraction from the goal of my patch to only
dump the stack for recoverable machine checks in kernel code.
-Tony