linux-kernel - RE: [PATCH 2/2] x86/mce: Dump the stack for recoverable machine checks in kernel context

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <SJ1PR11MB6083564E9626FFB4681CA3B7FC379@SJ1PR11MB6083.namprd11.prod.outlook.com>
Date:   Mon, 31 Oct 2022 19:20:38 +0000
From:   "Luck, Tony" <tony.luck@...el.com>
To:     Borislav Petkov <bp@...en8.de>
CC:     Yazen Ghannam <yazen.ghannam@....com>,
        Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>,
        Carlos Bilbao <carlos.bilbao@....com>,
        "x86@...nel.org" <x86@...nel.org>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH 2/2] x86/mce: Dump the stack for recoverable machine
 checks in kernel context

> Well, if one were sane, one would assume that one would expect to see a
> stack dump when the machine panics, right? I mean, it is only fair...

Stack dump from a machine check wasn't at all useful until h/w and Linux started
supporting recoverable machine checks. The stack dump is there to help diagnose
and fix s/w problems. But a machine check isn't a software problem.

So I was pretty happy with the status quo of not getting a stack dump from
a machine check panic.

With recoverable machine checks there are some cases where there might
be an opportunity to change the kernel to avoid a crash. See my patches that
akpm just took into the "mm" tree to recover when the kernel hits poison during
a copy-on-write:

https://lore.kernel.org/all/20221021200120.175753-1-tony.luck@intel.com/

or the patches from Google to recover when khugepaged hits poison:

https://lore.kernel.org/linux-mm/20221010160142.1087120-1-jiaqiyan@google.com/

To identify additional opportunities to make the kernel more resilient, it would be useful
to get a kernel stack trace in the specific case of a recoverable data consumption
machine check while executing in the kernel.

> And there's an attempt:
>
> #ifdef CONFIG_DEBUG_BUGVERBOSE
>         /*
>          * Avoid nested stack-dumping if a panic occurs during oops processing
>          */
>         if (!test_taint(TAINT_DIE) && oops_in_progress <= 1)
>                 dump_stack();
> #endif
>
> but that oops_in_progress thing is stopping us: 

...

> it hints that panic() might've been called twice for oops_in_progress to
> be already 1 on entry.
>
> I guess we need to figure out why that is...

It might be interesting, but a distraction from the goal of my patch to only
dump the stack for recoverable machine checks in kernel code.

-Tony