linux-kernel - Re: [PATCH 2/2] x86/mce: Dump the stack for recoverable machine checks in kernel context

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Y2AVmOdEtTl5e68l@zn.tnic>
Date:   Mon, 31 Oct 2022 19:36:08 +0100
From:   Borislav Petkov <bp@...en8.de>
To:     "Luck, Tony" <tony.luck@...el.com>
Cc:     Yazen Ghannam <yazen.ghannam@....com>,
        Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>,
        Carlos Bilbao <carlos.bilbao@....com>,
        "x86@...nel.org" <x86@...nel.org>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] x86/mce: Dump the stack for recoverable machine
 checks in kernel context

On Mon, Oct 31, 2022 at 05:13:10PM +0000, Luck, Tony wrote:
> > 1. If the error has raised a MCE, then we will dump stack anyway.
> 
> I don't see stack dumps for machine check panics. I don't have any non-standard
> settings (I think). Nor do I see them in the panic messages that other folks send
> to me.
> 
> Are you settting some CONFIG or command line option to get a stack dump?

Well, if one were sane, one would assume that one would expect to see a
stack dump when the machine panics, right? I mean, it is only fair...

And there's an attempt:

#ifdef CONFIG_DEBUG_BUGVERBOSE 
        /*
         * Avoid nested stack-dumping if a panic occurs during oops processing
         */
        if (!test_taint(TAINT_DIE) && oops_in_progress <= 1)
                dump_stack();
#endif

but that oops_in_progress thing is stopping us:

[   13.706764] mce: [Hardware Error]: CPU 2: Machine Check Exception: 6 Bank 4: fe000010000b0c0f
[   13.706781] mce: [Hardware Error]: RIP 10:<ffffffff8103bbcb> {trigger_mce+0xb/0x10}
[   13.706791] mce: [Hardware Error]: TSC c83826d14 ADDR e1101add1e550012 MISC cafebeef 
[   13.706795] mce: [Hardware Error]: PROCESSOR 2:a00f11 TIME 1667244167 SOCKET 0 APIC 2 microcode 1000065
[   13.706809] mce: [Hardware Error]: Machine check: Processor Context Corrupt
[   13.706810] panic: on entry: oops_in_progress: 1
[   13.706812] panic: before bust_spinlocks oops_in_progress: 1
[   13.706813] Kernel panic - not syncing: Fatal local machine check
[   13.706814] panic: taint: 0, oops_in_progress: 2
[   13.707133] Kernel Offset: disabled

as panic() is being entered with oops_in_progress already set to 1. That
oops_in_progress thing looks like is being used for console unblanking.

Looking at

  026ee1f66aaa ("panic: fix stack dump print on direct call to panic()")

it hints that panic() might've been called twice for oops_in_progress to
be already 1 on entry.

I guess we need to figure out why that is...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette