Date:	Wed, 19 Nov 2014 11:29:54 +0100
From:	Borislav Petkov <bp@...en8.de>
To:	ruiv.wang@...il.com
Cc:	linux-kernel@...r.kernel.org, tony.luck@...el.com,
	gong.chen@...ux.intel.com, rui.y.wang@...el.com
Subject: Re: [PATCH v3] x86/mce: Try printing all machine check banks known
 before panic

On Wed, Nov 19, 2014 at 05:22:41PM +0800, ruiv.wang@...il.com wrote:
> From: Rui Wang <rui.y.wang@...el.com>
> 
> There are cases when a machine check panics without giving any information
> about the error:
> 
> [  177.806166] Kernel panic - not syncing: Machine check from unknown source
> 
> No information besides that it is a machine check. This happens in two cases:
> 1) The CPU logs the error with the MCi_STATUS.EN bit set to zero, and Linux
>    ignores EN=0 entries (as it should).

Well, I guess we shouldn't anymore. Apparently hw forgets to set the
bit when raising an MCE, so we should ignore the bit in mce-severity
too: either delete the rule below or grade such entries as a higher
severity based on, I dunno, b0rked hardware family/model/stepping or
whatever bit we set...

        MCESEV(
                NO, "Not enabled",
                BITCLR(MCI_STATUS_EN)
                ),
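
To illustrate why that rule hides the error, here is a rough user-space
sketch of a first-match severity table. It is not the kernel's
mce-severity.c: the sketch_* names, the reduced rule list and the
skip_en_rule quirk flag are all made up for illustration. Only the
MCi_STATUS bit positions and the BITCLR/BITSET idea are taken from the
real code.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCi_STATUS bits (architectural positions). */
#define MCI_STATUS_VAL (1ULL << 63)  /* register holds a valid error */
#define MCI_STATUS_UC  (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN  (1ULL << 60)  /* error reporting was enabled */
#define MCI_STATUS_PCC (1ULL << 57)  /* processor context corrupt */

/* Hypothetical, reduced severity scale (the kernel has more grades). */
enum sketch_sev { SKETCH_NO, SKETCH_KEEP, SKETCH_PANIC };

struct sketch_rule {
	uint64_t mask, result;  /* rule matches if (status & mask) == result */
	enum sketch_sev sev;
	const char *msg;
	bool en_rule;           /* hypothetical: marks the "Not enabled" rule */
};

#define BITCLR(x) .mask = (x), .result = 0
#define BITSET(x) .mask = (x), .result = (x)

/* First match wins; the last entry matches everything. */
static const struct sketch_rule sketch_rules[] = {
	{ .sev = SKETCH_NO,    .msg = "Invalid",     BITCLR(MCI_STATUS_VAL) },
	{ .sev = SKETCH_NO,    .msg = "Not enabled", BITCLR(MCI_STATUS_EN),
	  .en_rule = true },
	{ .sev = SKETCH_PANIC, .msg = "Processor context corrupt",
	  BITSET(MCI_STATUS_PCC) },
	{ .sev = SKETCH_KEEP,  .msg = "Default",     .mask = 0, .result = 0 },
};

/*
 * Grade one bank's status. skip_en_rule models a hypothetical quirk for
 * hardware that raises #MC without setting EN.
 */
static const struct sketch_rule *sketch_grade(uint64_t status, bool skip_en_rule)
{
	const struct sketch_rule *r = sketch_rules;

	for (;; r++) {
		if (skip_en_rule && r->en_rule)
			continue;
		if ((status & r->mask) == r->result)
			return r;
	}
}
```

With this table, a status of VAL|UC|PCC with EN clear is graded
SKETCH_NO ("Not enabled") and never reaches the PCC rule. The same
status with skip_en_rule set is graded SKETCH_PANIC, which is the kind
of change I mean.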

> 2) In normal processing the MCE handler ignores banks that do not contain fatal
>    or unrecoverable errors (these would later be found and logged by the CMCI
>    handler). If we panic, these will never be logged, but could be important
>    to diagnose the problem.

Well, we do this:

                /*
                 * Non uncorrected or non signaled errors are handled by
                 * machine_check_poll. Leave them alone, unless this panics.
                 */
                if (!(m.status & (cfg->ser ? MCI_STATUS_S : MCI_STATUS_UC)) &&
                        !no_way_out)
                        continue;

so no_way_out gets indirectly controlled by mce-severity too. So I guess
mce-severity would need adjusting instead of adding more stuff to the #MC
handler.
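
The bank filter quoted above can be pulled out into a small predicate
to make the dependency on no_way_out explicit. This is only a sketch:
sketch_skip_bank() is a made-up name, and the ser and no_way_out
parameters stand in for cfg->ser and the handler's local variable.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCI_STATUS_UC (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_S  (1ULL << 56)  /* signaled (software error recovery) */

/*
 * Returns true if the #MC handler would skip this bank and leave it to
 * the poller. A bank without the relevant bit is skipped only while
 * no_way_out is clear; once we are going to panic, every bank is kept.
 */
static bool sketch_skip_bank(uint64_t status, bool ser, bool no_way_out)
{
	uint64_t relevant = ser ? MCI_STATUS_S : MCI_STATUS_UC;

	return !(status & relevant) && !no_way_out;
}
```

So a corrected error (neither UC nor S set) is skipped when no_way_out
is 0 and kept when it is 1. Since no_way_out is derived from the grading
of the banks, getting the grading right is what makes those banks show
up before the panic.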

Btw, the panic message comes from

        /*
         * No machine check event found. Must be some external
         * source or one CPU is hung. Panic.
         */
        if (global_worst <= MCE_KEEP_SEVERITY && mca_cfg.tolerant < 3)
                mce_panic("Machine check from unknown source", NULL, NULL);

so fixing mce_severity is what should happen here instead, IMO.
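
A toy model of that check, to show how the grading feeds into it. The
sketch_* names and the shortened severity enum are assumptions, and the
per-CPU bookkeeping that produces global_worst is collapsed into one
array of grades.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, shortened severity scale, ordered from least to worst. */
enum sketch_severity {
	SKETCH_NO_SEVERITY,
	SKETCH_KEEP_SEVERITY,
	SKETCH_SOME_SEVERITY,
	SKETCH_PANIC_SEVERITY,
};

/* Worst grade seen across all entries (stand-in for global_worst). */
static enum sketch_severity sketch_worst(const enum sketch_severity *sev, size_t n)
{
	enum sketch_severity worst = SKETCH_NO_SEVERITY;

	for (size_t i = 0; i < n; i++)
		if (sev[i] > worst)
			worst = sev[i];
	return worst;
}

/* True if we would end up in the "unknown source" panic. */
static bool sketch_unknown_source_panic(const enum sketch_severity *sev,
					size_t n, int tolerant)
{
	return sketch_worst(sev, n) <= SKETCH_KEEP_SEVERITY && tolerant < 3;
}
```

If every entry is graded NO or KEEP, for example because an EN=0 bank
was graded "Not enabled", the worst grade stays at or below KEEP and we
hit the uninformative panic. One entry graded higher avoids it.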

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.