Message-ID: <20101011212006.GB23882@redhat.com>
Date: Mon, 11 Oct 2010 17:20:06 -0400
From: Don Zickus <dzickus@...hat.com>
To: Huang Ying <ying.huang@...el.com>
Cc: Ingo Molnar <mingo@...e.hu>, "H. Peter Anvin" <hpa@...or.com>,
linux-kernel@...r.kernel.org, Andi Kleen <andi@...stfloor.org>,
Robert Richter <robert.richter@....com>
Subject: Re: [PATCH -v3 5/6] x86, NMI, treat unknown NMI as hardware error

On Sat, Oct 09, 2010 at 02:49:46PM +0800, Huang Ying wrote:
> In general, an unknown NMI is used by hardware and firmware to notify
> the OS of fatal hardware errors. So Linux should treat an unknown NMI
> as a hardware error and panic on it for better error containment.
>
> But some broken hardware generates unknown NMIs for reasons other
> than hardware errors. To support these machines, a whitelist
> mechanism is provided so that an unknown NMI is treated as a hardware
> error only on systems known to work.
>
> These systems are identified via the presence of APEI HEST or the
> PCI ID of the host bridge. The PCI ID of the host bridge is used
> instead of the DMI ID, so that the check can be done based on the
> platform type instead of the motherboard. This should be simpler and
> sufficient.
>
> The method to identify the platforms is designed by Andi Kleen.

I don't have any major problems with the other patches in the patch
series. In fact, I would like to get them committed somewhere, so we can
continue building on them.

> @@ -366,6 +368,15 @@ unknown_nmi_error(unsigned char reason,
>  	if (notify_die(DIE_NMIUNKNOWN, "nmi", regs, reason, 2, SIGINT) ==
>  			NOTIFY_STOP)
>  		return;
> +	/*
> +	 * On some platforms, hardware errors may be notified via
> +	 * unknown NMI
> +	 */
> +	if (unknown_nmi_as_hwerr)
> +		panic(
> +		"NMI for hardware error without error record: Not continuing\n"
> +		"Please check BIOS/BMC log for further information.");
> +
>  #ifdef CONFIG_MCA
>  	/*
>  	 * Might actually be able to figure out what the guilty party

The only quibble I have left is with the above piece, which basically
comes down to a philosophical difference: Robert and I believe it should
be on the die_chain, whereas Andi and you would like to see it explicitly
called out.

If we move to a new notifier chain, like we discussed in another thread,
would you guys be willing to move this into that new notifier chain or is
your argument still going to stand?
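
To make the two options concrete, here is a minimal userspace sketch of
the notifier-chain variant. It is not kernel code and not what the patch
does: every sketch_* name is hypothetical, the chain is a plain array
rather than the kernel's notifier_block list, and the panic is recorded
in a flag instead of halting. It only illustrates the idea that a
lowest-priority handler on the chain would give the same "panic if
nobody claimed the NMI" behaviour as the explicit check.

```c
#include <assert.h>
#include <stddef.h>

/* Return values modelled on the kernel's NOTIFY_DONE / NOTIFY_STOP. */
enum { SKETCH_NOTIFY_DONE = 0, SKETCH_NOTIFY_STOP = 1 };

typedef int (*sketch_handler_fn)(unsigned char reason);

struct sketch_notifier {
	sketch_handler_fn call;
	int priority;		/* higher priority runs first */
};

#define SKETCH_MAX_HANDLERS 8

struct sketch_chain {
	struct sketch_notifier slots[SKETCH_MAX_HANDLERS];
	size_t count;
};

/* Insert a handler, keeping the array sorted by descending priority. */
static void sketch_chain_register(struct sketch_chain *chain,
				  sketch_handler_fn call, int priority)
{
	size_t i;

	assert(chain->count < SKETCH_MAX_HANDLERS);
	i = chain->count;
	while (i > 0 && chain->slots[i - 1].priority < priority) {
		chain->slots[i] = chain->slots[i - 1];
		i--;
	}
	chain->slots[i].call = call;
	chain->slots[i].priority = priority;
	chain->count++;
}

/* Walk the chain and stop at the first handler that claims the event. */
static int sketch_chain_call(const struct sketch_chain *chain,
			     unsigned char reason)
{
	size_t i;

	for (i = 0; i < chain->count; i++)
		if (chain->slots[i].call(reason) == SKETCH_NOTIFY_STOP)
			return SKETCH_NOTIFY_STOP;
	return SKETCH_NOTIFY_DONE;
}

/* Stand-in for the patch's unknown_nmi_as_hwerr whitelist flag. */
static int sketch_unknown_nmi_as_hwerr;
/* Records that we would have panicked (real code would call panic()). */
static int sketch_panicked;

/* Meant to be registered last: runs only if nobody else claimed the NMI. */
static int sketch_hwerr_handler(unsigned char reason)
{
	(void)reason;
	if (!sketch_unknown_nmi_as_hwerr)
		return SKETCH_NOTIFY_DONE;
	sketch_panicked = 1;
	return SKETCH_NOTIFY_STOP;
}

/* Hypothetical driver that claims NMIs carrying reason 0x42. */
static int sketch_driver_handler(unsigned char reason)
{
	return reason == 0x42 ? SKETCH_NOTIFY_STOP : SKETCH_NOTIFY_DONE;
}
```

Registering sketch_hwerr_handler at a priority below every other
handler means an NMI claimed by a driver never reaches it, while an
unclaimed one does. The ordering then lives in the priority value
rather than in the position of the check inside unknown_nmi_error(),
which is essentially the trade-off being argued about.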

Cheers,
Don