Message-ID: <BANLkTimoqYZMs15H8wRqEEe=UyBJtAwipw@mail.gmail.com>
Date: Fri, 13 May 2011 21:17:13 +0800
From: huang ying <huang.ying.caritas@...il.com>
To: Don Zickus <dzickus@...hat.com>
Cc: Huang Ying <ying.huang@...el.com>, Ingo Molnar <mingo@...e.hu>,
linux-kernel@...r.kernel.org, Andi Kleen <andi@...stfloor.org>,
Robert Richter <robert.richter@....com>,
Andi Kleen <ak@...ux.intel.com>
Subject: Re: [RFC] x86, NMI, Treat unknown NMI as hardware error
Hi, Don,
On Fri, May 13, 2011 at 8:45 PM, Don Zickus <dzickus@...hat.com> wrote:
> On Fri, May 13, 2011 at 04:23:38PM +0800, Huang Ying wrote:
>> In general, unknown NMI is used by hardware and firmware to notify
>> fatal hardware errors to OS. So the Linux should treat unknown NMI as
>> hardware error and go panic upon unknown NMI for better error
>> containment.
>
> I have a couple of concerns about this patch. One I don't think BIOSes
> are ready for this. I have Intel Westmere boxes that say they have a
> valid HEST, GHES, and EINJ table, but when I inject an error there is no
> GHES record. This leaves me with an unknown NMI and panic. Yeah, it is a
> BIOS bug I guess, but I think vendors are going to be slow fixing all this
> stuff (my Nehalem box is in even worse shape with this stuff).

Although there is no GHES record, I think the Westmere box's behavior is
acceptable: the BIOS uses an unknown NMI to notify the OS of a hardware
error, and that is exactly the case this patch is meant to handle.

> Also, are there any known issues with x86_64 platforms with bad NMIs? RHEL
> has had unknown NMIs panic on x86_64 since x86_64 first came out, I don't
> recall any exceptions we had to add to handle 'quirky' hardware.
>
> Then for the i686 case, because the 'quirky' hardware is so old, can't we
> just leave it a kernel config option to switch between using a 'printk'
> vs. a 'panic'? Or even a kernel command line option.
>
> I figure these 'quirky' hardware machines are more the exception nowadays,
> do we really need to add code to whitelist machines?
>
> Granted I am not familiar enough with the quirky hardware (in fact I don't
> think I have seen any mainly because I haven't been around long enough).
> Most cases I see when trolling through the fedora bugzilla list for
> unknown NMIs, is just bad firmware or acpi power configurations.
>
> Just wondering if we could just simplify the patch somehow with better
> assumptions.

So there are still unknown NMIs on real hardware today. I am afraid that
turning on panic on unknown NMI by default may not be acceptable to some
users.
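
To make the trade-off concrete, here is a minimal sketch of the switch
suggested above: a build-time default that a command-line token can
override. It is a user-space model, not kernel code and not the patch
itself. UNKNOWN_NMI_PANIC_DEFAULT, the "unknown_nmi=" tokens, and all
function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical build-time default: 1 = panic, 0 = only log a message. */
#ifndef UNKNOWN_NMI_PANIC_DEFAULT
#define UNKNOWN_NMI_PANIC_DEFAULT 0
#endif

enum nmi_action { NMI_ACTION_LOG, NMI_ACTION_PANIC };

struct nmi_policy {
	bool panic_on_unknown;
};

/* Start from the build-time default. */
static struct nmi_policy nmi_policy_default(void)
{
	struct nmi_policy p = { .panic_on_unknown = UNKNOWN_NMI_PANIC_DEFAULT };
	return p;
}

/*
 * Apply one hypothetical command-line token to the policy.
 * Returns true if the token was recognised, false otherwise
 * (in which case the policy is left unchanged).
 */
static bool nmi_policy_parse_arg(struct nmi_policy *p, const char *arg)
{
	assert(p != NULL && arg != NULL);

	if (strcmp(arg, "unknown_nmi=panic") == 0) {
		p->panic_on_unknown = true;
		return true;
	}
	if (strcmp(arg, "unknown_nmi=log") == 0) {
		p->panic_on_unknown = false;
		return true;
	}
	return false;
}

/* Decide what to do when an NMI has no known source. */
static enum nmi_action unknown_nmi_action(const struct nmi_policy *p)
{
	return p->panic_on_unknown ? NMI_ACTION_PANIC : NMI_ACTION_LOG;
}
```

With the default left at 0, machines with quirky hardware keep the
'printk' behavior, and a user who wants strict error containment opts in
with the hypothetical "unknown_nmi=panic" token. Flipping the default to 1
reverses who has to opt out, which is the point under discussion.
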
Best Regards,
Huang Ying