Message-ID: <20110513135154.GB31888@redhat.com>
Date: Fri, 13 May 2011 09:51:54 -0400
From: Don Zickus <dzickus@...hat.com>
To: huang ying <huang.ying.caritas@...il.com>
Cc: Huang Ying <ying.huang@...el.com>, Ingo Molnar <mingo@...e.hu>,
linux-kernel@...r.kernel.org, Andi Kleen <andi@...stfloor.org>,
Robert Richter <robert.richter@....com>,
Andi Kleen <ak@...ux.intel.com>
Subject: Re: [RFC] x86, NMI, Treat unknown NMI as hardware error
On Fri, May 13, 2011 at 09:17:13PM +0800, huang ying wrote:
> Hi, Don,
>
> On Fri, May 13, 2011 at 8:45 PM, Don Zickus <dzickus@...hat.com> wrote:
> > On Fri, May 13, 2011 at 04:23:38PM +0800, Huang Ying wrote:
> >> In general, unknown NMI is used by hardware and firmware to notify
> >> the OS of fatal hardware errors. So Linux should treat unknown NMI as
> >> a hardware error and panic upon unknown NMI for better error
> >> containment.
> >
> > I have a couple of concerns about this patch. One, I don't think BIOSes
> > are ready for this. I have Intel Westmere boxes that say they have a
> > valid HEST, GHES, and EINJ table, but when I inject an error there is no
> > GHES record. This leaves me with an unknown NMI and panic. Yeah, it is a
> > BIOS bug I guess, but I think vendors are going to be slow fixing all this
> > stuff (my Nehalem box is in even worse shape with this stuff).
>
> Although there is no GHES record, I think the Westmere box behavior is
> acceptable: an unknown NMI is used by the BIOS to notify a hardware error,
> and this is what we want to deal with in this patch.
I don't think having HEST changes the situation. I agree with your
statement above, but I can also generate unknown NMIs by stressing perf.
Broken hardware usually generated NMIs; sometimes they propagated to the
CPU, other times they were swallowed by the chipset. That means having
HEST or not having HEST neither improves anything nor makes it any worse.
IOW, I don't think we gain anything with this patch.
>
> > Also, are there any known issues with x86_64 platforms with bad NMIs? RHEL
> > has had unknown NMIs panic on x86_64 since x86_64 first came out, and I
> > don't recall any exceptions we had to add to handle 'quirky' hardware.
> >
> > Then for the i686 case, because the 'quirky' hardware is so old, can't we
> > just leave it as a kernel config option to switch between using a 'printk'
> > vs. a 'panic'? Or even a kernel command line option.
> >
> > I figure these 'quirky' hardware machines are more the exception nowadays,
> > do we really need to add code to whitelist machines?
> >
> > Granted I am not familiar enough with the quirky hardware (in fact I don't
> > think I have seen any mainly because I haven't been around long enough).
> > Most cases I see when trolling through the Fedora bugzilla list for
> > unknown NMIs are just bad firmware or ACPI power configurations.
> >
> > Just wondering if we could just simplify the patch somehow with better
> > assumptions.
>
> So there are still unknown NMIs on real hardware now. I am afraid turning
> on panic on unknown NMI by default may not be acceptable for some users.
The opposite could be said too. I think that was Ingo's point. The
policy should be left in the hands of the user or distro because there is
no right answer.
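
A minimal user-space sketch of what such a user-selectable policy could
look like, in plain C rather than kernel code. The option string
"unknown_nmi=", the enum values, and both function names are hypothetical
and chosen only for illustration; they are not existing kernel interfaces.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical policy values; not actual kernel identifiers. */
enum unknown_nmi_policy {
    UNKNOWN_NMI_LOG,    /* report the NMI and keep running */
    UNKNOWN_NMI_PANIC   /* treat the NMI as a fatal hardware error */
};

/*
 * Pick the policy from a boot-style option string, falling back to the
 * build-time default (dflt) when no option is given. The substring match
 * is a simplification; a real parser would match whole tokens.
 */
static enum unknown_nmi_policy
parse_unknown_nmi_policy(const char *cmdline, enum unknown_nmi_policy dflt)
{
    if (cmdline && strstr(cmdline, "unknown_nmi=panic"))
        return UNKNOWN_NMI_PANIC;
    if (cmdline && strstr(cmdline, "unknown_nmi=log"))
        return UNKNOWN_NMI_LOG;
    return dflt;
}

/*
 * Apply the policy to one unknown NMI. Writes the message that would be
 * printed into buf and returns 1 if the policy asks for a panic, else 0.
 */
static int handle_unknown_nmi(enum unknown_nmi_policy policy,
                              unsigned char reason,
                              char *buf, size_t len)
{
    if (policy == UNKNOWN_NMI_PANIC) {
        snprintf(buf, len,
                 "unknown NMI, reason %02x: panic requested by policy",
                 reason);
        return 1;
    }
    snprintf(buf, len, "unknown NMI, reason %02x: logged, continuing",
             reason);
    return 0;
}
```

The point is that the default passed as dflt would be whatever the distro
builds in, and the option string lets a user override it either way.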
Cheers,
Don