Message-ID: <AANLkTi=7kG+PGrQFkWETBSu37SPYxXeXNihB2eBFc6sG@mail.gmail.com>
Date: Fri, 24 Sep 2010 19:50:16 +0800
From: huang ying <huang.ying.caritas@...il.com>
To: Don Zickus <dzickus@...hat.com>
Cc: Andi Kleen <andi@...stfloor.org>,
Huang Ying <ying.huang@...el.com>, Ingo Molnar <mingo@...e.hu>,
"H. Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC 1/6] x86, NMI, Add symbol definition for NMI magic constants

On Thu, Sep 23, 2010 at 10:16 PM, Don Zickus <dzickus@...hat.com> wrote:
> On Thu, Sep 23, 2010 at 05:29:57PM +0800, huang ying wrote:
>> Hi, Don,
>>
>> On Thu, Sep 23, 2010 at 12:07 AM, Don Zickus <dzickus@...hat.com> wrote:
>> > On Wed, Sep 22, 2010 at 12:19:16AM +0200, Andi Kleen wrote:
>> >>
>> >> >
>> >> > I guess adding either another knob to override the hardware error option
>> >> > or tying it in with the panic_on_unknown_error option might make me more
>> >> > comfortable. That way enterprise customers can always just enable it by
>> >> > default and desktop users (for now) could have it off.
>> >>
>> >> Anything that needs explicit enabling is a bad idea, that
>> >> would lead to a lot of users running in "corrupt my data" mode.
>> >
>> > I know. But as I said earlier in my emails, I am trying to figure out how
>> > to deal with the fallout from unknown nmis turning into panics. Today
>> > people see unknown nmis. They may or may not be corrupting data. They
>> > usually file a bug. Currently it is hard for me to diagnose the problem.
>> > Usually the old 'upgrade your bios/firmware' does the trick. Sometimes it
>> > doesn't and people feel like their machines still run fine. So they
>> > ignore it (for good or for bad).
>> >
>> > Turning unknown nmis into panics would break their current setup without
>> > much gain. So I was trying to propose something temporarily until we
>> > could get a better infrastructure to isolate the source and provide better
>> > info on what to do.
>> >
>> > I agree with you that long term unknown nmis should be panics. I just get
>> > nervous about doing that now from a support perspective.
>>
>> In fact, we use a white list policy here. Only systems that have HEST or
>> are identified by their chipset host bridge PCI ID will panic on an unknown
>> NMI. So I think the systems you are worried about will not have this enabled.
>>
>> >> The code currently uses the presence of a HEST error table
>> >> to detect a server. HEST should be only available on servers.
>> >>
>> >> On servers at least panic should be default.
>> >
>> > Ok. That's fine. But then what. What does a developer do with that
>> > panic? There's no useful info. That is sorta my problem. Then again I
>> > do not know much about HEST.
>>
>> On some systems, there is a hardware error log in the BMC/BIOS, which can
>> be retrieved via IPMI or the BIOS menu. Otherwise, can we get any useful
>> info for an unknown NMI? If we can, can we collect the info, then print it
>> on the console and save it to flash via ERST (also part of APEI) before
>> the panic?
>
> Ok. Does the BIOS/BMC automatically do this? Can we just print a message
> on panic saying to check your BIOS/BMC logs for more info?

Yes. The BIOS/BMC does that automatically, and I will add it to the panic
message.

> I would love to add code to gather more useful info for unknown NMIs, but
> is it expected that HEST does some of this? I guess what I am trying to
> figure out is: if we are going to put in intelligence to detect a
> HEST-enabled machine and panic when an unknown NMI comes along (presumably
> from HEST??), then can we leverage HEST at all to understand why the NMI
> happened, or point the user to the BIOS/BMC to get more info? In other
> words, what value do we get from HEST other than "we detect it's there,
> let's panic"?

Yes. HEST can be used to report some hardware error information. I am
working on that now.

>> HEST is defined in the APEI (ACPI Platform Error Interface) section of
>> ACPI spec 4.0 and later. It describes the error sources of the system.
>> It should be available only on server platforms.
>
> Ok. Does the kernel have intelligence to use it or the BIOS yet?

HEST works in a cooperative way between the kernel and the BIOS. I am
working on a HEST driver that will be notified on NMI and will collect the
error information reported by the BIOS. But it is possible that some systems
have only a BMC/BIOS log and report nothing to the OS except an unknown NMI.
The unknown NMI panic logic is for these systems.
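
To make the white list policy concrete, here is a rough sketch of the
decision. It is only an illustration and not the code from the patch set:
every name in it (struct platform_info, hb_white_list, unknown_nmi_policy)
is hypothetical, and the PCI IDs are placeholders rather than real chipset
IDs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform description; not a real kernel structure. */
struct platform_info {
	bool has_hest;        /* firmware provides an ACPI HEST table */
	uint16_t hb_vendor;   /* host bridge PCI vendor ID */
	uint16_t hb_device;   /* host bridge PCI device ID */
};

struct hb_id {
	uint16_t vendor;
	uint16_t device;
};

/* Placeholder IDs for illustration only; not real chipsets. */
static const struct hb_id hb_white_list[] = {
	{ 0x1234, 0x0001 },
	{ 0x1234, 0x0002 },
};

/* Return true if the host bridge appears in the white list. */
static bool host_bridge_white_listed(const struct platform_info *p)
{
	size_t i;

	for (i = 0; i < sizeof(hb_white_list) / sizeof(hb_white_list[0]); i++) {
		if (hb_white_list[i].vendor == p->hb_vendor &&
		    hb_white_list[i].device == p->hb_device)
			return true;
	}
	return false;
}

enum nmi_action {
	NMI_LOG_ONLY,   /* keep today's behaviour: report and continue */
	NMI_PANIC,      /* treat the unknown NMI as a hardware error */
};

/* Panic only on systems with HEST or a white-listed host bridge. */
static enum nmi_action unknown_nmi_policy(const struct platform_info *p)
{
	if (p->has_hest || host_bridge_white_listed(p))
		return NMI_PANIC;
	return NMI_LOG_ONLY;
}
```

With this shape, a system that has neither HEST nor a listed host bridge
keeps the current "log and continue" behaviour, which is the case you are
worried about.
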
Best Regards,
Huang Ying