linux-kernel - Re: [RFC 1/6] x86, NMI, Add symbol definition for NMI magic constants


Message-ID: <AANLkTi=7kG+PGrQFkWETBSu37SPYxXeXNihB2eBFc6sG@mail.gmail.com>
Date:	Fri, 24 Sep 2010 19:50:16 +0800
From:	huang ying <huang.ying.caritas@...il.com>
To:	Don Zickus <dzickus@...hat.com>
Cc:	Andi Kleen <andi@...stfloor.org>,
	Huang Ying <ying.huang@...el.com>, Ingo Molnar <mingo@...e.hu>,
	"H. Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC 1/6] x86, NMI, Add symbol definition for NMI magic constants

On Thu, Sep 23, 2010 at 10:16 PM, Don Zickus <dzickus@...hat.com> wrote:
> On Thu, Sep 23, 2010 at 05:29:57PM +0800, huang ying wrote:
>> Hi, Don,
>>
>> On Thu, Sep 23, 2010 at 12:07 AM, Don Zickus <dzickus@...hat.com> wrote:
>> > On Wed, Sep 22, 2010 at 12:19:16AM +0200, Andi Kleen wrote:
>> >>
>> >> >
>> >> > I guess adding either another knob to override the hardware error option
>> >> > or tying it in with the panic_on_unknown_error option might make me more
>> >> > comfortable.  That way enterprise customers can always just enable it by
>> >> > default and desktop users (for now) could have it off.
>> >>
>> >> Anything that needs explicit enabling is a bad idea, that
>> >> would lead to a lot of users running in "corrupt my data" mode.
>> >
>> > I know.  But as I said earlier in my emails, I am trying to figure out how
>> > to deal with the fallout from unknown nmis turning into panics.  Today
>> > people see unknown nmis.  They may or may not be corrupting data.  They
>> > usually file a bug.  Currently it is hard for me to diagnose the problem.
>> > Usually the old 'upgrade your bios/firmware' does the trick.  Sometimes it
>> > doesn't and people feel like their machines still run fine.  So they
>> > ignore it (for good or for bad).
>> >
>> > Turning unknown nmis into panics would break their current setup without
>> > much gain.  So I was trying to propose something temporarily until we
>> > could get a better infrastructure to isolate the source and provide better
>> > info on what to do.
>> >
>> > I agree with you that long term unknown nmis should be panics.  I just get
>> > nervous about doing that now from a support perspective.
>>
>> In fact, we use a whitelist policy here. Only systems with HEST or
>> identified by chipset host bridge PCI ID will panic on an unknown NMI.
>> So I think the systems you are worried about will not have this enabled.
>>
>> >> The code currently uses the presence of a HEST error table
>> >> to detect a server. HEST should be only available on servers.
>> >>
>> >> On servers at least panic should be default.
>> >
>> > Ok.  That's fine. But then what.  What does a developer do with that
>> > panic?  There's no useful info.  That is sorta my problem.  Then again I
>> > do not know much about HEST.
>>
>> On some systems, there is a hardware error log in the BMC/BIOS. That
>> log can be retrieved via IPMI or the BIOS menu. Otherwise, can
>> we get any useful info for an unknown NMI? If we can, can we collect the
>> info, then print it on the console and save it to flash via ERST (part
>> of APEI too) before the panic?
>
> Ok.  Does the BIOS/BMC automatically do this?  Can we just print a message
> on panic saying to check your BIOS/BMC logs for more info?

Yes, the BIOS/BMC does that automatically, and I will add that hint to
the panic message.

> I would love to add code to gather more useful info for unknown NMIs, but
> is it expected that HEST does some of this?  I guess what I am trying to
> figure out, if we are going to put intelligence to detect a HEST enabled
> machine and panic when unknown NMI comes along (presumably from HEST??),
> then can we leverage HEST at all to understand why the NMI happened or
> point the user to the BIOS/BMC to get more info.  In other words, what
> value do we get from HEST other than "we detect it's there, let's panic"?

Yes. HEST can be used to report some hardware error information. I am
working on that now.

>> HEST is defined in ACPI spec 4.0 and later, in the section named
>> APEI (ACPI Platform Error Interface). It describes the error
>> sources of the system. It should be available only on server platforms.
>
> Ok.  Does the kernel have intelligence to use it or the BIOS yet?

HEST works through cooperation between the kernel and the BIOS. I am
working on a HEST driver that will be notified on NMI and will collect
the error information reported by the BIOS. But it is possible that some
systems keep only a BMC/BIOS log and report nothing to the OS beyond an
unknown NMI. The unknown NMI panic logic is for these systems.
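
To make the whitelist policy discussed above concrete, here is a rough
userspace sketch in C. It is not the code from this patch series. All
names (struct platform_info, panic_whitelist, unknown_nmi_should_panic,
and so on) are hypothetical, and the PCI IDs are placeholders because no
real host bridge IDs are given in this thread. It only shows the
decision: panic on an unknown NMI when HEST is present or the host
bridge is on the whitelist, and point the user at the BIOS/BMC logs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a host bridge on the whitelist. */
struct host_bridge_id {
	uint16_t vendor;
	uint16_t device;
};

/* Placeholder IDs only; real entries are not given in this thread. */
static const struct host_bridge_id panic_whitelist[] = {
	{ 0xffff, 0x0001 },
	{ 0xffff, 0x0002 },
};

/* Hypothetical summary of what the platform probe found. */
struct platform_info {
	bool has_hest;		/* ACPI HEST table is present */
	uint16_t hb_vendor;	/* host bridge PCI vendor ID */
	uint16_t hb_device;	/* host bridge PCI device ID */
};

/* True if the host bridge matches a whitelist entry. */
static bool host_bridge_whitelisted(const struct platform_info *p)
{
	size_t n = sizeof(panic_whitelist) / sizeof(panic_whitelist[0]);

	assert(p != NULL);
	for (size_t i = 0; i < n; i++) {
		if (panic_whitelist[i].vendor == p->hb_vendor &&
		    panic_whitelist[i].device == p->hb_device)
			return true;
	}
	return false;
}

/* Whitelist policy: panic only with HEST or a whitelisted host bridge. */
static bool unknown_nmi_should_panic(const struct platform_info *p)
{
	assert(p != NULL);
	return p->has_hest || host_bridge_whitelisted(p);
}

/* Message to emit for an unknown NMI under this policy. */
static const char *unknown_nmi_message(const struct platform_info *p)
{
	if (unknown_nmi_should_panic(p))
		return "NMI: unknown reason, possible hardware error; "
		       "check BIOS/BMC logs for more information";
	return "NMI: unknown reason; continuing";
}
```

With this shape, a desktop system without HEST and with a host bridge
that is not on the list keeps today's behaviour, which is the case Don
is worried about.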

Best Regards,
Huang Ying