linux-kernel - Re: [RFC 1/6] x86, NMI, Add symbol definition for NMI magic constants

Date:	Thu, 23 Sep 2010 17:29:57 +0800
From:	huang ying <huang.ying.caritas@...il.com>
To:	Don Zickus <dzickus@...hat.com>
Cc:	Andi Kleen <andi@...stfloor.org>,
	Huang Ying <ying.huang@...el.com>, Ingo Molnar <mingo@...e.hu>,
	"H. Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC 1/6] x86, NMI, Add symbol definition for NMI magic constants

Hi, Don,

On Thu, Sep 23, 2010 at 12:07 AM, Don Zickus <dzickus@...hat.com> wrote:
> On Wed, Sep 22, 2010 at 12:19:16AM +0200, Andi Kleen wrote:
>>
>> >
>> > I guess adding either another knob to override the hardware error option
>> > or tying it in with the panic_on_unknown_error option might make me more
>> > comfortable.  That way enterprise customers can always just enable it by
>> > default and desktop users (for now) could have it off.
>>
>> Anything that needs explicit enabling is a bad idea, that
>> would lead to a lot of users running in "corrupt my data" mode.
>
> I know.  But as I said earlier in my emails, I am trying to figure out how
> to deal with the fallout from unknown nmis turning into panics.  Today
> people see unknown nmis.  They may or may not be corrupting data.  They
> usually file a bug.  Currently it is hard for me to diagnose the problem.
> Usually the old 'upgrade your bios/firmware' does the trick.  Sometimes it
> doesn't and people feel like their machines still run fine.  So they
> ignore it (for good or for bad).
>
> Turning unknown nmis into panics would break their current setup without
> much gain.  So I was trying to propose something temporarily until we
> could get a better infrastructure to isolate the source and provide better
> info on what to do.
>
> I agree with you that long term unknown nmis should be panics.  I just get
> nervous about doing that now from a support perspective.

In fact, we use a white-list policy here. Only systems that have a HEST,
or whose chipset host bridge PCI ID is on the list, will panic on an
unknown NMI. So I think the systems you are worried about will not have
this enabled.
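
To make the white-list policy concrete, here is a minimal user-space
sketch of the decision, not the code from the patch set. All names
(struct platform_info, nmi_panic_whitelist, panic_on_unknown_nmi) and
the PCI IDs are hypothetical placeholders chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform description; not a real kernel structure. */
struct platform_info {
	bool     has_hest;           /* firmware exposes an ACPI HEST table */
	uint16_t host_bridge_vendor; /* PCI vendor ID of the host bridge */
	uint16_t host_bridge_device; /* PCI device ID of the host bridge */
};

struct pci_id {
	uint16_t vendor;
	uint16_t device;
};

/* Placeholder white list: these IDs are made up for illustration. */
static const struct pci_id nmi_panic_whitelist[] = {
	{ 0x1234, 0x0001 },
	{ 0x1234, 0x0002 },
};

/* Return true if the host bridge matches an entry on the white list. */
static bool host_bridge_whitelisted(const struct platform_info *p)
{
	size_t n = sizeof(nmi_panic_whitelist) / sizeof(nmi_panic_whitelist[0]);

	for (size_t i = 0; i < n; i++) {
		if (nmi_panic_whitelist[i].vendor == p->host_bridge_vendor &&
		    nmi_panic_whitelist[i].device == p->host_bridge_device)
			return true;
	}
	return false;
}

/*
 * Panic on an unknown NMI only if the platform has a HEST or its host
 * bridge is white-listed; every other system keeps today's behaviour.
 */
static bool panic_on_unknown_nmi(const struct platform_info *p)
{
	assert(p != NULL);
	return p->has_hest || host_bridge_whitelisted(p);
}
```

With this shape, a desktop without HEST and with an unlisted host bridge
never takes the panic path, which is the case you are worried about.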

>> The code currently uses the presence of a HEST error table
>> to detect a server. HEST should be only available on servers.
>>
>> On servers at least panic should be default.
>
> Ok.  That's fine. But then what.  What does a developer do with that
> panic?  There's no useful info.  That is sorta my problem.  Then again I
> do not know much about HEST.

On some systems, there is a hardware error log in the BMC/BIOS, which
can be retrieved via IPMI or the BIOS menu. Otherwise, can we get any
useful info for an unknown NMI? If we can, could we collect that info,
print it on the console, and save it to flash via ERST (also part of
APEI) before the panic?

HEST (Hardware Error Source Table) is defined in the APEI (ACPI Platform
Error Interface) section of ACPI spec 4.0 and later. It describes the
error sources of the system, and should be available only on server
platforms.

Best Regards,
Huang Ying