linux-kernel - Re: [RFC] x86, NMI, Treat unknown NMI as hardware error

Message-ID: <4DF56934.9000705@intel.com>
Date:	Mon, 13 Jun 2011 09:34:44 +0800
From:	Huang Ying <ying.huang@...el.com>
To:	Don Zickus <dzickus@...hat.com>
CC:	Andi Kleen <ak@...ux.intel.com>,
	Cyrill Gorcunov <gorcunov@...il.com>,
	huang ying <huang.ying.caritas@...il.com>,
	Ingo Molnar <mingo@...e.hu>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Andi Kleen <andi@...stfloor.org>,
	Robert Richter <robert.richter@....com>
Subject: Re: [RFC] x86, NMI, Treat unknown NMI as hardware error

On 06/09/2011 08:09 PM, Don Zickus wrote:
> On Fri, May 20, 2011 at 04:13:25PM +0800, Huang Ying wrote:
>> Hi, Don,
>>
>> On 05/18/2011 03:07 AM, Don Zickus wrote:
>>> On Tue, May 17, 2011 at 11:18:59AM -0700, Andi Kleen wrote:
>>>>> Random thought, in the Firmware first mode of HEST (which is the only way
>>>>> GHES records get produced??), does an SCI happen first to jump into the
>>>>> firmware for processing, then an NMI?
>>>>
>>>> Either that or there is a separate service processor which handles it.
>>>> Presumably it depends a lot on the particular system.
>>>
>>> Ah interesting.  I was going to suggest somehow setting a bit when an SCI
>>> comes in and check that bit in the unknown NMI path as a possible hint
>>> that the NMI might be related to HEST (sorta how we flag unknown NMIs in
>>> the perf code).
>>>
>>> It was just an idea.  Obviously a service processor will make that more
>>> difficult. :-)
>>
>> Hmm, what's the conclusion?  Do you think unknown NMI should be seen as
>> hardware error?  At least on some white listed machines?
> 
> I still sorta have the opinion that a hardware error should be
> recognizable either through a GHES record or a bit in the southbridge.
> Whereas an unknown NMI is something lost and has no owner as the result of
> either a buggy NMI handler or an unimplemented NMI handler.
> 
> Yeah, I can see hardware errors coming in through an unknown NMI but to me
> (from what I am reading about APEI/GHES) those should be trapped
> by the firmware and if they aren't then the firmware is broken.  In those
> cases it should be up to the OEM to provide proper firmware (even certify
> them) to allow the proper experience, which includes being properly
> trapped by an NMI handler.
> 
> Perhaps I am a bit naive in my belief but I am a little nervous about
> panicking all the time on unknown NMIs when we are still chasing missed
> perf NMIs on a loaded box.

I think things SHOULD go this way too.  It is just not the reality.

Best Regards,
Huang Ying