linux-kernel - Re: [RFC] x86, NMI, Treat unknown NMI as hardware error

Message-ID: <4DD1809D.8090403@gmail.com>
Date:	Mon, 16 May 2011 23:53:01 +0400
From:	Cyrill Gorcunov <gorcunov@...il.com>
To:	Don Zickus <dzickus@...hat.com>
CC:	Huang Ying <ying.huang@...el.com>,
	huang ying <huang.ying.caritas@...il.com>,
	Ingo Molnar <mingo@...e.hu>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Andi Kleen <andi@...stfloor.org>,
	Robert Richter <robert.richter@....com>,
	Andi Kleen <ak@...ux.intel.com>
Subject: Re: [RFC] x86, NMI, Treat unknown NMI as hardware error

On 05/16/2011 11:03 PM, Don Zickus wrote:
> On Mon, May 16, 2011 at 09:09:45AM +0800, Huang Ying wrote:
>>>   Ying, the concern is rather related to the code scheme in general. Since
>>> we have notifiers, I think the better way is to be consistent here and use
>>> a hwerr notifier too. But that's just IMHO ;)
>>
>> As for whether to go through notifiers or not, IMHO a rule can be:
>>
>> - If it is something like a driver, then it should go through a notifier
>> - If it is an architectural/PC de facto standard, it can sit outside of
>> the notifier chain.
> 
> Hmm, then what do you do about perf?  That is architectural and a de facto
> standard, but I am not sure hardcoding that would be appropriate.

  Good point!

> 
>>
>> I think that treating an unknown NMI as a hardware error should be part of
>> the PC de facto standard.  Do you think so?
> 
> Well after thinking about it, I would say no.  And my reason is, if
> vendors are really serious about using NMIs as an indicator for hardware
> errors, shouldn't they be setting a bit in the memory controller/north
> bridge or south bridge/IOHC for an NMI handler to read?  I mean hardware

  The UV platform has such a bit, iirc :)

> devices don't just get wired directly to the NMI pin on the cpu, right?
> They generally have to go through some hub that acts as a multiplexer.
> 
> In those cases, why can't those hubs set a bit saying it detected an error
> (don't PCIe bridges already do that?) and let the NMI handler read it to
> confirm.  This way we can leave 'unknown NMIs' as a way to say an
> unclaimed NMI entered the system and we can have users set policy about
> what to do, panic, printk, whatever.
> 
> But for the HEST stuff, it should be smart enough by now to trap any
> hardware error, no?  How does a machine that supports HEST let a hardware
> error get through without detecting it?  Isn't that the point?  Detect a
> hardware error, grab as much info about it as possible, save the error
> record and then panic?
> 
> Otherwise, if you just panic, then you have no idea why the machine errored
> in the first place.  It might be the safe thing to do in some
> circumstances, but then you have to wonder why the fancy HEST-enabled
> server didn't catch it.  Isn't that what people are spending extra money
> on those Intel servers with RAS features for?
> 
> Cheers,
> Don
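
  To make the scheme Don describes above a bit more concrete, here is a
minimal user-space sketch (not kernel code). Every name in it is
hypothetical: hub_status stands in for a status register in a bridge/hub,
and the handler table stands in for a notifier chain; neither is the
kernel's actual API. Each handler reads its status bit to decide whether
the NMI is its own, and only an NMI nobody claims reaches the user-set
"unknown NMI" policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; this is not the kernel notifier API. */
enum nmi_result  { NMI_UNCLAIMED = 0, NMI_CLAIMED = 1 };
enum nmi_policy  { POLICY_LOG, POLICY_PANIC };
enum nmi_outcome { OUTCOME_HANDLED, OUTCOME_LOGGED, OUTCOME_PANIC };

#define HUB_ERR_BIT (1u << 0)

/* Stand-in for a status register in a bridge/hub that raised the NMI. */
static uint32_t hub_status;

typedef enum nmi_result (*nmi_handler_t)(void);

#define MAX_HANDLERS 8
static nmi_handler_t handlers[MAX_HANDLERS];
static size_t nr_handlers;

/* Append a handler to the (simulated) chain. */
void register_nmi_handler_sketch(nmi_handler_t h)
{
	assert(nr_handlers < MAX_HANDLERS);
	handlers[nr_handlers++] = h;
}

/* Claims the NMI only if the hub latched an error; clears the bit as an ack. */
enum nmi_result hub_error_handler(void)
{
	if (!(hub_status & HUB_ERR_BIT))
		return NMI_UNCLAIMED;
	hub_status &= ~HUB_ERR_BIT;
	return NMI_CLAIMED;
}

/* Walk the chain; if nobody claims the NMI, apply the user-set policy. */
enum nmi_outcome dispatch_nmi(enum nmi_policy policy)
{
	for (size_t i = 0; i < nr_handlers; i++)
		if (handlers[i]() == NMI_CLAIMED)
			return OUTCOME_HANDLED;

	return policy == POLICY_PANIC ? OUTCOME_PANIC : OUTCOME_LOGGED;
}
```

  With hub_error_handler registered and HUB_ERR_BIT set, dispatch_nmi()
reports the NMI as handled whatever the policy is; with the bit clear, the
same call falls through to the policy (log or panic), which is the
"unclaimed NMI" case in the text above.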

-- 
            Cyrill