Message-ID: <4DD1809D.8090403@gmail.com>
Date: Mon, 16 May 2011 23:53:01 +0400
From: Cyrill Gorcunov <gorcunov@...il.com>
To: Don Zickus <dzickus@...hat.com>
CC: Huang Ying <ying.huang@...el.com>,
huang ying <huang.ying.caritas@...il.com>,
Ingo Molnar <mingo@...e.hu>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Andi Kleen <andi@...stfloor.org>,
Robert Richter <robert.richter@....com>,
Andi Kleen <ak@...ux.intel.com>
Subject: Re: [RFC] x86, NMI, Treat unknown NMI as hardware error
On 05/16/2011 11:03 PM, Don Zickus wrote:
> On Mon, May 16, 2011 at 09:09:45AM +0800, Huang Ying wrote:
>>> Ying, the concern is rather related to the code scheme in general. Since
>>> we have notifiers, I think the better way is to be consistent here and use
>>> a hwerr notifier too. But that's just IMHO ;)
>>
>> As for whether to use notifiers or not, IMHO a rule can be:
>>
>> - If it is something like a driver, then it should use a notifier
>> - If it is an architectural/PC de facto standard, it can sit outside of
>> the notifier.
>
> Hmm, then what do you do about perf? That is architectural and a de facto
> standard, but I am not sure hardcoding that would be appropriate.
Good point!
>
>>
>> I think that treating an unknown NMI as a hardware error should be part of
>> the PC de facto standard. Do you think so?
>
> Well after thinking about it, I would say no. And my reason is, if
> vendors are really serious about using NMIs as an indicator for hardware
> errors, shouldn't they be setting a bit in the memory controller/north
> bridge or south bridge/IOHC for an NMI handler to read? I mean hardware
The UV platform has such a bit, iirc :)
> devices don't just get wired directly to the NMI pin on the cpu, right?
> They generally have to go through some hub that acts as a multiplexer.
>
> In those cases, why can't those hubs set a bit saying it detected an error
> (don't PCIe bridges already do that?) and let the NMI handler read it to
> confirm. This way we can leave 'unknown NMIs' as a way to say an
> unclaimed NMI entered the system and we can have users set policy about
> what to do, panic, printk, whatever.
>
> But for the HEST stuff, it should be smart enough by now to trap any
> hardware error, no? How does a machine that supports HEST let a hardware
> error get through without detecting it? Isn't that the point? Detect a
> hardware error, grab as much info about it as possible, save the error
> record and then panic?
>
> Otherwise if you just panic, then you have no idea why the machine errored
> in the first place. It might be the safe thing to do in some
> circumstances, but then you have to wonder why the fancy HEST enabled
> server didn't catch it. Isn't that why people are spending extra money
> on those Intel servers with RAS features?
>
> Cheers,
> Don
--
Cyrill