linux-kernel - Re: [RFC] x86, NMI, Treat unknown NMI as hardware error

Message-ID: <20110520115802.GI14745@elte.hu>
Date:	Fri, 20 May 2011 13:58:02 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Huang Ying <ying.huang@...el.com>
Cc:	Don Zickus <dzickus@...hat.com>,
	huang ying <huang.ying.caritas@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Andi Kleen <andi@...stfloor.org>,
	Robert Richter <robert.richter@....com>,
	Andi Kleen <ak@...ux.intel.com>, Borislav Petkov <bp@...en8.de>
Subject: Re: [RFC] x86, NMI, Treat unknown NMI as hardware error


* Huang Ying <ying.huang@...el.com> wrote:

> On 05/17/2011 04:53 PM, Ingo Molnar wrote:
> > 
> > * Huang Ying <ying.huang@...el.com> wrote:
> > 
> >> On 05/16/2011 07:29 PM, Ingo Molnar wrote:
> >>>
> >>> * Don Zickus <dzickus@...hat.com> wrote:
> >>>
> >>>> On Fri, May 13, 2011 at 05:20:33PM +0200, Ingo Molnar wrote:
> >>>>>
> >>>>> * huang ying <huang.ying.caritas@...il.com> wrote:
> >>>>>
> >>>>>>> What should be done instead is to add an event for unknown NMIs, which can 
> >>>>>>> then be processed by the RAS daemon to implement policy.
> >>>>>>>
> >>>>>>> By using 'active' event filters it could even be set on a system to panic 
> >>>>>>> the box by default.
> >>>>>>
> >>>>>> If there is a real fatal hardware error, we may not have the luxury of going 
> >>>>>> from the NMI handler to a user-space RAS daemon to decide what to do. The 
> >>>>>> system may explode, or bad data may go to disk before that.
> >>>>>
> >>>>> That is why i suggested:
> >>>>>
> >>>>>   > > By using 'active' event filters it could even be set on a system to panic 
> >>>>>   > > the box by default.
> >>>>>
> >>>>> event filters are evaluated in the kernel, so the panic could be instantaneous, 
> >>>>> without the event having to reach user-space.
> >>>>
> >>>> Interesting.  Question though: what do you mean by 'event filtering'?  Is 
> >>>> that different than setting 'unknown_nmi_panic' on the command line or via 
> >>>> procfs?
> >>>>
> >>>> Or are you suggesting something like registering another callback on the 
> >>>> die_chain that looks for DIE_NMIUNKNOWN as the event, swallows it and 
> >>>> implements the policy?  That way only HEST related platforms would 
> >>>> register the callback, while others would keep the default 'Dazed and 
> >>>> confused' messages?
> >>>
> >>> The idea is that "event filters", which are an existing upstream feature and 
> >>> which can be used in rather flexible ways:
> >>>
> >>>   http://lkml.org/lkml/2011/4/27/660
> >>>
> >>> could be used to trigger non-standard policy action as well - such as to panic 
> >>> the box.
> >>>
> >>> This would replace various very limited /debugfs and /sys event filtering hacks 
> >>> (and hardcoded policies) such as arch/x86/kernel/cpu/mcheck/mce-severity.c, and 
> >>> it would allow nonstandard behavior like 'panic the box on unknown NMIs' as 
> >>> well.
> >>>
> >>> This could be set by the RAS daemon, and it could be propagated to the kernel 
> >>> boot line as well, where event filter syntax would look like this:
> >>>
> >>>   events=nmi::unknown:"if (reason == 0) panic();"
> >>>
> >>> (Where the 'reason' field of the NMI event is the current legacy 'reason' value 
> >>> there.)
> >>>
> >>> The filter code would have to be modified to be able to recognize the panic() 
> >>> bit, but that's desirable anyway and it is a one-time effort.
> >>>
> >>> This:
> >>>
> >>>   events=nmi::unknown:"if (reason == 0) ignore();"
> >>>
> >>> would be a possible outcome as well, on certain boxes - to skip certain events.
> >>
> >> We can already determine in the kernel whether an NMI is unknown.  If you want 
> >> to push all unknown NMI logic into user space (although I don't think that is 
> >> the best solution), is it not sufficient to just check the system in user space 
> >> (via PCI ID or DMI ID, etc) and set the existing "unknown_nmi_panic" accordingly?
> > 
> > yeah - no need to push the 'reason' if it's not needed.
> > 
> > We want the kernel defaults to be sane - i.e. this is not to 'push' anything to 
> > user-space in a forced way, this is to make *optional*, different policy action 
> > possible to configure.
> 
> OK.  Then, what is the proper default behavior?  We think the Linux kernel
> should treat an unknown NMI as a hardware error report, at least on some
> modern machines (via a white list).  Do you agree?

No, i do not agree *at all*.

We are seeing cases of spurious NMIs again and again. Crashing boxes should be 
a niche thing, something you can configure if you want to, but the kernel should 
not default to it until NMI demultiplexing becomes more robust - and i doubt it 
ever will.

Thanks,

	Ingo
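
The policy discussed in this thread (a non-crashing kernel default, with panic
or ignore available only as an optional, configured filter action) can be
modelled roughly as below. This is a user-space illustration only, not kernel
code and not the event filter implementation: Action, NmiEvent, Filter and
decide() are hypothetical names invented for the sketch, and 'reason' merely
stands in for the legacy NMI reason value mentioned in the quoted text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Action(Enum):
    """Possible outcomes for an unknown NMI (hypothetical names)."""
    DEFAULT = "default"  # fall through to the built-in, non-crashing handling
    PANIC = "panic"      # opt-in: crash the box immediately
    IGNORE = "ignore"    # opt-in: silently skip the event


@dataclass
class NmiEvent:
    reason: int  # stand-in for the legacy NMI 'reason' value


@dataclass
class Filter:
    predicate: Callable[[NmiEvent], bool]  # e.g. lambda e: e.reason == 0
    action: Action                         # what to do when the predicate matches


def decide(event: NmiEvent, filters: List[Filter]) -> Action:
    """Return the action of the first matching filter, else the sane default."""
    for f in filters:
        if f.predicate(event):
            return f.action
    return Action.DEFAULT
```

With no filters configured, decide() always returns Action.DEFAULT, which
mirrors the position that crashing should be something an administrator opts
into. A filter such as Filter(lambda e: e.reason == 0, Action.PANIC) plays the
role of the 'if (reason == 0) panic();' example quoted above.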