Message-ID: <1287725037.2862.84.camel@yhuang-dev>
Date:	Fri, 22 Oct 2010 13:23:57 +0800
From:	Huang Ying <ying.huang@...el.com>
To:	Don Zickus <dzickus@...hat.com>
Cc:	Andi Kleen <andi@...stfloor.org>, Ingo Molnar <mingo@...e.hu>,
	"H. Peter Anvin" <hpa@...or.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Robert Richter <robert.richter@....com>,
	"peterz@...radead.org" <peterz@...radead.org>
Subject: Re: [PATCH -v3 5/6] x86, NMI, treat unknown NMI as hardware error

On Fri, 2010-10-22 at 10:56 +0800, Don Zickus wrote:
> On Fri, Oct 22, 2010 at 10:05:10AM +0800, Huang Ying wrote:
> > > > > Well, do you have an alternative way to handle broken hardware?  Broken
> > > > > hardware has generated NMIs, sometimes if I am lucky SERRs.  The ones that
> > > > > generate SERRs can be filtered through a different path, but what about
> > > > > the ones that don't?
> > > > > 
> > > > 
> > > > Don, AFAIK you're saying the same thing as Ying: an unknown NMI is 
> > > > a hardware error.
> > > > 
> > > > The reason the hardware does that is that it wants to tell us:
> > > > 
> > > > "I lost track of an error. There is corrupted data somewhere in the system.
> > > > Please stop, don't do anything that could consume that data. S.O.S."
> > > > 
> > > > The correct answer for that is panic.
> > > 
> > > After re-reading Huang's patch, I am starting to understand what you mean
> > > by broken hardware.  Basically you are trying to distinguish between
> > > legacy systems that were 'broken' in the sense they would randomly send
> > > unknown NMIs for no good reason, hence the 'Dazed and confused' messages,
> > > and hardware errors on more modern systems that say, 'Hardware error,
> > > panicking, check your BIOS for more info' (or whatever).
> > 
> > Yes.
> > 
> > > So Huang's patch was sort of acting like a switch.  On legacy systems use
> > > 'Dazed and confused' for unknown NMIs.  Whereas on whitelisted modern
> > > systems use a more relevant 'Check BIOS for error' message.  Is that
> > > right?
> > 
> > In fact we want to panic with 'check BIOS for error, contact your
> > hardware vendor' on all systems. But as you said, there is some
> > 'broken hardware' that randomly sends unknown NMIs for no good reason.
> > So a white list is used for them. And in fact not all pre-Nehalem
> > machines are 'broken'.
> 
> Ok, I think I finally understand what you guys are trying to do.  I also
> can't see a problem with it.  

Thanks.

> Though I think the patch could probably use
> some clean up to make it more clear.  Off the top of my head perhaps a
> function call that sets the variable unknown_nmi_as_hwerr instead of
> setting it explicitly and maybe structuring unknown_nmi() with an if-then
> modern-message; else legacy-message; to possibly make it obvious what the
> code is trying to achieve.

OK. Will do it.
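
A minimal userspace sketch of the structure suggested above, not the actual patch. The flag name unknown_nmi_as_hwerr comes from the thread; the setter name set_unknown_nmi_as_hwerr(), the nmi_action enum, the buffer-based signature, and the exact message wording are assumptions made only for illustration. The real kernel code would call printk() and panic() instead of returning a value.

```c
#include <stdbool.h>
#include <stdio.h>

/* Flag named in the thread; here a plain static in a userspace mock. */
static bool unknown_nmi_as_hwerr;

/* Hypothetical setter, so that callers do not assign the flag directly. */
void set_unknown_nmi_as_hwerr(bool enable)
{
	unknown_nmi_as_hwerr = enable;
}

/* Hypothetical result type standing in for "panic" vs. "keep running". */
enum nmi_action {
	NMI_ACTION_CONTINUE,
	NMI_ACTION_PANIC,
};

/*
 * Mock of unknown_nmi() with the if/else layout proposed in the thread:
 * one branch for the hardware-error message, one for the legacy message.
 * The message is written into buf instead of being printed.
 */
enum nmi_action unknown_nmi(unsigned char reason, char *buf, size_t len)
{
	if (unknown_nmi_as_hwerr) {
		/* Modern path: treat the unknown NMI as a hardware error. */
		snprintf(buf, len,
			 "NMI for hardware error: check BIOS for error, "
			 "contact your hardware vendor");
		return NMI_ACTION_PANIC;
	} else {
		/* Legacy path: report and try to continue. */
		snprintf(buf, len,
			 "NMI received for unknown reason %02x: "
			 "dazed and confused, but trying to continue",
			 reason);
		return NMI_ACTION_CONTINUE;
	}
}
```

With this layout, the code that decides how a machine should be treated would call set_unknown_nmi_as_hwerr() once, and unknown_nmi() itself only picks between the two messages.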

> And yeah, I know not all pre-Nehalem machines are broken.  I am usually
> sarcastic when I mention that, just because at IDF last year I got
> the impression that pre-Nehalem machines were considered the dark ages.
> :-)

Haha

> I am actually curious to know how many x86_64 machines would be considered
> broken?

Don't know either.

> > > That's why you guys are complaining that registering a die_notifier would
> > > be silly?
> > 
> > I think whether to go with die_notifier or unknown_nmi_error() depends
> > on whether the handling is general or specific to some hardware. Do you
> > agree with that?
> 
> Well I am hoping the only general case would be the one you want to use
> now.  Everything else would be specific and require a die_notifier.  I
> mean how many different ways do we want to have a printk/panic in
> unknown_nmi()?

I think this one should be the only one for general unknown NMI
processing.

Best Regards,
Huang Ying

