linux-kernel - Re: [PATCH -v2 6/7] x86, NMI, Add support to notify hardware error with unknown NMI

Message-ID: <1285636761.20791.133.camel@yhuang-dev>
Date:	Tue, 28 Sep 2010 09:19:21 +0800
From:	Huang Ying <ying.huang@...el.com>
To:	Robert Richter <robert.richter@....com>
Cc:	huang ying <huang.ying.caritas@...il.com>,
	Don Zickus <dzickus@...hat.com>, Ingo Molnar <mingo@...e.hu>,
	"H. Peter Anvin" <hpa@...or.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Andi Kleen <andi@...stfloor.org>
Subject: Re: [PATCH -v2 6/7] x86, NMI, Add support to notify hardware error
 with unknown NMI

On Mon, 2010-09-27 at 21:38 +0800, Robert Richter wrote:
> On 27.09.10 08:47:53, huang ying wrote:
> 
> > >>  arch/x86/kernel/hwerr.c    |   55 +++++++++++++++++++++++++++++++++++++++++++++
> > >
> > > Instead of creating this file the code should be implemented in
> > >
> > >  arch/x86/kernel/cpu/intel.c
> > >
> > > Similar AMD NB code is implemented in amd.c and k8.c.
> > 
> > Why? This file is not vendor specific.
> 
> No, it only implements an Intel specific PCI device, nothing else.

You can add AMD-specific PCI devices here too. We will add more device
IDs in the future.

> > >> +late_initcall(check_unknown_nmi_for_hwerr);
> > >
> > > Maybe you can use early pci functions like read_pci_config() to avoid
> > > late init.
> > 
> > > I don't think late init is a big issue. Hardware errors are rare after all.
> 
> Just want to let you know this as an option.
> 
> > >> --- a/arch/x86/kernel/traps.c
> > >> +++ b/arch/x86/kernel/traps.c
> > >> @@ -83,6 +83,8 @@ EXPORT_SYMBOL_GPL(used_vectors);
> > >>
> > >>  static int ignore_nmis;
> > >>
> > >> +int unknown_nmi_for_hwerr;
> > >
> > > If it is an nmi for hwerr, it is no longer an unknown nmi. So we
> > > > should drop 'unknown' in the naming.
> > 
> > > I think an unknown NMI is one whose source we cannot identify.
> > > Something like anonymous.
> > 
> > >> +
> > >>  /*
> > >>   * Prevent NMI reason port (0x61) being accessed simultaneously, can
> > >>   * only be used in NMI handler.
> > >> @@ -360,6 +362,14 @@ io_check_error(unsigned char reason, str
> > >>  static notrace __kprobes void
> > >>  unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
> > >>  {
> > >> +     /*
> > >> +      * On some platforms, hardware errors may be notified via
> > >> +      * unknown NMI
> > >> +      */
> > >> +     if (unknown_nmi_for_hwerr)
> > >> +             panic("NMI for hardware error without error record: "
> > >> +                   "Not continuing");
> > >> +
> > >
> > > Instead of checking this flag you should implement and register an nmi
> > > handler for this case.
> > 
> > I think explicit function calls have better readability than notifier chains.
> 
> What is different to unknown_nmi() then?
> 
> So no, in your case you want to catch unknown nmis for a certain
> hardware and then throw a panic.

No. We do NOT catch unknown NMIs for specific hardware here. We put the
code here because we think it is general rather than hardware-specific.

Treating an unknown NMI as a hardware error should be the general rule.
But to avoid confusing users who have broken hardware (which generates
unknown NMIs that are not caused by hardware errors), we use a white list
(machines with HEST, or a known-working chipset identified by PCI ID).
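
To illustrate the white list idea, here is a rough user-space sketch. It
is not the code from the patch. The names hwerr_pci_id, hwerr_whitelist,
chipset_whitelisted() and unknown_nmi_means_hwerr() are hypothetical, the
device IDs are placeholders, and HEST detection is reduced to a boolean
argument. The result stands in for the unknown_nmi_for_hwerr flag quoted
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical white list entry: a chipset identified by PCI vendor/device ID. */
struct hwerr_pci_id {
	uint16_t vendor;
	uint16_t device;
};

/*
 * Placeholder entries for illustration only. The vendor IDs are the
 * usual Intel (0x8086) and AMD (0x1022) ones; the device IDs are made up.
 */
static const struct hwerr_pci_id hwerr_whitelist[] = {
	{ 0x8086, 0x0001 },	/* hypothetical Intel chipset */
	{ 0x1022, 0x0002 },	/* hypothetical AMD chipset */
};

/* Return true if the given chipset appears in the white list. */
bool chipset_whitelisted(uint16_t vendor, uint16_t device)
{
	size_t n = sizeof(hwerr_whitelist) / sizeof(hwerr_whitelist[0]);

	for (size_t i = 0; i < n; i++) {
		if (hwerr_whitelist[i].vendor == vendor &&
		    hwerr_whitelist[i].device == device)
			return true;
	}
	return false;
}

/*
 * Decide once, at init time, whether an unknown NMI should be treated
 * as a hardware error: either the machine has HEST, or its chipset is
 * known to work and is listed by PCI ID.
 */
bool unknown_nmi_means_hwerr(bool has_hest, uint16_t vendor, uint16_t device)
{
	return has_hest || chipset_whitelisted(vendor, device);
}
```

The point of the sketch is that vendor-specific entries only add rows to
the table; the decision itself stays generic, and machines outside the
white list keep the existing unknown NMI behavior.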

Best Regards,
Huang Ying

