linux-kernel - Re: [RFC 5/6] x86, NMI, Add support to notify hardware error with unknown NMI

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20100913172438.37443bf7@basil.nowhere.org>
Date:	Mon, 13 Sep 2010 17:24:38 +0200
From:	Andi Kleen <andi@...stfloor.org>
To:	Don Zickus <dzickus@...hat.com>
Cc:	Huang Ying <ying.huang@...el.com>, Ingo Molnar <mingo@...e.hu>,
	"H. Peter Anvin" <hpa@...or.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [RFC 5/6] x86, NMI, Add support to notify hardware error with
 unknown NMI

Don,

> Unfortunately, most of the bugzillas I deal with, unkown NMIs are the
> result of SERRs.  While you can consider that hardware error
> reporting, the easiest way for me to debug those problems currently
> is to have reporters run 'lspci -vvv' after the NMI is displayed to
> figure out who caused the NMI.
> 
> My fear is that panic'ing the box on unknown NMIs on those platforms
> will hinder my ability to easily debug those NMIs.

The reason the NMI is sent is that there is a "lost" 
data corruption somewhere in the system and if you don't 
stop it the system the corruption may make it to disk,
become permanent etc. The hardware was designed
under the assumption that  the system is stopped by software
when this happens (same reason as why many machine
checks cause panics)

Then there isn't necessarily something to "debug": data corruption
can happen without any bugs being around (and in fact
that's the common case, assuming production systems)

So I'm not sure what you're debugging here. Are you being the support
technician for the system through bugzilla? That sounds
inefficient.

Anyways for hardware support we could probably dump some
more information at panic or better through error
serialization, but the word "debug" is IMHO an very wrong
way to think about that.

If this is about driver debugging it's entirely reasonable
to have a special setting (e.g. disable the panic), 
but the defaults should be set in a way to avoid
spreading data corruption,.

-Andi

-- 
ak@...ux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/