Message-ID: <20100913154750.GA26290@redhat.com>
Date:	Mon, 13 Sep 2010 11:47:50 -0400
From:	Don Zickus <dzickus@...hat.com>
To:	Andi Kleen <andi@...stfloor.org>
Cc:	Huang Ying <ying.huang@...el.com>, Ingo Molnar <mingo@...e.hu>,
	"H. Peter Anvin" <hpa@...or.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [RFC 5/6] x86, NMI, Add support to notify hardware error with
 unknown NMI

On Mon, Sep 13, 2010 at 05:24:38PM +0200, Andi Kleen wrote:
> 
> Don,
> 
> > Unfortunately, in most of the bugzillas I deal with, unknown NMIs are the
> > result of SERRs.  While you can consider that hardware error
> > reporting, the easiest way for me to debug those problems currently
> > is to have reporters run 'lspci -vvv' after the NMI is displayed to
> > figure out who caused the NMI.
> > 
> > My fear is that panic'ing the box on unknown NMIs on those platforms
> > will hinder my ability to easily debug those NMIs.
> 
> 
> The reason the NMI is sent is that there is a "lost"
> data corruption somewhere in the system, and if you don't
> stop the system the corruption may make it to disk,
> become permanent, etc. The hardware was designed
> under the assumption that the system is stopped by software
> when this happens (the same reason many machine
> checks cause panics).

Yeah, I know. I was being too ignorant perhaps.

> 
> Then there isn't necessarily something to "debug": data corruption
> can happen without any bugs being around (and in fact
> that's the common case, assuming production systems)
> 
> So I'm not sure what you're debugging here. Are you being the support
> technician for the system through bugzilla? That sounds
> inefficient.

The problem I repeatedly deal with for RHEL systems is a customer sees an
unknown NMI printed on their screen and sometimes the machine falls apart
shortly after, sometimes it doesn't.  Obviously they are going to file a
bug asking why.  A chunk of the problems are bad hardware/firmware.  But
the problem is which one.

Replacing a slot card is easy, replacing a motherboard is not.  So I
usually try to determine which device is failing by walking the PCI bus
and looking for the SERR bits or some of the PCIe status bits.
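
A rough sketch of that kind of bus walk, for illustration only. It assumes a
Linux box where sysfs exposes each device's config space under
/sys/bus/pci/devices/<addr>/config, and it only looks at the legacy 16-bit
PCI status register at offset 0x06 (the PCIe-specific status bits are not
covered). The function names are made up for this example; the bit values
are the standard PCI status register error bits.

```python
import glob
import os
import struct

# Offset of the 16-bit status register in PCI configuration space.
PCI_STATUS_OFFSET = 0x06

# Error-related bits of the PCI status register.
ERROR_BITS = {
    0x0100: "master data parity error",
    0x0800: "signaled target abort",
    0x1000: "received target abort",
    0x2000: "received master abort",
    0x4000: "signaled system error (SERR)",
    0x8000: "detected parity error",
}


def decode_status(status):
    """Return the names of the error bits set in a status register value."""
    return [name for bit, name in sorted(ERROR_BITS.items()) if status & bit]


def read_status(config_path):
    """Read the little-endian status register from a config-space file."""
    with open(config_path, "rb") as f:
        f.seek(PCI_STATUS_OFFSET)
        data = f.read(2)
    if len(data) != 2:
        raise ValueError("short read from " + config_path)
    return struct.unpack("<H", data)[0]


def scan(sysfs_root="/sys/bus/pci/devices"):
    """Map each device with error bits set to the list of those bits."""
    results = {}
    pattern = os.path.join(sysfs_root, "*", "config")
    for path in sorted(glob.glob(pattern)):
        try:
            status = read_status(path)
        except (OSError, ValueError):
            continue  # device vanished or config space unreadable
        flags = decode_status(status)
        if flags:
            results[os.path.basename(os.path.dirname(path))] = flags
    return results


if __name__ == "__main__":
    for device, flags in scan().items():
        print(device + ": " + ", ".join(flags))
```

Run after the NMI has been reported, this lists only the devices whose
status register still has an error bit set, which is roughly what one picks
out by eye from 'lspci -vvv' output.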

It is inefficient, but I haven't had time to figure out a way to clean it
up.  And just for the record, I usually see an unknown NMI report every
other week.

> 
> Anyway, for hardware support we could probably dump some
> more information at panic or, better, through error
> serialization, but the word "debug" is IMHO a very wrong
> way to think about that.

Well, I can use 'diagnose' or 'determine' if you want.  But at the end of
the day, we have customers that see scary software messages and expect us
to give them reasonable direction to fix their problems.

> 
> If this is about driver debugging it's entirely reasonable
> to have a special setting (e.g. disable the panic), 
> but the defaults should be set in a way to avoid
> spreading data corruption.

Ok.  I can accept that.

Cheers,
Don