Message-ID: <20110426223257.GB27953@sgi.com>
Date:	Tue, 26 Apr 2011 17:32:57 -0500
From:	Russ Anderson <rja@....com>
To:	"Eric W. Biederman" <ebiederm@...ssion.com>
Cc:	Borislav Petkov <bp@...64.org>, Ingo Molnar <mingo@...e.hu>,
	"H. Peter Anvin" <hpa@...or.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Tony Luck <tony.luck@...el.com>,
	EDAC devel <linux-edac@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Prarit Bhargava <prarit@...hat.com>,
	Nagananda Chumbalkar <Nagananda.Chumbalkar@...com>,
	rja@...ricas.sgi.com
Subject: Re: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier

On Tue, Apr 26, 2011 at 02:06:39PM -0700, Eric W. Biederman wrote:
> Borislav Petkov <bp@...64.org> writes:
> > On Mon, Apr 25, 2011 at 03:40:11PM -0400, Eric W. Biederman wrote:
> >> > From: Borislav Petkov <borislav.petkov@....com>
> >> > Date: Wed, 13 Apr 2011 14:32:06 +0200
> >> > Subject: [PATCH -v2.1 2/2] x86, MCE: Drop the default decoding notifier
> >> >
> >> > The default notifier doesn't make a lot of sense to call in the
> >> > correctable errors case. Drop it and emit the mcelog decoding hint only
> >> > in the uncorrectable errors case and when no notifier is registered.
> >> > Also, limit issuing the "mcelog --ascii" message in the rare case when
> >> > we dump unreported CEs before panicking.
> >> >
> >> > While at it, remove unused old x86_mce_decode_callback from the
> >> > header.
> >> 
> >> Can we please print or log something in the
> >> case of a correctable error, when we only report it via mcelog?
> >> 
> >> I have a stupid recent intel cpu here that hits that case and without
> >> the default x86_mce_decode_callback I wouldn't have even known that I am
> >> getting something like 50 correctable errors an hour on one of my
> >> machines.  In particular, it hits so often that I am seeing:
> >> "mce_notify_irq: 2 callbacks suppressed".  I need to get those dimms
> >> replaced soon because in a new product I simply can't imagine that many
> >> correctable errors.
> >
> > Isn't there a mcelog daemon or something that polls /dev/mcelog and
> > tells you about those DRAM ECCs in some log file where you're supposed
> > to look? :)
> 
> On fedora 14 there is a cron job that writes to /var/log/mcelog, and
> does not go through syslog.

Interesting.  I'm running fedora 14 and don't have a /var/log/mcelog
file or see an mcelog package (not that I'd looked until just now).

>                              But you have to be proactive and look
> there.  If the people who work on this code can't even remember
> where to look I can't imagine how anyone else can remember.
> Which is why I object to the removal of the one printk that told
> me something was broken on my machine.

Historically hardware error reporting has been very platform
dependent.  Those differences made it difficult to come up with
agreement on standard ways to report errors.  You raise a good
point that it needs to work better.

> So far, from what I have seen, /dev/mcelog and the userspace mcelog are
> overcomplicated and nearly useless.

/dev/mcelog is extremely useful to SGI.  As you said, "you have to 
be proactive and look there" which we are and do.  :-)

>                                     It seems to be focused around the
> notion that "This is not a software problem, please do not bug
> Andi Kleen about it"
> 
> Well it is a hardware problem so I do need to RMA that hardware.
> Sigh.

You raise a good issue that users do need to know when their 
hardware is having issues.

> Eric

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@....com
