Date:	Tue, 12 Apr 2011 21:24:03 -0500
From:	Russ Anderson <rja@....com>
To:	"Luck, Tony" <tony.luck@...el.com>
Cc:	Borislav Petkov <bp@...en8.de>,
	Prarit Bhargava <prarit@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"dzickus@...hat.com" <dzickus@...hat.com>,
	"mstowe@...hat.com" <mstowe@...hat.com>,
	"dnelson@...hat.com" <dnelson@...hat.com>, rja@...ricas.sgi.com
Subject: Re: [PATCH]: mce: don't print "human readable" message for corrected errors

On Tue, Apr 12, 2011 at 01:02:21PM -0700, Luck, Tony wrote:
> > Why not? This way you turn reporting of _ALL_ correctable MCEs
> > completely off and some users would actually like to run them through
> > mcelog on Intel.
> 
> pr_emerg() is rather overkill for a corrected error - on large systems
> corrected errors are going to be a routine occurrence (my personal estimation
> is "one soft error per gigabyte per month" ... which is pretty much the
> same as "one per terabyte per hour" for the people with the really cool
> toys).

Good point.
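
As a rough check of the arithmetic behind that estimate, here is a
minimal sketch.  The 30-day month and the 1024 GB per TB figure are my
assumptions, not part of Tony's estimate, and the function name is made
up for illustration.

```python
# Convert "soft errors per GB per month" into "soft errors per TB per hour".
# Assumptions (not from the original estimate): 30-day month, 1 TB = 1024 GB.
HOURS_PER_MONTH = 30 * 24
GB_PER_TB = 1024


def errors_per_tb_hour(errors_per_gb_month: float) -> float:
    """Scale a per-GB-per-month error rate to a per-TB-per-hour rate."""
    return errors_per_gb_month * GB_PER_TB / HOURS_PER_MONTH


if __name__ == "__main__":
    rate = errors_per_tb_hour(1.0)
    print(f"{rate:.2f} errors per TB per hour")  # prints 1.42
```

So one error per gigabyte per month works out to a bit over one per
terabyte per hour, which is the same order of magnitude as stated.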

> We are also setting TAINT_MACHINE_CHECK for corrected errors - perhaps
> this made sense when systems were small and machine checks were rare and
> scary.  But I think we need to start working with the reality that
> corrected errors are normal events.

I agree.  Corrected errors, by definition, carry data that the hardware
has already corrected.  There is no corruption, so there is no reason to
taint the kernel.  It would be like setting taint when one drive in a
RAID array goes bad.

It's worth noting that Linux does not set taint when it recovers from
_uncorrected_ memory errors on IA64 (by killing the application
that consumed the bad data and discarding the bad page).  Modern hardware
has enough error detection/correction code to avoid undetected data
corruption from memory errors.
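
The policy argued for above could be modeled roughly as follows.  This
is only an illustrative sketch, not kernel code: the Outcome names and
should_taint() are hypothetical, and treating an unrecovered error as
the one case that warrants taint is my assumption rather than something
stated in this thread.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical classification of a memory error event."""
    CORRECTED = "corrected"      # hardware corrected the data
    RECOVERED = "recovered"      # uncorrected, but the consumer was killed
                                 # and the bad page was discarded
    UNRECOVERED = "unrecovered"  # assumption: corruption may have spread


def should_taint(outcome: Outcome) -> bool:
    """Taint only when bad data may still be in use by the system."""
    return outcome is Outcome.UNRECOVERED


if __name__ == "__main__":
    for outcome in Outcome:
        print(outcome.value, "->", "taint" if should_taint(outcome) else "no taint")
```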


-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@....com