Message-ID: <20100519090323.GA18073@basil.fritz.box>
Date: Wed, 19 May 2010 11:03:24 +0200
From: Andi Kleen <andi@...stfloor.org>
To: "Eric W. Biederman" <ebiederm@...ssion.com>
Cc: Andi Kleen <andi@...stfloor.org>, Borislav Petkov <bp@...64.org>,
"Luck, Tony" <tony.luck@...el.com>,
Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>,
Mauro Carvalho Chehab <mchehab@...hat.com>,
"Young, Brent" <brent.young@...el.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Matt Domsch <Matt_Domsch@...l.com>,
Doug Thompson <dougthompson@...ssion.com>,
Joe Perches <joe@...ches.com>, Ingo Molnar <mingo@...e.hu>,
"bluesmoke-devel@...ts.sourceforge.net"
<bluesmoke-devel@...ts.sourceforge.net>,
Linux Edac Mailing List <linux-edac@...r.kernel.org>
Subject: Re: Hardware Error Kernel Mini-Summit
Hi Eric,
> I'm not ready to believe the average person that is running linux
> is too stupid to understand the difference between a hardware
> error and a software error.
Experience disagrees with you (maybe not about the average user,
but at least a significant portion doesn't make that distinction).
Also, as already mentioned, there are other reasons for it today.
>
> > But there's more to it now:
> >
> >> If your system isn't broken correctable errors are rare. People look
> >
> > Actually the more memory you have the more common they are.
> > And the trend is to more and more memory.
>
> The error rate should not be fixed per bit but should be roughly fixed
> per DIMM. If the error rate over time is fixed per bit we are in deep
> trouble.
Error rates of good DIMMs scale roughly with the number of transistors.
It's not the only influence, but it is a major one.
> > Really to do anything useful with them you need trends
> > and automatic actions (like predictive page offlining)
>
> Not at all, and I don't have a clue where you start thinking
> predictive page offlining makes the least bit of sense. Broken
> or even weak bits are rarely the common reason for ECC errors.
There are various studies that disagree with you on that.
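To clarify what I mean with predictive page offlining: roughly the
sketch below. Once a page has accumulated too many corrected errors,
ask the kernel to soft-offline it. This is only an illustration with
made-up names, assuming the hwpoison soft-offline sysfs file from
CONFIG_MEMORY_FAILURE is available; it is not the actual daemon code.

/* Illustration only: ask the kernel to soft-offline the page
 * containing phys_addr after it has accumulated too many corrected
 * errors.  Assumes CONFIG_MEMORY_FAILURE provides
 * /sys/devices/system/memory/soft_offline_page. */
#include <stdio.h>

static int request_soft_offline(unsigned long long phys_addr)
{
        FILE *f = fopen("/sys/devices/system/memory/soft_offline_page", "w");

        if (!f)
                return -1;
        /* The kernel migrates the page contents elsewhere and removes
         * the page from further allocation, so nothing is lost. */
        fprintf(f, "%#llx\n", phys_addr);
        return fclose(f);
}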
>
> > A log isn't really a good format for that
>
> A log is a fine format for realizing you have a problem. A
A low steady rate of corrected errors on a large system
is expected. In fact, if you look at the memory error log
of a large system (towards terabytes of memory), it nearly always
contains some memory-related events.
In this case a log is not really useful. What you need
are sensible thresholds and a good summary.
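By a threshold I mean something like the leaky bucket below. This is
just a simplified sketch with made-up names; the real thing would be
per DIMM or per page and configurable, not the actual mcelog code.

/* Simplified leaky-bucket threshold: corrected errors "leak" out of
 * the bucket over time, so a low steady background rate never
 * triggers, but a burst above the threshold does. */
#include <time.h>

#define THRESHOLD       10              /* errors allowed in the bucket */
#define LEAK_INTERVAL   (24 * 3600)     /* one error leaks out per day */

struct bucket {
        unsigned long   count;
        time_t          tstamp;
};

/* Returns 1 when the threshold is crossed and a summary/action
 * should be triggered, 0 for normal background noise. */
static int bucket_account(struct bucket *b, time_t now)
{
        unsigned long leaked = (now - b->tstamp) / LEAK_INTERVAL;

        if (leaked) {
                /* only leak whole intervals and only then advance
                 * the timestamp, so partial intervals are kept */
                b->count = leaked >= b->count ? 0 : b->count - leaked;
                b->tstamp = now;
        }
        return ++b->count >= THRESHOLD;
}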
> - Errors that occur frequently. That is broken hardware of one kind or
> another. I want to know about that so I can schedule down time to replace
> my memory before I get an uncorrected ECC error. Errors of this kind
> are likely happening frequently enough as to impact performance.
Same issue here: if something is truly broken it floods
you with errors.
Processing such a flood costs a lot of time and does not
actually tell you anything useful, because most errors in a flood
are similar.
Basically you don't care whether you have 100 or 1000 errors,
and you definitely don't want all of the errors filling up
your disk and using up your CPU.
Again, a threshold with an action is much more useful here.
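Concretely, instead of writing every single event to the log, I mean
something along these lines (again only a sketch, with made-up names):

/* Sketch of flood suppression: log only the first few corrected
 * errors per reporting interval, count the rest and later emit a
 * single summary line instead of thousands of near-identical ones. */
#include <stdio.h>

#define LOG_BURST 5     /* full records to log before suppressing */

struct err_stats {
        unsigned long long last_addr;   /* last affected physical address */
        unsigned long seen;             /* errors in the current interval */
        unsigned long logged;           /* errors actually written out */
};

static void corrected_error(struct err_stats *s, unsigned long long addr)
{
        s->last_addr = addr;
        if (++s->seen <= LOG_BURST) {
                s->logged++;
                printf("corrected memory error at %#llx\n", addr);
        }
        /* A periodic timer would then print one summary line like
         * "N corrected errors (last at <addr>), M suppressed"
         * and possibly feed the threshold/offlining logic above. */
}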
-Andi
--
ak@...ux.intel.com -- Speaking for myself only.