[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1599.1186974562@turing-police.cc.vt.edu>
Date: Sun, 12 Aug 2007 23:09:22 -0400
From: Valdis.Kletnieks@...edu
To: Folkert van Heusden <folkert@...heusden.com>
Cc: roland <devzero@....de>, linux-kernel@...r.kernel.org
Subject: Re: Software based ECC ?
On Sun, 12 Aug 2007 18:51:31 +0200, Folkert van Heusden said:
> a question and an idea: Q: is ecc guaranteed to detect all bitflips?
It depends on the exact ECC function the hardware implements. Usually it
provides performance such as:
"Correct all 1-bit errors. Detect all 2-bit errors, and most 3 and higher,
but not correct".
(Of course, "correct all 1 or 2 bit and detect all 3 bit" can be done, it
just takes more bits of ECC.)
> Idea: what about a multicore system (3 or more) that runs the same
> processes on 2 cores and a third core verifying that they both do the
> same? As I think it is not only ram that can become faulty.
This is actually done for high-reliability systems (Google for "tell me twice"
and "tell me three times"). The problem is that it takes a lot of extra
hardware. The G5 and later IBM Z-series mainframe chipsets (not to be confused with
the PowerPC G5) implemented dual computation units and a comparator that
signals a 'Machine Check' condition if the two CPUs don't end up in the
same exact state (as an added bonus, at the end of each instruction that
both *do* compare good, it latches the *entire* state of the CPU out,
and then does the following:
1) Retry the instruction on the same CPU - if it compares correctly, keep
going and flag a "soft" error.
2) If it still fails, read out the last "known good" status latch, and load
it into a spare CPU, and fire it up, and flag the failing one as bad.
http://www.research.ibm.com/journal/rd/435/spainhower.pdf
http://www.research.ibm.com/journal/rd/435/mueller.pdf
These guys have forgotten more about designing highly reliable systems than
most of us will ever know. ;)
Needless to say, not everybody is willing to pay the costs of the hardware
overhead of this approach.
Content of type "application/pgp-signature" skipped
Powered by blists - more mailing lists