linux-kernel - Re: Software based ECC ?


Message-ID: <1599.1186974562@turing-police.cc.vt.edu>
Date:	Sun, 12 Aug 2007 23:09:22 -0400
From:	Valdis.Kletnieks@...edu
To:	Folkert van Heusden <folkert@...heusden.com>
Cc:	roland <devzero@....de>, linux-kernel@...r.kernel.org
Subject: Re: Software based ECC ?

On Sun, 12 Aug 2007 18:51:31 +0200, Folkert van Heusden said:

> a question and an idea: Q: is ecc guaranteed to detect all bitflips?

It depends on the exact ECC function the hardware implements.  Typically it
guarantees something like:

"Correct all 1-bit errors.  Detect, but not correct, all 2-bit errors and
most errors of 3 or more bits."

(Of course, "correct all 1 or 2 bit and detect all 3 bit" can be done, it
just takes more bits of ECC.)
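
To make the "correct 1, detect 2" behaviour concrete, here is a minimal
sketch of an extended Hamming (8,4) SECDED code in Python.  It is a toy for
illustration only: real memory controllers use wider codes (and the exact
function varies by hardware, as noted above), and the names encode, decode
and flip are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]    # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]    # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]    # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1              # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all set bits
    odd_flips = sum(code) & 1            # 1 if the overall parity is violated
    if syndrome == 0 and not odd_flips:
        status = "ok"
    elif odd_flips:
        # Odd number of flips: assume exactly one and repair it.
        # Syndrome 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: a double error.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out
```

With this sketch, decode(flip(encode(n), p)) recovers n for any single
position p, and any two flipped positions are reported as "uncorrectable".
Three flips are a different story: this small code mistakes them for a
single-bit error and "corrects" to the wrong value, which is why detection
beyond two bits is not guaranteed in general.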

> Idea: what about a multicore system (3 or more) that runs the same
> processes on 2 cores and a third core verifying that they both do the
> same? As I think it is not only ram that can become faulty.

This is actually done for high-reliability systems (Google for "tell me twice"
and "tell me three times").  The problem is that it takes a lot of extra
hardware.  The G5 and later IBM Z-series mainframe chipsets (not to be confused
with the PowerPC G5) implemented dual computation units and a comparator that
signals a 'Machine Check' condition if the two units don't end up in the
same exact state.  As an added bonus, at the end of each instruction where
both *do* compare good, it latches the *entire* state of the CPU out.  On a
miscompare, it then does the following:

1) Retry the instruction on the same CPU - if it compares correctly, keep
going and flag a "soft" error.

2) If it still fails, read out the last "known good" status latch, and load
it into a spare CPU, and fire it up, and flag the failing one as bad.
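
The compare / retry / fail-over sequence above can be sketched in software
as follows.  This is only a rough model under stated assumptions, not how
the IBM hardware is actually built: LockstepCPU, glitch_on and run are
hypothetical names, machine state is a plain integer, an "instruction" is
any function from state to state, and a fault is simulated by flipping the
low bit of one unit's result.

```python
class LockstepCPU:
    """Hypothetical CPU with two computation units and a comparator."""

    def __init__(self, name, glitch_on=()):
        self.name = name
        self.glitch_on = set(glitch_on)  # attempt numbers on which unit A glitches
        self.attempts = 0

    def execute(self, state, instr):
        """Return the new state, or None if the two units disagree."""
        self.attempts += 1
        a = instr(state)                 # unit A
        b = instr(state)                 # unit B
        if self.attempts in self.glitch_on:
            a ^= 1                       # simulated bit flip in unit A
        return a if a == b else None


def run(program, state, cpus):
    """Run program on cpus[0], treating later entries as spares.

    Returns (final_state, log).
    """
    cpus = list(cpus)
    log = []
    good = state                         # last "known good" latched state
    for instr in program:
        result = cpus[0].execute(good, instr)
        if result is None:
            # Step 1: retry the instruction on the same CPU.
            result = cpus[0].execute(good, instr)
            if result is not None:
                log.append("soft error on " + cpus[0].name)
        while result is None:
            # Step 2: flag the CPU as bad and restart from the known-good
            # state on a spare.
            bad = cpus.pop(0)
            log.append(bad.name + " flagged bad")
            if not cpus:
                raise RuntimeError("no spare CPU left")
            result = cpus[0].execute(good, instr)
        good = result                    # both units agreed: latch the new state
    return good, log
```

For example, running a three-instruction program on
[LockstepCPU("A", glitch_on={2}), LockstepCPU("B")] logs one soft error and
finishes on A, while glitch_on={2, 3} makes the retry fail as well, so A is
flagged bad and B picks up from the last latched state.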

http://www.research.ibm.com/journal/rd/435/spainhower.pdf
http://www.research.ibm.com/journal/rd/435/mueller.pdf

These guys have forgotten more about designing highly reliable systems than
most of us will ever know. ;)

Needless to say, not everybody is willing to pay the costs of the hardware
overhead of this approach.  
