linux-kernel - Re: amd64 sata_nv (massive) memory corruption

Date:	Tue, 5 Aug 2008 18:21:19 +0100
From:	Alan Cox <alan@...rguk.ukuu.org.uk>
To:	linasvepstas@...il.com
Cc:	"John Stoffel" <john@...ffel.org>,
	"Alistair John Strachan" <alistair@...zero.co.uk>,
	linux-kernel@...r.kernel.org
Subject: Re: amd64 sata_nv (massive) memory corruption

> I've got the AMD 570 chipset, which is older than the
> amd76x that edac supports.  The latest MB's seem to have
> the AMD 790 chipset, which is also not currently supported.

AMD76x is very early 32bit so probably not..

The later AMD chipsets don't appear in the chipset-specific code because the
HyperTransport-era processors have on-processor memory controllers and
use MCE reporting for that, provided you have suitable memory etc.
Instead mcelog will decode them for you. The generic edac support for PCI
error scanning still applies.
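
Where an EDAC driver is loaded, its counters can be read from sysfs. A rough
sketch, assuming the usual /sys/devices/system/edac/mc/mcN layout with
ce_count and ue_count files; the function name read_edac_counts is made up
for illustration, and on a box with no EDAC driver it just returns nothing:

```python
import os

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each mc* entry under root."""
    counts = {}
    if not os.path.isdir(root):
        return counts  # no EDAC driver loaded, or not a Linux system
    for name in sorted(os.listdir(root)):
        if not name.startswith("mc"):
            continue
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                with open(os.path.join(root, name, fname)) as f:
                    entry[key] = int(f.read().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[name] = entry
    return counts

if __name__ == "__main__":
    for mc, c in read_edac_counts().items():
        print(mc, "corrected:", c["ce"], "uncorrected:", c["ue"])
```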

> Can anyone get me the portion of the AMD 570 (nVidia
> nForce 570) chip specs that describe the RAM ECC
> error event counters? (I assume that this chip has some
> sort of error reporting or counting registers) I can sign
> NDA if needed.

C|N>K I've never even been able to extract IDE controller docs from
nVidia..

> I'm game. Care to guide me through?  So: on every write, this
> new device mapper module computes a checksum and stores
> it somewhere. On every read, it computes a checksum and
> compares to the stored value. Easy enough I guess.
> 
> Several hard parts:
> -- where to store the checksums?

That is the million dollar question - plus you can argue it is the fs
that should do it. There is stuff crawling through the standards world to
provide a small per block additional info area on disk sectors.
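
To make the storage question concrete, here is a userspace sketch of the
write/verify idea with the checksums kept in a side table rather than next
to the data. Everything in it is an assumption for illustration: the class
name, the 512-byte block size, CRC32 as the checksum, and an in-memory
bytearray standing in for the disk. It is not device mapper code.

```python
import zlib

BLOCK_SIZE = 512  # assumed sector size for the sketch

class ChecksumMismatch(IOError):
    """Raised when a block's data does not match its stored checksum."""

class ChecksummedDevice:
    def __init__(self, num_blocks):
        self.data = bytearray(num_blocks * BLOCK_SIZE)  # stands in for the disk
        self.sums = {}  # side table: block number -> CRC32 of last write

    def write_block(self, n, payload):
        if len(payload) != BLOCK_SIZE:
            raise ValueError("payload must be exactly one block")
        self.data[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE] = payload
        self.sums[n] = zlib.crc32(payload)  # computed on every write

    def read_block(self, n):
        payload = bytes(self.data[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE])
        expected = self.sums.get(n)
        # Blocks never written through this layer have no checksum to check.
        if expected is not None and zlib.crc32(payload) != expected:
            raise ChecksumMismatch("block %d failed verification" % n)
        return payload
```

The side table is the performance problem in miniature: on a real device
it has to live on disk somewhere, so every data write implies a second
write to keep the checksum current.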

> -- what to do (besides print to dmesg) if there's a mismatch?

Configurable - panic/offline/warn ?
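
A sketch of what such a knob might look like, in userspace Python. The
names (MismatchPolicy, Device, handle_mismatch, the "blkcsum" logger) are
all hypothetical, and the exception merely stands in for a real panic:

```python
import enum
import logging

log = logging.getLogger("blkcsum")  # hypothetical logger name

class MismatchPolicy(enum.Enum):
    WARN = "warn"        # log and carry on
    OFFLINE = "offline"  # log and stop using the device
    PANIC = "panic"      # log and halt

class SimulatedPanic(RuntimeError):
    """Stands in for a kernel panic in this userspace sketch."""

class Device:
    def __init__(self, name):
        self.name = name
        self.online = True

def handle_mismatch(dev, block, policy):
    # Every policy logs; they differ only in what happens afterwards.
    log.error("%s: checksum mismatch on block %d", dev.name, block)
    if policy is MismatchPolicy.PANIC:
        raise SimulatedPanic("%s: block %d" % (dev.name, block))
    if policy is MismatchPolicy.OFFLINE:
        dev.online = False
```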

>    This suggests a new API:
> 
>    ++ "is this block device an md device?"
>    ++ "if yes to above, then give me alternate block"
>    ++ "invalidate copy n of block x"
>          (this last, because presumably one wants to tell md that
>          one of its copies is bad.)

It's the same as dm RAID hitting a physical read error. In the checksum
case you got the data back but it is wrong (so useless); in the read error
case you got nothing back.
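
Put another way, the mirror read path can treat both failures identically
and move on to the next copy. A userspace sketch under stated assumptions:
read_mirrored is a made-up name, each copy is modelled as a callable that
returns the block or raises OSError, and CRC32 is the checksum.

```python
import zlib

def read_mirrored(copies, expected_crc):
    """Return (data, bad) where bad lists the copies that failed before a good one.

    copies: list of zero-argument callables, one per mirror leg; each either
    returns the block as bytes or raises OSError for a physical read error.
    """
    bad = []
    for i, read in enumerate(copies):
        try:
            data = read()
        except OSError:
            bad.append(i)  # nothing came back
            continue
        if zlib.crc32(data) != expected_crc:
            bad.append(i)  # data came back but is wrong
            continue
        return data, bad
    raise OSError("no copy passed verification")
```

The returned list of bad copies is what a caller would use to invalidate
or rewrite the failed legs from the good data.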

> I'm paranoid enough to be willing to implement something like
> this .. is the above design on the right track?

Yes, if you can figure out where to keep the checksums without ruining
performance (and of course if there isn't one already lurking in device
mapper world, not yet submitted).