Date:	Tue, 5 Aug 2008 12:02:18 -0500
From:	"Linas Vepstas" <linasvepstas@...il.com>
To:	"Alan Cox" <alan@...rguk.ukuu.org.uk>
Cc:	"John Stoffel" <john@...ffel.org>,
	"Alistair John Strachan" <alistair@...zero.co.uk>,
	linux-kernel@...r.kernel.org
Subject: Re: amd64 sata_nv (massive) memory corruption

2008/8/3 Alan Cox <alan@...rguk.ukuu.org.uk>:

>> -- The bad ram passes memtest86+
>
> You are assuming bad RAM then not bad bus loadings, corrosion on the
> pins.. ?

Yes, probably bad timing due to bus loading, or bad impedance
due to a bad connector, or whatever.

> If you have a good enough pile of hardware and the right monitoring stuff
> loaded then you should get EDAC event logs

I've got the AMD 570 chipset, which is older than the
amd76x that EDAC supports.  The latest motherboards seem to
have the AMD 790 chipset, which is also not currently supported.

Can anyone get me the portion of the AMD 570 (nVidia
nForce 570) chip specs that describe the RAM ECC
error event counters? (I assume that this chip has some
sort of error reporting or counting registers) I can sign
NDA if needed.

> The more interesting approaches I think
> are the fs level ones where you accept the fact that hardware sucks and
> do end to end checksumming from the fs or even the app in some
> situations. We don't yet have that functionality mainstream although it
> might make an interesting device mapper module ...

I'm game. Care to guide me through?  So: on every write, this
new device mapper module computes a checksum of the block and
stores it somewhere. On every read, it recomputes the checksum
and compares it to the stored value. Easy enough, I guess.
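
To make the write/read path concrete, here is a rough user-space
sketch in Python. It is only a toy model of the idea, not device
mapper code: the class name, the 4096-byte block size, the choice of
CRC32 and the in-memory checksum table are all assumptions made for
illustration.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


class ChecksummedDevice:
    """Toy model of a checksumming layer sitting on top of a block device."""

    def __init__(self, num_blocks):
        # The "disk": a list of zero-filled blocks.
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        # Checksums live in a separate in-memory table here; where they
        # would really be stored is one of the open questions.
        self.checksums = [zlib.crc32(b) for b in self.blocks]

    def write(self, n, data):
        """Store one block and record its checksum."""
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self.blocks[n] = bytes(data)
        self.checksums[n] = zlib.crc32(data)

    def read(self, n):
        """Return one block, raising IOError if its checksum does not match."""
        data = self.blocks[n]
        if zlib.crc32(data) != self.checksums[n]:
            raise IOError(f"checksum mismatch on block {n}")
        return data
```

Flipping a bit in `dev.blocks[n]` behind the layer's back and then
calling `dev.read(n)` raises the `IOError`, which stands in for
whatever the real module would do on a mismatch.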

Several hard parts:
-- where to store the checksums?
-- what to do (besides print to dmesg) if there's a mismatch?
-- on an md raid-1, if there's a checksum error on one of the
   disks, then one could check the other disk to see if it's good.
   This suggests a new API:

   ++ "is this block device an md device?"
   ++ "if yes to above, then give me alternate block"
   ++ "invalidate copy n of block x"
         (this last, because presumably one wants to tell md that
         one of its copies is bad.)

  (Actually, the above API would be interesting for fsck too ..
   if fsck is failing with one copy from a raid set, it would
   be interesting to see if an alternate copy passes fsck.)

-- but perhaps the storage containing the checksums themselves
    was corrupted. Not sure what to do then. If the checksums
    are corrupted, I don't want to accidentally flag large portions
    of a block device as bad when they're actually good.
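
The raid-1 fallback could look roughly like the Python sketch below.
Everything in it is hypothetical: no such md API exists today, and
the function name and return convention are made up for illustration.
It also shows one possible heuristic for the corrupted-checksum case:
if every mirror copy is identical but none matches the stored
checksum, suspect the checksum rather than the data.

```python
import zlib


def read_with_mirror_fallback(copies, stored_crc):
    """Pick a good copy of a block from a set of mirror copies.

    copies     -- the same block as read from each mirror leg
    stored_crc -- the CRC32 recorded when the block was written

    Returns (data, bad_copy_indices, checksum_suspect).
    """
    good = [i for i, c in enumerate(copies) if zlib.crc32(c) == stored_crc]
    if good:
        # At least one leg is fine; the others are candidates for
        # "invalidate copy n of block x".
        bad = [i for i in range(len(copies)) if i not in good]
        return copies[good[0]], bad, False
    # No copy matches. If all legs agree with each other, the stored
    # checksum is the likelier culprit (a heuristic, not a proof).
    if len(copies) > 1 and all(c == copies[0] for c in copies):
        return copies[0], [], True
    raise IOError("no mirror copy matches the stored checksum")
```

With two legs, one of them corrupted, the call returns the good data
plus the index of the leg that md should be told to invalidate.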

An alternative would be file-level checksums built into the
file system. I'm not thrilled by this, because it fails to focus
on errors caused by bad hardware. It's also too close to
Tripwire-like functionality, and I don't want to get into
conversations about security, etc.

I'm paranoid enough to be willing to implement something like
this .. is the above design on the right track?

--linas
