Message-ID: <3ae3aa420808042229l675ffd79p42a5691532b7ac3b@mail.gmail.com>
Date:	Tue, 5 Aug 2008 00:29:30 -0500
From:	"Linas Vepstas" <linasvepstas@...il.com>
To:	"Robert Hancock" <hancockr@...w.ca>
Cc:	"John Stoffel" <john@...ffel.org>,
	"Alistair John Strachan" <alistair@...zero.co.uk>,
	linux-kernel@...r.kernel.org
Subject: Re: amd64 sata_nv (massive) memory corruption

2008/8/3 Robert Hancock <hancockr@...w.ca>:
> Linas Vepstas wrote:
>>
>> What I don't like is that the corruption was utterly silent --
[...]
>> So the question is: is there some sort of sata (or pci) "loopback mode",
>> where we could pump data through all of the busses and controllers, up
>> near to the point where it would normally go out to the serdes to the
>> disk,
>> but instead have it loop back, so that we could test the buses between
>> endpoints?
>
> I don't imagine that would be very useful in this case. The SATA link, PCI
> Express bus, HyperTransport bus all have parity or CRC error checking, so
> presumably they couldn't be likely to cause undetected errors. The
> transitions between them could cause problems,

Well, but I suffered badly from an undetected error, in the sense
that the operating system had no knowledge of it, and it corrupted
data on disk as a result. As Alan Cox suggests, perhaps I didn't
have EDAC turned on, or something ... I'm investigating now.
But this is moot -- if there is software that already exists that
could have reported the error to the kernel, then this software
should have been installed/enabled/operating by default.
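
For anyone checking the same thing, here is a minimal sketch of reading
the EDAC error counters, assuming an EDAC driver for the memory controller
is loaded and exposes the usual sysfs directory
/sys/devices/system/edac/mc. The function name read_edac_counts is made up
for illustration; only the ce_count and ue_count files are read.

```python
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} for each mc* dir.

    The default root is an assumption: it is where EDAC drivers normally
    publish their counters. A counter that cannot be read is reported as None.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc_dir, name)) as f:
                    entry[name] = int(f.read().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[os.path.basename(mc_dir)] = entry
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        # An empty result usually means no EDAC driver is bound.
        print("no EDAC memory controllers found")
    for mc, c in counts.items():
        print(mc, "corrected:", c["ce_count"], "uncorrected:", c["ue_count"])
```

An empty result only says that no EDAC driver is reporting; it does not by
itself tell you whether the board is actually using the ECC bits.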

> and most desktop machines
> don't have ECC memory which could catch memory timing problems or bad RAM

I'm unclear on ECC memory: if a motherboard "supports ECC",
does it mean it actually uses ECC bits in the bus between the
memory controller and the RAM?  Or does it simply mean that
it won't hang if I plug in ECC RAM (but otherwise ignore the bits)?

Personally I'm ready to pop $$$ for ECC if it will actually do
something for me; this has been painful.

--linas