Message-ID: <3ae3aa420808062132x52860092p9dee56705ba99a3@mail.gmail.com>
Date: Wed, 6 Aug 2008 23:32:06 -0500
From: "Linas Vepstas" <linasvepstas@...il.com>
To: "Martin K. Petersen" <martin.petersen@...cle.com>
Cc: "Alan Cox" <alan@...rguk.ukuu.org.uk>,
"John Stoffel" <john@...ffel.org>,
"Alistair John Strachan" <alistair@...zero.co.uk>,
linux-kernel@...r.kernel.org
Subject: Re: amd64 sata_nv (massive) memory corruption
2008/8/6 Martin K. Petersen <martin.petersen@...cle.com>:
>>>>>> "Linas" == Linas Vepstas <linasvepstas@...il.com> writes:
>
> [I got added to the CC: late in the game so I don't have the
> background on this discussion]
You haven't missed anything, other than that I've had my
umpteenth instance of data corruption in recent years, and
am up to my eyeballs in consumer-grade hardware from
which I would like to get enterprise-grade reliability.
Of course, being a cheapskate is what gets me into this
mess.
> ZFS and btrfs both support redundancy within the filesystem. They can
> fetch the good copy and fix the bad one. And they have much more
> context available for recovery than a RAID would.
My problem is that the corruption I see is "silent", so
redundancy is useless: I cannot distinguish good blocks
from bad. I'm running RAID, and one of the two disks
returns bad data. Without checksums, I can't tell which
version of a block is the good one.
> Linas> I assume that a device mapper can alter the number of blocks-in
> Linas> to the number of blocks-out; that it doesn't have to be
> Linas> 1-1. Then for every 10 sectors of data, it would use 11 sectors
> Linas> of storage, one holding the checksum. I'm very naive about how
> Linas> the block layer works, so I don't know what snags there might
> Linas> be.
>
> I did a proof of concept of this a couple of years ago, and
> performance was pretty poor.
Yes, I'm not surprised. For a home-use system, though,
I think I'm ready to sacrifice performance in exchange for
reliability. Much of what I do does not hit the disk hard.
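For concreteness, the address arithmetic for such a 10+1
layout is simple enough (a rough sketch only -- made-up
names, not real device-mapper code; the 10:1 ratio is just
the example above):

    /* Map a logical data sector onto a device that stores one
     * checksum sector after every 10 data sectors. */
    #define DATA_PER_GROUP  10   /* data sectors per group */
    #define GROUP_SIZE      11   /* data + 1 checksum sector */

    /* Physical sector holding the data for logical sector lsec. */
    static inline unsigned long long data_sector(unsigned long long lsec)
    {
            return (lsec / DATA_PER_GROUP) * GROUP_SIZE
                    + (lsec % DATA_PER_GROUP);
    }

    /* Physical sector holding the checksums for lsec's group. */
    static inline unsigned long long csum_sector(unsigned long long lsec)
    {
            return (lsec / DATA_PER_GROUP) * GROUP_SIZE
                    + DATA_PER_GROUP;
    }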
There is also an interesting possibility that offers a middle
ground between raw performance and safety: instead of
verifying checksums on *every* read access, it could be
enough to verify only every so often -- say, only one out
of every 10 reads, or maybe triggered by a cron job in
the middle of the night: turn on verification, touch a bunch
of files for an hour or two, turn off verification before 6AM.
This would be enough to trigger timely ill-health warnings
without impacting daytime use. (Much as I dislike the
corruption I suffered, I dislike even more that I had no
warning of it.)
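Sketched out, the read path might look something like this
(again, hypothetical names -- the sampling ratio and the
on/off switch are just the knobs described above; crc32()
as in <linux/crc32.h>):

    static int verify_enabled = 1;            /* flipped by cron at night */
    static unsigned int verify_interval = 10; /* check 1 read in 10 */
    static unsigned long read_count;

    /* Called on the read path with the data just read and the
     * checksum previously stored for it. */
    static int maybe_verify(const void *data, size_t len, u32 stored)
    {
            if (!verify_enabled)
                    return 0;
            if (++read_count % verify_interval)
                    return 0;
            if (crc32(0, data, len) != stored)
                    return -EIO;    /* the timely ill-health warning */
            return 0;
    }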
> The elegant part about filesystem checksums is that they are stored in
> the metadata blocks which are read anyway.
Yes.
> So there are no additional
> seeks, nor read-modify-write on a 10 sector + 1 blob of data.
I guess that, instead of writing 10+1 sectors and paying
the seek penalty, it might be faster to copy data around in
the kernel, so that the checksum can be stored in the same
sector as the data.
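E.g. (purely illustrative numbers and names): a 512-byte
physical sector could carry 508 bytes of payload plus a
4-byte CRC, so reads need no extra seek for checksums, at
the price of logical blocks straddling physical sectors and
every I/O becoming a copy:

    #define SECTOR_SIZE   512
    #define CSUM_SIZE     4                          /* e.g. CRC32 */
    #define PAYLOAD_SIZE  (SECTOR_SIZE - CSUM_SIZE)  /* 508 bytes */

    /* One on-disk sector: the checksum lives next to its data,
     * so no extra seek -- but a 512-byte logical block no longer
     * fits in one physical sector, hence the in-kernel copy. */
    struct csum_sector {
            unsigned char payload[PAYLOAD_SIZE];
            u32           csum;     /* checksum of payload[] */
    };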
> So, yes. You need special hardware. Controller and disk need to
> support DIX and DIF respectively. This has been in the works for a
> while and hardware is starting to materialize. Expect this to become
> standard fare in the SCSI/SAS/FC market segment.
Yes, well, my HBA is soldered onto my motherboard, and I'm
buying $80 hard drives one at a time at Fry's Electronics, so it
could be 5-10 years before DIX/DIF trickles down to
consumer-grade electronics. And I don't want to wait 5-10
years ...
Thus, a "tactical" solution seems to be pure-software
checksumming in a kernel device-mapper module,
performance be damned.
--linas