Message-ID: <3ae3aa420808062132x52860092p9dee56705ba99a3@mail.gmail.com>
Date: Wed, 6 Aug 2008 23:32:06 -0500
From: "Linas Vepstas" <linasvepstas@...il.com>
To: "Martin K. Petersen" <martin.petersen@...cle.com>
Cc: "Alan Cox" <alan@...rguk.ukuu.org.uk>,
"John Stoffel" <john@...ffel.org>,
"Alistair John Strachan" <alistair@...zero.co.uk>,
linux-kernel@...r.kernel.org
Subject: Re: amd64 sata_nv (massive) memory corruption
2008/8/6 Martin K. Petersen <martin.petersen@...cle.com>:
>>>>>> "Linas" == Linas Vepstas <linasvepstas@...il.com> writes:
>
> [I got added to the CC: late in the game so I don't have the
> background on this discussion]
You haven't missed anything, other than that I've had my
umpteenth instance of data corruption in recent years, and
am up to my eyeballs in consumer-grade hardware from
which I would like to get enterprise-grade reliability.
Of course, being a cheapskate is what gets me into this
mess.
> ZFS and btrfs both support redundancy within the filesystem. They can
> fetch the good copy and fix the bad one. And they have much more
> context available for recovery than a RAID would.
My problem is that the corruption I see is "silent", so
redundancy is useless: I cannot distinguish good blocks
from bad. I'm running RAID, and one of the two disks
returns bad data. Without checksums, I can't tell which
version of a block is the good one.
> Linas> I assume that a device mapper can alter the number of blocks-in
> Linas> to the number of blocks-out; that it doesn't have to be
> Linas> 1-1. Then for every 10 sectors of data, it would use 11 sectors
> Linas> of storage, one holding the checksum. I'm very naive about how
> Linas> the block layer works, so I don't know what snags there might
> Linas> be.
>
> I did a proof of concept of this a couple of years ago, and
> performance was pretty poor.
Yes, I'm not surprised. For a home-use system, though,
I think I'm ready to sacrifice performance in exchange for
reliability. Much of what I do does not hit the disk hard.
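For concreteness, the address arithmetic for such a 10+1
layout is simple enough (a rough sketch only -- made-up
names, not real device-mapper code; the 10:1 ratio is just
the example above):

    /* Map a logical data sector onto a device that stores one
     * checksum sector after every 10 data sectors. */
    #define DATA_PER_GROUP  10   /* data sectors per group */
    #define GROUP_SIZE      11   /* data + 1 checksum sector */

    /* Physical sector holding the data for logical sector lsec. */
    static inline unsigned long long data_sector(unsigned long long lsec)
    {
            return (lsec / DATA_PER_GROUP) * GROUP_SIZE
                    + (lsec % DATA_PER_GROUP);
    }

    /* Physical sector holding the checksums for lsec's group. */
    static inline unsigned long long csum_sector(unsigned long long lsec)
    {
            return (lsec / DATA_PER_GROUP) * GROUP_SIZE
                    + DATA_PER_GROUP;
    }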
There is also an interesting possibility that offers a middle
ground between raw performance and safety: instead of
verifying checksums on *every* read access, it could be
enough to verify only every so often -- say, only one out
of every 10 reads, or maybe triggered by a cron job in
the middle of the night: turn on verification, touch a bunch
of files for an hour or two, turn off verification before 6AM.
This would be enough to trigger timely ill-health warnings
without impacting daytime use. (Much as I dislike the
corruption I suffered, I dislike even more that I had no
warning of it.)
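Sketched out, the read path might look something like this
(again, hypothetical names -- the sampling ratio and the
on/off switch are just the knobs described above; crc32()
as in <linux/crc32.h>):

    static int verify_enabled = 1;            /* flipped by cron at night */
    static unsigned int verify_interval = 10; /* check 1 read in 10 */
    static unsigned long read_count;

    /* Called on the read path with the data just read and the
     * checksum previously stored for it. */
    static int maybe_verify(const void *data, size_t len, u32 stored)
    {
            if (!verify_enabled)
                    return 0;
            if (++read_count % verify_interval)
                    return 0;
            if (crc32(0, data, len) != stored)
                    return -EIO;    /* the timely ill-health warning */
            return 0;
    }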
> The elegant part about filesystem checksums is that they are stored in
> the metadata blocks which are read anyway.
Yes.
> So there are no additional
> seeks, nor read-modify-write on a 10 sector + 1 blob of data.
I guess that, instead of writing 10+1 sectors and paying
the seek penalty, it might be faster to copy data around in
the kernel, so that the checksum can be stored in the same
sector as the data.
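E.g. (purely illustrative numbers and names): a 512-byte
physical sector could carry 508 bytes of payload plus a
4-byte CRC, so reads need no extra seek for checksums, at
the price of logical blocks straddling physical sectors and
every I/O becoming a copy:

    #define SECTOR_SIZE   512
    #define CSUM_SIZE     4                          /* e.g. CRC32 */
    #define PAYLOAD_SIZE  (SECTOR_SIZE - CSUM_SIZE)  /* 508 bytes */

    /* One on-disk sector: the checksum lives next to its data,
     * so no extra seek -- but a 512-byte logical block no longer
     * fits in one physical sector, hence the in-kernel copy. */
    struct csum_sector {
            unsigned char payload[PAYLOAD_SIZE];
            u32           csum;     /* checksum of payload[] */
    };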
> So, yes. You need special hardware. Controller and disk need to
> support DIX and DIF respectively. This has been in the works for a
> while and hardware is starting to materialize. Expect this to become
> standard fare in the SCSI/SAS/FC market segment.
Yes, well, my HBA is soldered onto my motherboard, and I'm
buying $80 hard drives one at a time at Fry's Electronics, so it
could be 5-10 years before DIX/DIF trickles down to
consumer-grade electronics. And I don't want to wait 5-10
years ...
Thus, a "tactical" solution seems to be pure-software
checksumming in a kernel device-mapper module,
performance be damned.
--linas