Message-id: <20090901161828.GN4197@webber.adilger.int>
Date: Tue, 01 Sep 2009 10:18:28 -0600
From: Andreas Dilger <adilger@....com>
To: George Spelvin <linux@...izon.com>
Cc: david@...g.hm, pavel@....cz, linux-doc@...r.kernel.org,
linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:
On Aug 31, 2009 20:56 -0400, George Spelvin wrote:
> >> The more I learn about storage, the more I like idea of zfs. Given the
> >> subtle issues between filesystem and raid layer, integrating them just
> >> makes sense.
> >
> > Note that all that zfs does is tell you that you already lost data (and
> > then only if the checksumming algorithm would be invalid on a blank block
> > being returned), it doesn't protect your data.
>
> Obviously, there are limits, but it does provide useful protection:
> - You know where the missing data is.
> - The error isn't amplified by believing corrupted metadata
> - I seem to recall that ZFS does replicate metadata.
ZFS definitely does replicate data. At the lowest level it has RAID-1
and RAID-Z/Z2, which are pretty close to RAID-5/6 respectively, but with
the important difference that every write is a full-stripe-width write,
so that it is not possible for RAID-Z/Z2 to cause corruption due to a
partially-written RAID parity stripe.

In addition, for internal metadata blocks there are 1 or 2 duplicate
copies written to different devices, so that in case of a fatal device
corruption (e.g. double failure of a RAID-Z device) the metadata tree
is still intact.
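
To make the partially-written-stripe problem concrete, here is a minimal sketch in Python. It is a toy model, not ZFS or md code: the block contents, the xor_blocks helper, and the assumption that a full-stripe write is committed as one unit are all illustrative.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A stripe of three data blocks plus parity, consistent on disk.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Partial-stripe update: the new data block reaches disk, but a crash
# happens before the matching parity write.
data[1] = b"XXXX"
stale_parity = parity  # parity still describes the old stripe

# Later, disk 0 fails and is rebuilt from the survivors plus parity.
rebuilt_0 = xor_blocks([data[1], data[2], stale_parity])
raid5_corrupted = rebuilt_0 != b"AAAA"

# Full-stripe write: data and parity are computed together and (by
# assumption in this sketch) committed as one unit, so they always agree.
new_data = [b"AAAA", b"XXXX", b"CCCC"]
new_stripe = new_data + [xor_blocks(new_data)]
full_stripe_rebuilt_0 = xor_blocks(new_stripe[1:])

print(raid5_corrupted, full_stripe_rebuilt_0)
```

In the partial-write case the rebuilt block is garbage even though disk 0 itself was never written to; in the full-stripe case the rebuild returns the original block.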
> - Corrupted replicas can be "scrubbed" and rewritten from uncorrupted ones.
> - If you have some storage redundancy, it can try different mirrors
> to get the data back.
>
> In particular, on a RAID-5 system, ZFS tries dropping out each data disk
> in turn to see if the correct data can be reconstructed from the others
> + parity.
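
The "drop each data disk in turn" search described above can be sketched roughly as follows. This is only an illustration of the idea: SHA-256 stands in for whatever checksum the filesystem stores, and recover_stripe, xor_blocks and the block contents are hypothetical names, not ZFS interfaces.

```python
import hashlib
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (single-parity RAID)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def checksum(blocks):
    """Checksum over the whole stripe's data (stand-in algorithm)."""
    return hashlib.sha256(b"".join(blocks)).digest()

def recover_stripe(data, parity, expected):
    """Return (good data, index of the bad disk or None).

    Tries replacing each data block in turn with the value rebuilt from
    the other blocks plus parity, accepting the first combination whose
    checksum matches the stored one.
    """
    if checksum(data) == expected:
        return list(data), None
    for i in range(len(data)):
        others = data[:i] + data[i + 1:]
        candidate = xor_blocks(others + [parity])
        trial = data[:i] + [candidate] + data[i + 1:]
        if checksum(trial) == expected:
            return trial, i
    raise ValueError("no single-disk substitution matches the checksum")

good = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(good)
expected = checksum(good)

damaged = [b"AAAA", b"B?BB", b"CCCC"]  # disk 1 silently returns bad data
repaired, bad_disk = recover_stripe(damaged, parity, expected)
print(repaired, bad_disk)
```

Without the independent checksum there would be no way to tell which of the candidate reconstructions is the right one.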
What is also interesting is that for 1-4-bit errors the default
checksum function can double as ECC, recovering the correct data even
if there is no replicated copy of it.
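
One simple way a checksum can act as ECC is brute force: flip candidate bits until the checksum matches again. The sketch below shows that idea for a single flipped bit only. It is an assumption-laden illustration, not how ZFS does it: SHA-256 stands in for the default checksum, and correct_single_bit is a hypothetical helper.

```python
import hashlib

def checksum(block):
    """Stand-in checksum; the real default algorithm is not modeled here."""
    return hashlib.sha256(block).digest()

def correct_single_bit(block, expected):
    """Return the block with one bit repaired, or None if that is not enough."""
    if checksum(block) == expected:
        return block
    buf = bytearray(block)
    for bit in range(len(buf) * 8):
        byte, mask = bit // 8, 1 << (bit % 8)
        buf[byte] ^= mask          # try flipping this bit
        if checksum(bytes(buf)) == expected:
            return bytes(buf)
        buf[byte] ^= mask          # no match, restore it
    return None

original = b"filesystem block"
expected = checksum(original)

one_bit = bytearray(original)
one_bit[3] ^= 0x10                 # single-bit corruption
fixed = correct_single_bit(bytes(one_bit), expected)

two_bits = bytearray(original)
two_bits[3] ^= 0x10
two_bits[7] ^= 0x01                # beyond what this sketch searches
unfixed = correct_single_bit(bytes(two_bits), expected)

print(fixed == original, unfixed)
```

Extending the search to more flipped bits grows the candidate count combinatorially, which is why this only makes sense for a handful of bit errors.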
> One of ZFS's big performance problems is that currently it only checksums
> the entire RAID stripe, so it always has to read every drive, and doesn't
> get RAID's IOPS advantage.
Or you could call this a drawback of Linux software RAID: it doesn't
detect bad parity before there is a second drive failure, at which point
the bad parity is used to reconstruct the data block incorrectly (which
will also go undetected because there is no checksum).
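
A toy illustration of that failure mode, assuming XOR parity and using SHA-256 as a stand-in per-block checksum (the names and block contents are hypothetical; this is not md or ZFS code):

```python
import hashlib
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (single-parity RAID)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]
stored_checksums = [hashlib.sha256(b).digest() for b in data]

# Parity has gone bad silently; normal reads never touch it, so
# nothing notices while all data disks are healthy.
bad_parity = b"\x00\x00\x00\x00"
parity_is_consistent = xor_blocks(data) == bad_parity

# Disk 1 now fails and its block is rebuilt from the rest plus parity.
rebuilt_1 = xor_blocks([data[0], data[2], bad_parity])

# Plain RAID hands this back as if it were the original data.
wrong_data_returned = rebuilt_1 != b"BBBB"

# With a stored checksum, the bad reconstruction is at least detected.
checksum_detects_it = hashlib.sha256(rebuilt_1).digest() != stored_checksums[1]

print(parity_is_consistent, wrong_data_returned, checksum_detects_it)
```

Reading every drive to verify the stripe is exactly the cost being complained about, but it is also what lets the mismatch be caught while the data is still recoverable.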
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.